GeoVAE Proto: The Rosetta Stone Experiments
Text carries geometric structure. This repo proves it.
Three lightweight VAEs project text embeddings from different encoders into geometric patch space, and a pretrained geometric analyzer reads the text-derived patches more clearly than actual images. The geometric differentiation is encoder-agnostic: it lives in the language itself.
The Hypothesis
If FLUX-generated images produce measurably differentiated geometric signatures across categories (lighting vs. jewelry vs. pose), and those images were generated from text prompts, then the text embeddings should contain enough structural information to produce the same geometric differentiation, without ever seeing an image.
The Experiment
Text Prompt → [Encoder] → 512/768d embedding → TextVAE → (8, 16, 16) patches → Geometric Analyzer → gates + patch features
Three encoders were tested against the same pipeline (a pooling sketch follows the table):
| Directory | Encoder | Dim | Pooling | Architecture |
|---|---|---|---|---|
| text_vae/ | flan-t5-small | 512 | mean pool | encoder-decoder |
| bert_vae/ | bert-base-uncased | 768 | [CLS] token | bidirectional MLM |
| beatrix_vae/ | bert-beatrix-2048 | 768 | mean pool | nomic_bert + categorical tokens |
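The Pooling column is what produces the (B, 512) / (B, 768) embeddings fed to the VAEs. Below is a minimal sketch of the two pooling strategies using Hugging Face transformers; the `google/flan-t5-small` and `bert-base-uncased` checkpoints are the public hubs, and loading bert-beatrix-2048 (also mean pooled, 768d) is omitted since its loading details are not covered in this README.

```python
import torch
from transformers import AutoModel, AutoTokenizer, T5EncoderModel

def mean_pool(hidden, attention_mask):
    """Mask-aware mean over token hidden states."""
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

prompts = ["character with dramatic rim lighting"]

# flan-t5-small: 512d embedding, mean pooled over the encoder states.
t5_tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-small")
batch = t5_tok(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    t5_emb = mean_pool(t5_enc(**batch).last_hidden_state, batch["attention_mask"])  # (B, 512)

# bert-base-uncased: 768d embedding, [CLS] token.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
batch = bert_tok(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    bert_emb = bert(**batch).last_hidden_state[:, 0]  # (B, 768)
```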
All three VAEs share the same architecture: encoder (text_dim → 1024 → 1024) → μ, σ (256d bottleneck) → decoder (256 → 1024 → 1024 → 2048) → reshape (8, 16, 16). Each is trained to reconstruct targets at the same scale as the adapted FLUX VAE latents from the earlier Image VAE experiments. ~4.5M parameters each.
The geometric analyzer is a pretrained SuperpositionPatchClassifier from AbstractPhil/grid-geometric-multishape (epoch 200), frozen during evaluation. It extracts gate vectors (64×17 explicit geometric properties) and patch features (64×256 learned representations) from any (8, 16, 16) input.
Dataset: 49,286 images from AbstractPhil/synthetic-characters (schnell_full_1_512), 15 generator_type categories.
Results
Overall Discriminability (within-category similarity − weighted between-category similarity)
| Representation | Image Path (49k) | T5 (512d) | BERT (768d) | Beatrix (768d) |
|---|---|---|---|---|
| patch_feat | +0.0198 | +0.0526 | +0.0534 | +0.0502 |
| gate_vectors | +0.0090 | +0.0311 | +0.0319 | +0.0302 |
| global_feat | +0.0084 | +0.0228 | +0.0219 | +0.0214 |
All three text paths produce 2.5–3.5× stronger geometric differentiation than the image path. All three encoders converge to within ±5% of each other.
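The score in the table header reads as mean within-category similarity minus a size-weighted mean of between-category similarities. The sketch below shows one plausible way to compute it from flattened features; the cosine-similarity choice and frequency weighting are assumptions for illustration, not the repo's evaluation code.

```python
import torch
import torch.nn.functional as F

def discriminability(features: torch.Tensor, labels: torch.Tensor) -> dict:
    """features: (N, D) flattened gate/patch/global vectors; labels: (N,) category ids."""
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.T                          # (N, N) pairwise cosine similarity
    scores = {}
    for c in labels.unique().tolist():
        in_c = labels == c
        block = sim[in_c][:, in_c]
        n = block.shape[0]
        within = (block.sum() - block.diag().sum()) / (n * (n - 1))   # exclude self-pairs
        out_total = (~in_c).sum()
        between = 0.0
        for o in labels.unique().tolist():
            if o == c:
                continue
            in_o = labels == o
            # Weight each other category's mean similarity by its share of the remaining samples.
            between = between + sim[in_c][:, in_o].mean() * (in_o.sum() / out_total)
        scores[c] = (within - between).item()
    return scores

# Overall discriminability as the mean over categories:
# overall = sum(scores.values()) / len(scores)
```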
Per-Category Discriminability (patch_feat)
| Category | Image | T5 | BERT | Beatrix |
|---|---|---|---|---|
| character_with_lighting | +0.051 | +0.145 | +0.093 | +0.069 |
| action_scene | +0.020 | +0.123 | +0.126 | +0.060 |
| character_with_jewelry | +0.048 | +0.072 | +0.107 | +0.121 |
| character_with_expression | +0.041 | +0.092 | +0.066 | +0.088 |
| character_in_scene | +0.014 | +0.081 | +0.062 | +0.089 |
| character_full_outfit | +0.025 | +0.080 | +0.088 | +0.054 |
| character_with_pose | +0.001 | +0.007 | -0.008 | +0.007 |
The broad category ranking is preserved across all paths: lighting and jewelry always differentiate well; pose never does. The geometric hierarchy is stable.
Key Findings
Text-derived patches are geometrically cleaner than image-derived patches. Language is already an abstraction: it carries structural intent without per-pixel noise. The geometric analyzer reads intent more clearly than observation.
The bridge is encoder-agnostic. Three architecturally different encoders (encoder-decoder T5, bidirectional BERT, categorical nomic_bert) produce the same discriminability through a 256d bottleneck. The geometric structure is in the text, not the encoder.
Categorical pretraining doesn't help overall. Beatrix, trained on 2B+ samples with explicit `<lighting>`, `<jewelry>`, `<pose>` tokens, matches generic BERT/T5 within ±5%. It wins on fine-grained object categories (jewelry +0.121 vs. BERT's +0.107) but loses on scene-level properties (lighting +0.069 vs. T5's +0.145).
The 256d bottleneck is the normalizer. It strips encoder-specific representational choices and preserves only the geometric signal that all encoders agree on.
Architecture
Each VAE (~4.5M params):
Encoder:    text_dim → Linear(1024) → LN → GELU → Dropout
                     → Linear(1024) → LN → GELU → Dropout
            1024 → μ (256d)
            1024 → log_var (256d)
Bottleneck: z = μ + ε·σ   (training)
            z = μ         (inference)
Decoder:    256 → Linear(1024) → LN → GELU → Dropout
                → Linear(1024) → LN → GELU → Dropout
                → Linear(2048)
            reshape → (8, 16, 16)
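A PyTorch sketch matching the layer listing above follows. The class name, method name, and dropout rate are illustrative assumptions, not the repo's actual model.py.

```python
import torch
import torch.nn as nn

class TextVAESketch(nn.Module):
    def __init__(self, text_dim: int = 512, hidden: int = 1024, z_dim: int = 256, dropout: float = 0.1):
        super().__init__()
        def block(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, d_out), nn.LayerNorm(d_out), nn.GELU(), nn.Dropout(dropout))
        self.encoder = nn.Sequential(block(text_dim, hidden), block(hidden, hidden))
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_log_var = nn.Linear(hidden, z_dim)
        self.decoder = nn.Sequential(block(z_dim, hidden), block(hidden, hidden), nn.Linear(hidden, 8 * 16 * 16))

    def forward(self, text_emb: torch.Tensor):
        h = self.encoder(text_emb)
        mu, log_var = self.to_mu(h), self.to_log_var(h)
        # Reparameterization: z = mu + eps * sigma during training, z = mu at inference.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var) if self.training else mu
        patches = self.decoder(z).view(-1, 8, 16, 16)
        return patches, mu, log_var

    @torch.no_grad()
    def generate_latent(self, text_emb: torch.Tensor) -> torch.Tensor:
        self.eval()
        return self.forward(text_emb)[0]     # (B, 8, 16, 16)
```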
Training: MSE reconstruction + KL divergence (weight 1e-4), AdamW 1e-3, cosine schedule, 50 epochs, batch 512.
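A hedged sketch of that objective, reusing TextVAESketch from the sketch above: MSE reconstruction plus KL weighted by 1e-4, AdamW at lr 1e-3, and a cosine schedule over 50 epochs. The target patches (the FLUX-VAE-derived latents) are supplied by the caller and not shown here.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, log_var, kl_weight: float = 1e-4):
    recon_loss = F.mse_loss(recon, target)
    # KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl_weight * kl

vae = TextVAESketch(text_dim=512)
optimizer = torch.optim.AdamW(vae.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)  # stepped once per epoch
```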
Usage
import torch
from model import TextVAE   # or BertVAE, BeatrixVAE

# Load a trained VAE
vae = TextVAE(text_dim=512)                      # 768 for BERT/Beatrix
ckpt = torch.load("best_model.pt")
vae.load_state_dict(ckpt["model_state_dict"])
vae.eval()

# Text → geometric patches
text_embedding = your_encoder(prompt)            # (B, 512) for T5, (B, 768) for BERT/Beatrix
patches = vae.generate_latent(text_embedding)    # (B, 8, 16, 16)

# Feed the patches to the frozen geometric analyzer
geo_output = geometric_model(patches)
gates = geo_output["local_dim_logits"]           # explicit geometric properties (gate vectors)
features = geo_output["patch_features"]          # learned patch representations
Implications
Geometric structure is a shared language that text and images both speak natively. This repo demonstrates the decoder: a lightweight VAE that translates any text encoder's output into the geometric alphabet. The next step is using this bridge for conditionable geometric descriptors in diffusion processes: text descriptions that steer generation through geometric constraints rather than CLIP alignment.
File Structure
geovae-proto/
├── text_vae/            # flan-t5-small (512d)
│   ├── model.py         # TextVAE architecture
│   ├── train.py         # Extract + train + analyze
│   └── push.py          # Upload to HF
├── bert_vae/            # bert-base-uncased (768d)
│   ├── model.py
│   ├── train.py
│   └── push.py
└── beatrix_vae/         # bert-beatrix-2048 (768d)
    ├── model.py
    ├── train.py
    └── push.py
Citation
Part of the geometric deep learning research by AbstractPhil. Built on the geometric analyzer from grid-geometric-multishape and the synthetic-characters dataset.