GeoVAE Proto β€” The Rosetta Stone Experiments

Text carries geometric structure. This repo proves it.

Three lightweight VAEs project text embeddings from different encoders into geometric patch space — and a pretrained geometric analyzer reads the text-derived patches more clearly than actual images. The geometric differentiation is encoder-agnostic: it lives in the language itself.

The Hypothesis

If FLUX-generated images produce measurably differentiated geometric signatures across categories (lighting vs. jewelry vs. pose), and those images were generated from text prompts, then the text embeddings should contain enough structural information to produce the same geometric differentiation — without ever seeing an image.

The Experiment

Text Prompt → [Encoder] → 512/768d embedding → TextVAE → (8, 16, 16) patches → Geometric Analyzer → gates + patch features

Three encoders tested against the same pipeline:

| Directory | Encoder | Dim | Pooling | Architecture |
|---|---|---|---|---|
| text_vae/ | flan-t5-small | 512 | mean pool | encoder-decoder |
| bert_vae/ | bert-base-uncased | 768 | [CLS] token | bidirectional MLM |
| beatrix_vae/ | bert-beatrix-2048 | 768 | mean pool | nomic_bert + categorical tokens |
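The two pooling strategies in the table reduce a per-token encoder output to a single sentence vector. A minimal numpy illustration (not the repo's actual extraction code):

```python
import numpy as np

def mean_pool(states, mask):
    """Masked mean over tokens: (B, T, D) states, (B, T) attention mask -> (B, D).
    Padding tokens (mask == 0) are excluded from the average."""
    m = mask[..., None].astype(states.dtype)
    return (states * m).sum(axis=1) / m.sum(axis=1)

def cls_pool(states):
    """BERT-style pooling: take the first ([CLS]) token's state, (B, T, D) -> (B, D)."""
    return states[:, 0]
```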

Each VAE shares an identical architecture: encoder (text_dim → 1024 → 1024) → μ, σ (256d bottleneck) → decoder (256 → 1024 → 1024 → 2048) → reshape to (8, 16, 16). Each is trained to reconstruct targets at the same scale as the adapted FLUX VAE latents from the earlier Image VAE experiments. ~4.5M parameters each.

The geometric analyzer is a pretrained SuperpositionPatchClassifier from AbstractPhil/grid-geometric-multishape (epoch 200), frozen during evaluation. It extracts gate vectors (64×17 explicit geometric properties) and patch features (64×256 learned representations) from any (8, 16, 16) input.

Dataset: 49,286 images from AbstractPhil/synthetic-characters (schnell_full_1_512), 15 generator_type categories.

Results

Overall Discriminability (within-category similarity − weighted between-category similarity)

| Representation | Image Path (49k) | T5 (512d) | BERT (768d) | Beatrix (768d) |
|---|---|---|---|---|
| patch_feat | +0.0198 | +0.0526 | +0.0534 | +0.0502 |
| gate_vectors | +0.0090 | +0.0311 | +0.0319 | +0.0302 |
| global_feat | +0.0084 | +0.0228 | +0.0219 | +0.0214 |

All three text paths produce 2.5–3.5× stronger geometric differentiation than the image path. All three encoders converge to within ±5% of one another.
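The discriminability score can be sketched as follows: within-category mean cosine similarity minus the size-weighted mean between-category similarity. This is one common formulation; the repo's exact weighting scheme is an assumption here.

```python
import numpy as np

def discriminability(feats, labels):
    """Within-category cosine similarity minus size-weighted between-category
    similarity. Requires at least two samples per category."""
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T                        # pairwise cosine similarities
    within, between, sizes = [], [], []
    for c in np.unique(labels):
        m = labels == c
        n = int(m.sum())
        s_in = sim[np.ix_(m, m)]
        # exclude self-similarity on the diagonal
        within.append((s_in.sum() - np.trace(s_in)) / (n * (n - 1)))
        between.append(sim[np.ix_(m, ~m)].mean())
        sizes.append(n)
    w = np.asarray(sizes) / np.sum(sizes)        # weight categories by size
    return float(w @ np.asarray(within) - w @ np.asarray(between))
```

Orthogonal clusters score +1; fully overlapping clusters score 0, so a positive value indicates that categories are geometrically separable.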

Per-Category Discriminability (patch_feat)

| Category | Image | T5 | BERT | Beatrix |
|---|---|---|---|---|
| character_with_lighting | +0.051 | +0.145 | +0.093 | +0.069 |
| action_scene | +0.020 | +0.123 | +0.126 | +0.060 |
| character_with_jewelry | +0.048 | +0.072 | +0.107 | +0.121 |
| character_with_expression | +0.041 | +0.092 | +0.066 | +0.088 |
| character_in_scene | +0.014 | +0.081 | +0.062 | +0.089 |
| character_full_outfit | +0.025 | +0.080 | +0.088 | +0.054 |
| character_with_pose | +0.001 | +0.007 | -0.008 | +0.007 |

Category ranking is preserved across all paths. Lighting and jewelry always differentiate well; pose never does. The geometric hierarchy is stable.

Key Findings

  1. Text-derived patches are geometrically cleaner than image-derived patches. Language is already an abstraction — it carries structural intent without per-pixel noise. The geometric analyzer reads intent more clearly than observation.

  2. The bridge is encoder-agnostic. Three architecturally different encoders (encoder-decoder T5, bidirectional BERT, categorical nomic_bert) produce the same discriminability through a 256d bottleneck. The geometric structure is in the text, not the encoder.

  3. Categorical pretraining doesn't help overall. Beatrix, trained on 2B+ samples with explicit <lighting>, <jewelry>, <pose> tokens, matches generic BERT/T5 within ±5%. It wins on fine-grained object categories (jewelry +0.121 vs BERT +0.107) but loses on scene-level properties (lighting +0.069 vs T5 +0.145).

  4. The 256d bottleneck is the normalizer. It strips encoder-specific representational choices and preserves only the geometric signal that all encoders agree on.

Architecture

Each VAE (~4.5M params):

```
Encoder:  text_dim → Linear(1024) → LN → GELU → Dropout
                   → Linear(1024) → LN → GELU → Dropout
          1024 → μ (256d)
          1024 → log_var (256d)

Bottleneck: z = μ + ε·σ  (training)
            z = μ        (inference)

Decoder:  256 → Linear(1024) → LN → GELU → Dropout
              → Linear(1024) → LN → GELU → Dropout
              → Linear(2048)
          reshape → (8, 16, 16)
```

Training: MSE reconstruction + KL divergence (weight 1e-4), AdamW 1e-3, cosine schedule, 50 epochs, batch 512.
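The architecture and training objective above can be sketched in PyTorch. This is an illustrative sketch, not the repo's model.py: the dropout rate and the loss reduction are assumptions, and the sketch's exact parameter count differs slightly from the stated ~4.5M.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVAE(nn.Module):
    """Sketch of the VAE described above (layer sizes from the README;
    dropout probability is an assumption)."""
    def __init__(self, text_dim=512, hidden=1024, z_dim=256, p=0.1):
        super().__init__()
        def block(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, d_out),
                                 nn.LayerNorm(d_out), nn.GELU(), nn.Dropout(p))
        self.encoder = nn.Sequential(block(text_dim, hidden), block(hidden, hidden))
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        self.decoder = nn.Sequential(block(z_dim, hidden), block(hidden, hidden),
                                     nn.Linear(hidden, 8 * 16 * 16))

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.to_mu(h), self.to_logvar(h)
        # reparameterization at train time, deterministic mean at inference
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var) if self.training else mu
        return self.decoder(z).view(-1, 8, 16, 16), mu, log_var

    @torch.no_grad()
    def generate_latent(self, x):
        self.eval()
        return self(x)[0]

def vae_loss(recon, target, mu, log_var, kl_weight=1e-4):
    """MSE reconstruction plus KL divergence to N(0, I), weighted 1e-4."""
    mse = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return mse + kl_weight * kl
```

The optimizer setup (AdamW at 1e-3 with a cosine schedule) plugs in as usual via `torch.optim.AdamW` and `CosineAnnealingLR`.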

Usage

```python
import torch

from model import TextVAE  # or BertVAE, BeatrixVAE

# Load trained VAE
vae = TextVAE(text_dim=512)  # 768 for BERT/Beatrix
ckpt = torch.load("best_model.pt")
vae.load_state_dict(ckpt["model_state_dict"])

# Text → geometric patches
text_embedding = your_encoder(prompt)          # (B, 512/768)
patches = vae.generate_latent(text_embedding)  # (B, 8, 16, 16)

# Feed to geometric analyzer
geo_output = geometric_model(patches)
gates = geo_output["local_dim_logits"]     # geometric properties
features = geo_output["patch_features"]    # learned representations
```

Implications

Geometric structure is a shared language that text and images both speak natively. This repo demonstrates the decoder — a lightweight VAE that translates any text encoder's output into the geometric alphabet. The next step is using this bridge for conditionable geometric descriptors in diffusion processes: text descriptions that steer generation through geometric constraints rather than CLIP alignment.

File Structure

```
geovae-proto/
├── text_vae/          # flan-t5-small (512d)
│   ├── model.py       # TextVAE architecture
│   ├── train.py       # Extract + train + analyze
│   └── push.py        # Upload to HF
├── bert_vae/          # bert-base-uncased (768d)
│   ├── model.py
│   ├── train.py
│   └── push.py
└── beatrix_vae/       # bert-beatrix-2048 (768d)
    ├── model.py
    ├── train.py
    └── push.py
```

Citation

Part of the geometric deep learning research by AbstractPhil. Built on the geometric analyzer from grid-geometric-multishape and the synthetic-characters dataset.
