data-archetype/dinac_ae
DINAC-AE is a DINO-Aligned Class-token AutoEncoder. It follows the SemDisDiffAE family: patch-16 spatial latents, a VP diffusion decoder, and DINO-aligned representations.
Relative to SemDisDiffAE, DINAC-AE changes the encoder from FCDM blocks to a
6-block ViT/DiT-style transformer encoder and uses DINOv3 ViT-B/16 alignment.
The latent-to-DINO alignment head is extended to predict the DINO class token as well as the patch tokens, and `predict_class(latents)` exposes that class-token feature directly from the latents.
2k PSNR Benchmark
| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---|---|---|---|---|
| dinac_ae | 35.19 | 4.53 | 35.06 | 28.02 | 42.43 |
| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 28.89 | 43.63 |
Evaluated on 2000 validation images.
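For reference, a minimal sketch of the PSNR metric behind the table above, for images in [-1, 1]; the per-image averaging and any preprocessing used in the released benchmark are assumptions, not taken from its evaluation code.

```python
import torch

def psnr_db(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-image PSNR in dB for batched images in [-1, 1].

    Assumes PSNR is computed against the full [-1, 1] dynamic range
    (peak-to-peak of 2.0); the released benchmark's protocol may differ.
    """
    mse = (recon - target).pow(2).flatten(1).mean(dim=1)  # [B]
    peak = 2.0  # dynamic range of images in [-1, 1]
    return 10.0 * torch.log10(peak ** 2 / mse.clamp_min(1e-12))
```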
DINAC-AE targets a balance between high reconstruction quality, a learnable latent space with KL-like variance expansion, DINOv3 alignment, and robustness to local token errors.
Results viewer
The results viewer shows the 39-image reconstruction set with DINAC-AE and FLUX.2 VAE reconstructions, RGB differences, and latent PCA.
A recheck of the released export on that 39-image set gives a mean PSNR of 35.15 dB (min 25.73 dB, max 45.99 dB).
Encode Throughput
Measured on an NVIDIA GeForce RTX 5090 in bfloat16, averaged over repeated batches at each resolution.
| Resolution | Batch Size | dinac_ae encode (ms/batch) | FLUX.2 encode (ms/batch) | dinac_ae peak VRAM (MiB) | FLUX.2 peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---|---|---|---|---|---|---|---|
| 256x256 | 128 | 50 | 383 | 1,637 | 12,511 | 7.62x | 86.9% |
| 512x512 | 32 | 53 | 354 | 1,639 | 12,511 | 6.72x | 86.9% |
The transformer encoder is slightly slower and larger than the full_capacitor FCDM encoder, but remains much faster and much smaller than the FLUX.2 VAE encoder.
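A minimal sketch of the kind of measurement loop behind these numbers, using CUDA events for timing and peak-memory stats for VRAM; the warm-up count, repeat count, and random inputs are assumptions, not the released benchmark script.

```python
import torch

def bench_encode(model, batch: torch.Tensor, warmup: int = 5, iters: int = 20):
    """Average encode latency (ms/batch) and peak VRAM (MiB) on CUDA."""
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.inference_mode():
        for _ in range(warmup):       # warm-up runs excluded from timing
            model.encode(batch)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model.encode(batch)
        end.record()
        torch.cuda.synchronize()
    ms_per_batch = start.elapsed_time(end) / iters
    peak_mib = torch.cuda.max_memory_allocated() / (1024 ** 2)
    return ms_per_batch, peak_mib

# e.g. bench_encode(model, torch.randn(128, 3, 256, 256, device="cuda", dtype=torch.bfloat16))
# corresponds to the 256x256 / batch-size-128 row above.
```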
Latent Interface
- `encode()` returns DINAC-AE's own whitened latent space.
- `decode()` expects that same whitened latent space and dewhitens internally.
- `predict_class()` expects the same whitened latent space, dewhitens internally, and predicts a DINOv3 ViT-B/16 class-token feature.
- `whiten()` and `dewhiten()` are exposed for explicit control.
- `encode_posterior()` returns the raw exported posterior before whitening.
- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly: `num_steps=1` means one NFE.
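A minimal sketch of how these calls compose, assuming `whiten()` and `dewhiten()` take and return latent tensors directly; the structure of the object returned by `encode_posterior()` is not specified here, so it is only shown being produced.

```python
import torch
from dinac_ae import DinacAE

model = DinacAE.from_pretrained("data-archetype/dinac_ae", device="cuda", dtype=torch.bfloat16)
image = torch.rand(1, 3, 256, 256, device="cuda", dtype=torch.bfloat16) * 2 - 1  # dummy input in [-1, 1]

with torch.inference_mode():
    latents = model.encode(image)               # whitened latent space
    raw = model.dewhiten(latents)               # explicit dewhitening
    roundtrip = model.whiten(raw)               # should match `latents` up to precision
    posterior = model.encode_posterior(image)   # raw exported posterior, before whitening
    class_token = model.predict_class(latents)  # expects whitened latents; dewhitens internally
```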
The export ships weights in float32. The recommended and default runtime path
is bfloat16 AMP for the main encoder, decoder, and class-token path, with
float32 retained for sensitive operations such as whitening/dewhitening,
normalization math, RoPE frequency construction, and VP diffusion schedule
helpers.
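That split follows the standard autocast idiom: run the bulk of the network under bfloat16 autocast and drop to float32 for numerically sensitive math. A generic sketch of the idiom (not the export's internal code):

```python
import torch

x = torch.randn(8, 256, 128, device="cuda")
weight = torch.randn(128, 128, device="cuda")

with torch.autocast("cuda", dtype=torch.bfloat16):
    h = x @ weight                               # main compute runs in bfloat16
    with torch.autocast("cuda", enabled=False):
        h32 = h.float()                          # sensitive math kept in float32,
        mean = h32.mean(dim=-1, keepdim=True)    # e.g. whitening / normalization
        std = h32.std(dim=-1, keepdim=True)
        h = (h32 - mean) / (std + 1e-6)
```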
Usage
import torch
from dinac_ae import DinacAE, DinacAEInferenceConfig

device = "cuda"
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

with torch.inference_mode():
    # Encode to the whitened latent space, predict the DINO class-token
    # feature, and decode back to pixels with a single decoder evaluation.
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    class_token = model.predict_class(latents)
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )
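To inspect the result, the reconstruction can be converted back to an 8-bit image; this assumes `recon` comes back as [1, 3, H, W] in [-1, 1], matching the input convention.

```python
from PIL import Image

# Map [-1, 1] -> [0, 255] and save; assumes `recon` from the snippet above.
recon_01 = recon.float().clamp(-1.0, 1.0).add(1.0).div(2.0)
array = (recon_01[0].permute(1, 2, 0) * 255.0).round().byte().cpu().numpy()
Image.fromarray(array).save("recon.png")
```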
Details
- DINAC-AE uses a 6-block ViT/DiT-style transformer encoder and an 8-block FCDM decoder.
- Patch size is 16, model width is 896, and latent width is 128.
- The DINO alignment head predicts spatial patch tokens and is extended with a class-token output in DINOv3 ViT-B/16 feature space.
- The class-token output is used to improve semantic organization of the latent space and to support FD-loss / Representation Fréchet Distance objectives directly in latent space.
- `predict_class(latents)` reaches a mean cosine similarity of 0.757458 against the frozen DINOv3 ViT-B/16 teacher class token on the same 2000 images (see the metric sketch after this list).
- DINO alignment is applied directly to clean latent tokens; robustness to local token errors is handled by random-token logSNR offset regularization.
- Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae-results
- Related: SemDisDiffAE, full_capacitor, capacitor_decoder
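As a reference for the cosine-similarity number above, a minimal sketch of the metric itself; obtaining the frozen DINOv3 ViT-B/16 teacher class tokens is not shown, and the averaging protocol is an assumption.

```python
import torch
import torch.nn.functional as F

def mean_class_token_cosine(predicted: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between predicted and teacher class tokens.

    predicted: [N, D] class-token features from predict_class(latents)
    teacher:   [N, D] class tokens from the frozen DINOv3 ViT-B/16 teacher
    """
    return F.cosine_similarity(predicted, teacher, dim=-1).mean()
```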
Citation
@misc{dinac_ae,
  title  = {DINAC-AE: a DINO-aligned class-token diffusion autoencoder},
  author = {data-archetype},
  email  = {data-archetype@proton.me},
  year   = {2026},
  month  = may,
  url    = {https://huggingface.co/data-archetype/dinac_ae},
}