data-archetype/dinac_ae

DINAC-AE is a DINO-Aligned Class-token AutoEncoder. It follows the SemDisDiffAE family: patch-16 spatial latents, a VP diffusion decoder, and DINO-aligned representations.

Relative to SemDisDiffAE, DINAC-AE changes the encoder from FCDM blocks to a 6-block ViT/DiT-style transformer encoder and uses DINOv3 ViT-B/16 alignment. The latent-to-DINO alignment head is extended to predict the DINO class token as well as patch tokens. predict_class(latents) exposes that class-token feature directly from latents.

2k PSNR Benchmark

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---|---|---|---|---|
| dinac_ae | 35.19 | 4.53 | 35.06 | 28.02 | 42.43 |
| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 28.89 | 43.63 |

Evaluated on 2000 validation images.
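
For reference, a minimal sketch of the per-image PSNR computation these statistics summarize, assuming reconstructions and targets are compared in [0, 1] (the exact evaluation pipeline is not reproduced here):

import torch

def psnr_db(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # recon, target: [N, 3, H, W] in [0, 1]; returns one PSNR value per image in dB
    mse = (recon - target).pow(2).mean(dim=(1, 2, 3)).clamp_min(1e-12)
    return 10.0 * torch.log10(1.0 / mse)

The table reports the mean, standard deviation, median, 5th percentile (P5), and 95th percentile (P95) of these per-image values.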

DINAC-AE targets a balance of high reconstruction quality, a learnable latent space with KL-like variance expansion, DINOv3 alignment, and robustness to local token errors.

The results viewer shows the 39-image reconstruction set with DINAC-AE and FLUX.2 VAE reconstructions, RGB differences, and latent PCA. Rechecking the released export on that 39-image set gives a mean PSNR of 35.15 dB (min 25.73, max 45.99).

Full technical report

Encode Throughput

Measured on an NVIDIA GeForce RTX 5090 in bfloat16, averaging over repeated batches at each resolution.

| Resolution | Batch Size | dinac_ae encode (ms/batch) | FLUX.2 encode (ms/batch) | dinac_ae peak VRAM (MiB) | FLUX.2 peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---|---|---|---|---|---|---|---|
| 256x256 | 128 | 50 | 383 | 1,637 | 12,511 | 7.62x | 86.9% |
| 512x512 | 32 | 53 | 354 | 1,639 | 12,511 | 6.72x | 86.9% |

The transformer encoder is slightly slower and larger than the full_capacitor FCDM encoder, but remains much faster and much smaller than the FLUX.2 VAE encoder.
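
A minimal sketch of the kind of timing loop used for numbers like these (warm-up, device synchronization, repeated batches, peak allocator memory); the exact benchmark script is not reproduced here:

import time
import torch

def benchmark_encode(model, batch, iters=20, warmup=5):
    # batch: [B, 3, H, W] on CUDA, e.g. 128 images at 256x256 in bfloat16
    torch.cuda.reset_peak_memory_stats()
    with torch.inference_mode():
        for _ in range(warmup):
            model.encode(batch)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model.encode(batch)
        torch.cuda.synchronize()
    ms_per_batch = (time.perf_counter() - start) / iters * 1000.0
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    return ms_per_batch, peak_mib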

Latent Interface

  • encode() returns DINAC-AE's own whitened latent space.
  • decode() expects that same whitened latent space and dewhitens internally.
  • predict_class() expects the same whitened latent space, dewhitens internally, and predicts a DINOv3 ViT-B/16 class-token feature.
  • whiten() and dewhiten() are exposed for explicit control.
  • encode_posterior() returns the raw exported posterior before whitening.
  • DinacAEInferenceConfig.num_steps counts decoder evaluations directly: num_steps=1 means one NFE.
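
A short sketch of how the less common entry points fit together, assuming model and image are already set up as in the Usage section below; whiten(), dewhiten(), and encode_posterior() are named above, but their exact signatures are assumptions here:

import torch

with torch.inference_mode():
    posterior = model.encode_posterior(image)  # raw exported posterior, before whitening
    latents = model.encode(image)              # whitened latents, as consumed by decode()

    # Explicit whitening control: dewhiten() then whiten() should approximately
    # round-trip back to the same whitened latents.
    raw = model.dewhiten(latents)
    rewhitened = model.whiten(raw)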

The export ships weights in float32. The recommended and default runtime path is bfloat16 AMP for the main encoder, decoder, and class-token path, with float32 retained for sensitive operations such as whitening/dewhitening, normalization math, RoPE frequency construction, and VP diffusion schedule helpers.
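
As a rough illustration of that split (the general PyTorch pattern the card describes, not the model's internal code), compute-heavy layers run under bfloat16 autocast while sensitive elementwise math stays in float32:

import torch

proj = torch.nn.Linear(128, 128, device="cuda")  # float32 master weights
mean = torch.zeros(128, device="cuda")           # hypothetical whitening statistics
std = torch.ones(128, device="cuda")

tokens = torch.randn(4, 256, 128, device="cuda")
with torch.autocast("cuda", dtype=torch.bfloat16):
    hidden = proj(tokens)                        # matmul executes in bfloat16
whitened = (hidden.float() - mean) / std         # whitening math kept in float32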

Usage

import torch

from dinac_ae import DinacAE, DinacAEInferenceConfig


device = "cuda"
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

with torch.inference_mode():
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    class_token = model.predict_class(latents)
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )

Details

  • DINAC-AE uses a 6-block ViT/DiT-style transformer encoder and an 8-block FCDM decoder.
  • Patch size is 16, model width is 896, and latent width is 128.
  • The DINO alignment head predicts spatial patch tokens and is extended with a class-token output in DINOv3 ViT-B/16 feature space.
  • The class-token output is used to improve semantic organization of the latent space and to support FD-loss / Representation Fréchet Distance objectives directly in latent space.
  • predict_class(latents) reaches a mean cosine similarity of 0.757458 against the frozen DINOv3 ViT-B/16 teacher class token on the same 2000 images; a sketch of this check follows the list.
  • DINO alignment is applied directly to clean latent tokens. Robustness to local token errors is handled by random-token logSNR offset regularization.
  • Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae-results
  • Related: SemDisDiffAE, full_capacitor, capacitor_decoder
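
A minimal sketch of that cosine-similarity check, assuming the teacher class tokens have already been extracted with a frozen DINOv3 ViT-B/16 (teacher loading is out of scope here):

import torch
import torch.nn.functional as F

def mean_class_token_cosine(predicted: torch.Tensor, teacher: torch.Tensor) -> float:
    # predicted: [N, D] class tokens from model.predict_class(latents)
    # teacher:   [N, D] class tokens from the frozen DINOv3 ViT-B/16 on the same images
    return F.cosine_similarity(predicted, teacher, dim=-1).mean().item()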

Citation

@misc{dinac_ae,
  title   = {DINAC-AE: a DINO-aligned class-token diffusion autoencoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = may,
  url     = {https://huggingface.co/data-archetype/dinac_ae},
}