Papers
arxiv:2604.08546

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Published on Apr 9 · Submitted by Dingkang Liang on Apr 10
#2 Paper of the day
AI-generated summary

NUMINA enhances text-to-video diffusion models' numerical accuracy through a training-free framework that identifies layout inconsistencies and guides regeneration via attention modulation.

Abstract

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle to generate the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while temporal consistency is maintained. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.

Community

Paper author · Paper submitter

NUMINA is a training-free framework that tackles numerical misalignment in text-to-video diffusion models — the persistent failure of T2V models to generate the count of objects specified in the prompt (e.g., producing 2 or 4 cats when "three cats" is requested). Unlike seed search or prompt enhancement approaches, which treat the generation pipeline as a black box and rely on brute-force resampling or LLM-based prompt rewriting, NUMINA directly identifies where and why counting errors occur inside the model by analyzing cross-attention and self-attention maps at selected DiT layers. It constructs a countable spatial layout via a two-stage clustering pipeline, then performs layout-guided attention modulation during regeneration to enforce the correct object count — all without retraining or fine-tuning. On our introduced CountBench, this attention-level intervention provides principled, interpretable control over numerical semantics that seed search and prompt enhancement fundamentally cannot achieve, and improves counting accuracy by up to 7.4% on Wan2.1-1.3B. Furthermore, because NUMINA is largely orthogonal to inference acceleration techniques, it is compatible with training-free caching methods such as EasyCache, which accelerates diffusion inference via runtime-adaptive transformer output reuse.
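To make the "layout-guided attention modulation" step concrete, here is a minimal sketch of one plausible mechanism: biasing the raw cross-attention scores between latent patches and the counted noun token toward a target layout mask before the softmax. This is my own illustration under stated assumptions, not the authors' implementation — the function name, bias scheme, and `strength` parameter are hypothetical.

```python
# Hedged sketch: layout-guided cross-attention modulation (illustrative, not the
# paper's code). Patches inside the refined countable layout get a positive bias
# on the counted noun token, patches outside get a negative one, steering
# probability mass toward the desired instance regions before the softmax.
import numpy as np

def modulate_cross_attention(logits, layout_mask, noun_token_idx, strength=2.0):
    """logits: (num_patches, num_text_tokens) raw cross-attention scores.
    layout_mask: (num_patches,) binary mask of the refined countable layout.
    Returns softmax attention with the noun token's scores biased toward the mask."""
    biased = logits.copy()
    # +strength inside the layout, -strength outside, only for the noun token.
    biased[:, noun_token_idx] += strength * (2.0 * layout_mask - 1.0)
    # Row-wise softmax over text tokens.
    e = np.exp(biased - biased.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy check: 4 patches, 3 text tokens; patches 0-1 lie inside the layout.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
mask = np.array([1.0, 1.0, 0.0, 0.0])
attn = modulate_cross_attention(logits, mask, noun_token_idx=1)
```

The appeal of an intervention at this level is that it changes where the model attends without touching any weights, which is what makes the approach training-free.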

the most interesting nugget for me is how NUMINA picks a per-token instance-separable self-attention head and a text-aligned cross-attention head to build a countable latent layout. that per-token head choice, then fusing maps into a foreground layout and refining by adding or removing instances with a decaying reg score, feels like a smart way to inject a hard count constraint without retraining. i’d be curious how robust that per-token head selection is when prompts use ambiguous nouns or heavy occlusion, or when objects are densely packed and attention becomes diffuse. btw the arxivlens breakdown helped me parse the method details, and it’s nice to see a concise walkthrough that lines up with what they describe here (https://arxivlens.com/PaperView/Details/when-numbers-speak-aligning-textual-numerals-and-visual-instances-in-text-to-video-diffusion-models-6669-5788ebc6).
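the per-token layout construction discussed above can be caricatured in a few lines: threshold a fused attention foreground map, then cluster the surviving patch coordinates into the prompt's target number of instance groups. this is a toy sketch (plain Lloyd k-means over patch coordinates), not the paper's two-stage clustering pipeline; the function name and thresholding are my own assumptions.

```python
# Hedged sketch (my own illustration, not the paper's implementation): turn a
# fused self-/cross-attention foreground map into a countable layout by
# thresholding foreground patches and clustering their coordinates into k
# instance groups with a tiny k-means.
import numpy as np

def countable_layout(attn_map, k, threshold=0.5, iters=10, seed=0):
    """attn_map: (H, W) fused attention foreground map with values in [0, 1].
    k: target instance count from the prompt's numeral.
    Returns an (H, W) integer label map: 0 = background, 1..k = instances."""
    ys, xs = np.nonzero(attn_map > threshold)
    pts = np.stack([ys, xs], axis=1).astype(float)
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):  # standard Lloyd iterations
        d = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = pts[labels == j].mean(axis=0)
    out = np.zeros(attn_map.shape, dtype=int)
    out[ys, xs] = labels + 1
    return out

# Toy map with two bright blobs, asking for k=2 instances.
m = np.zeros((8, 8))
m[1:3, 1:3] = 1.0
m[5:7, 5:7] = 1.0
layout = countable_layout(m, k=2)
```

the failure modes i'd worry about (diffuse attention, dense packing, heavy occlusion) show up exactly here: when the thresholded foreground stops forming well-separated groups, any clustering-based count becomes unstable.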


Get this paper in your agent:

hf papers read 2604.08546
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
