Abstract
Cola DLM is a hierarchical latent diffusion language model that combines text-to-latent mapping, global semantic prior modeling, and conditional decoding to achieve efficient text generation with a flexible non-autoregressive inductive bias.
Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.
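To make the three-stage decomposition above concrete, here is a minimal sketch of how such a pipeline could be wired together. All module names, shapes, and hyperparameters are illustrative assumptions on our part, not the authors' implementation; in particular, the prior is shown as a plain block-causal transformer standing in for the actual DiT denoiser, and the diffusion schedule and training details are omitted.

```python
# Minimal, hypothetical sketch of the Cola DLM decomposition described above:
# a Text VAE maps token blocks to continuous latents, a block-causal module
# models the latent prior, and a conditional decoder realizes tokens.
import torch
import torch.nn as nn

class TextVAEEncoder(nn.Module):
    """Stage 1 (assumed): map each block of tokens to one continuous latent."""
    def __init__(self, vocab_size=32000, d_model=512, d_latent=256, block_size=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block_size = block_size
        self.to_latent = nn.Linear(d_model * block_size, 2 * d_latent)  # mean and log-variance

    def forward(self, tokens):                       # tokens: (B, L), L divisible by block_size
        x = self.embed(tokens)                       # (B, L, d_model)
        b, l, d = x.shape
        x = x.view(b, l // self.block_size, self.block_size * d)
        mu, logvar = self.to_latent(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return z, mu, logvar                         # z: (B, num_blocks, d_latent)

class BlockCausalPrior(nn.Module):
    """Stage 2 (assumed): transformer over latent blocks with a causal mask,
    a stand-in for the block-causal DiT that models the latent prior."""
    def __init__(self, d_latent=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_latent, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)

    def forward(self, z):
        mask = nn.Transformer.generate_square_subsequent_mask(z.size(1))
        return self.backbone(z, mask=mask)

class ConditionalDecoder(nn.Module):
    """Stage 3 (assumed): expand each latent block back into token logits."""
    def __init__(self, vocab_size=32000, d_latent=256, block_size=4):
        super().__init__()
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.to_logits = nn.Linear(d_latent, block_size * vocab_size)

    def forward(self, z):                            # z: (B, num_blocks, d_latent)
        b, n, _ = z.shape
        return self.to_logits(z).view(b, n * self.block_size, self.vocab_size)
```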
Community
the most interesting bit to me is the two-stage setup: a Text VAE to fix a stable text-to-latent mapping, then a block-causal diffusion transformer to model the latent prior. but i worry a bit about posterior collapse in the VAE and how the KL term plus the BERT-style objectives balance fidelity vs compression, especially since the diffusion then operates in that latent space. an ablation on the latent block size and on gradient-stabilization steps would be really telling; i bet the granularity of blocks is not just a hyperparameter but a structural bottleneck. the arxivlens breakdown helped me parse the method details, btw, if you want a quick mental map: https://arxivlens.com/PaperView/Details/continuous-latent-diffusion-language-model-9852-239cae7d. curious how this holds up with longer prompts or multilingual data where the semantic prior might need to reorganize more aggressively.
Thanks a lot for the thoughtful comment! We fully agree that the two-stage design is one of the most critical parts of the method. The Text VAE is not just used as a preprocessing module, but is meant to establish a stable text-to-latent interface before the latent prior is learned. Regarding posterior collapse and the balance between reconstruction, KL regularization, and the BERT-style objective, one interesting observation we found is that text reconstruction itself is actually a relatively easy task in this setup: the reconstruction accuracy can quickly approach nearly 100%. This suggests that the latent representation space has a large degree of flexibility, and that there is still substantial room to study how this space should be organized, compressed, and made more semantically meaningful. In this sense, the VAE stage is not only about preserving fidelity, but also about shaping a useful latent carrier for subsequent prior modeling.
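To make the fidelity-vs-compression trade-off concrete, here is a hedged sketch of how the reconstruction, KL, and BERT-style terms could be combined into a single Text VAE objective. The weights `beta` and `lambda_mlm` are illustrative placeholders, not values from the paper.

```python
# Hypothetical combination of the three Text VAE losses discussed above.
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, target_tokens, mu, logvar,
             mlm_logits=None, mlm_targets=None,
             beta=0.1, lambda_mlm=1.0):
    # Token reconstruction from the decoded latents.
    recon = F.cross_entropy(recon_logits.transpose(1, 2), target_tokens)
    # KL term keeps the posterior close to a standard normal prior;
    # beta trades fidelity against compression (and posterior-collapse risk).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = recon + beta * kl
    # Optional BERT-style objective that encourages the latents to stay
    # semantically predictive rather than acting as a pure copy channel.
    if mlm_logits is not None:
        loss = loss + lambda_mlm * F.cross_entropy(
            mlm_logits.transpose(1, 2), mlm_targets, ignore_index=-100)
    return loss
```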
We also agree with your point that the latent block size is more than a simple hyperparameter. It effectively controls the granularity at which the model organizes semantic information, so it can become a structural bottleneck if chosen poorly. That is why we include ablations on different block sizes, and the results suggest that a moderate block size works better than either very fine-grained or overly coarse latent grouping.
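As a rough illustration of what the block size controls (not the paper's code), the helper below groups token embeddings into latent blocks; each block is then carried by a single continuous latent that the prior has to model.

```python
# Illustrative helper: the latent block size sets the granularity at which
# semantic information is packed into each continuous latent.
import torch

def group_into_blocks(token_embeddings: torch.Tensor, block_size: int) -> torch.Tensor:
    """token_embeddings: (batch, seq_len, d_model), seq_len padded to a multiple
    of block_size. Returns (batch, seq_len // block_size, block_size * d_model)."""
    b, l, d = token_embeddings.shape
    assert l % block_size == 0, "pad the sequence to a multiple of block_size"
    return token_embeddings.view(b, l // block_size, block_size * d)

# block_size = 1 reduces to token-level latents; very large blocks force each
# latent to carry too much local detail -- the "structural bottleneck" above.
blocks = group_into_blocks(torch.randn(2, 32, 64), block_size=4)  # -> (2, 8, 256)
```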
For longer prompts, you can check our results on long-context understanding tasks such as RACE and SQuAD, where Cola shows encouraging performance compared with AR and other baselines. That said, we completely agree that extending this to much longer contexts is an important next step, especially when the latent prior needs to reorganize information more aggressively. Multilingual modeling is also a very interesting direction, since it directly tests whether the latent space is capturing language-independent semantics rather than surface token patterns.
Thanks again for the careful reading and for mentioning the arxivlens breakdown! We expect to release the code in around 1–2 weeks, and we would be very happy to have more people explore these questions together with us.
Thank you all for your attention! The code will be open-sourced in about two weeks, and everyone is welcome to explore it together.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Consistent Diffusion Language Models (2026)
- The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook (2026)
- Semantic-Aware Prefix Learning for Token-Efficient Image Generation (2026)
- Diffusion Language Models for Speech Recognition (2026)
- LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling (2026)
- ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model (2026)
- Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching (2026)
Personally, I believe that representation is the central problem of this work.
What Cola DLM aims to emphasize is not merely diffusion itself, but a more fundamental perspective: must language modeling necessarily be bound to discrete tokens? Tokens are only the surface carriers of language, while what is truly more stable and transferable may be the continuous semantic representations hidden behind different token realizations.
Therefore, what we care more about is how to decompose text generation into two levels through a hierarchical modeling approach: one level is high-level semantic organization in continuous space, and the other is the realization of concrete token sequences. In other words, the VAE is responsible for establishing a semantic interface from text to latent space, the DiT is responsible for modeling the continuous latent prior, and the decoder then instantiates latent semantics into text. But from a more general perspective, the VAE is not the only choice; it can be replaced by other stronger representation models. Diffusion is also not the only choice; it can be replaced by other continuous distribution matching or prior modeling methods. What we truly want to explore is: does there exist an information carrier that is more efficient, more elegant, and closer to the essence of semantics than discrete tokens?
The compression experiments in the paper’s discussion section also provide a very interesting signal: when latent grouping is aligned with text boundaries, a simple 2× compression does not necessarily harm generation, and may even lead to better results. This suggests that tokens may not be the optimal carrier of information: a moderately compressed continuous latent may be better suited to carrying global semantic structure, while local details of expression are left to the decoder. This is exactly the core point that Cola DLM tries to convey: we do not merely want to recover tokens, but to learn the semantic structures behind them.
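For readers who want to picture the 2× case, here is a minimal sketch under our own assumptions: adjacent block latents are merged by mean pooling before prior modeling, whereas the actual setup may use a learned, boundary-aware merge.

```python
# Hedged sketch of 2x latent compression: pairs of adjacent block latents are
# merged into one, leaving local detail for the decoder to recover.
import torch

def compress_latents_2x(z: torch.Tensor) -> torch.Tensor:
    """z: (batch, num_blocks, d_latent) with num_blocks even.
    Returns (batch, num_blocks // 2, d_latent)."""
    b, n, d = z.shape
    assert n % 2 == 0, "pad so the number of latent blocks is even"
    return z.view(b, n // 2, 2, d).mean(dim=2)

z = torch.randn(2, 16, 256)
print(compress_latents_2x(z).shape)  # torch.Size([2, 8, 256])
```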
Furthermore, this idea naturally points toward unified multimodal modeling. Representation learning in vision is already quite mature, whereas text has long been dominated by tokenization. In the final section of the discussion, we provide a preliminary example that places visual prior representations and textual prior representations in the same continuous space for unified modeling. Although this result is still very preliminary, it at least suggests that latent representations from different modalities are not completely isolated from each other; they may become compatible and aligned in a higher-level semantic space.
Of course, this is only a beginning. How representations from different modalities interact with each other, how to construct text latents that are more efficient, more abstract, and more semantic, and how to balance compression ratio, semantic fidelity, and the difficulty of prior modeling are all questions that are highly worth deeper exploration in the future.
Finally, we sincerely thank the community for its attention, discussion, and promotion of Cola DLM. We warmly welcome everyone to discuss it with us, and we sincerely hope to receive more feedback, suggestions, and criticism. This is a relatively radical attempt, and many problems are still far from being solved. Every comment from the community will be extremely helpful for shaping our future research directions. Thank you all!