Based on the paper *Large Language Diffusion Models* (arXiv:2502.09992).
A masked discrete diffusion model for English-to-German machine translation, trained from scratch on WMT14 EN-DE.
| Component | Detail |
|---|---|
| Type | Masked Discrete Diffusion |
| Backbone | DiT (Diffusion Transformer) with adaLN |
| Parameters | ~72M |
| Blocks | 12 DiT blocks |
| Hidden dim | 512, 8 attention heads |
| Attention | Bidirectional (no causal mask) with RoPE |
| Conditioning | Timestep via sinusoidal embeddings + adaLN; Segment embeddings for src/tgt |
| Weight tying | Input embeddings tied to output projection |
| Tokenizer | Helsinki-NLP/opus-mt-en-de (~58K vocab) |
| Max sequence | 128 src + 128 tgt tokens |
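For concreteness, here is the table above rendered as a config sketch. The field names are illustrative assumptions; the actual schema is `DiffusionTranslatorConfig` in `train.py`.

```python
# Hypothetical config sketch mirroring the architecture table. Field names
# are assumptions for illustration; see DiffusionTranslatorConfig in train.py.
ARCH = dict(
    num_blocks=12,        # DiT blocks
    hidden_dim=512,
    num_heads=8,
    vocab_size=58_000,    # Helsinki-NLP/opus-mt-en-de tokenizer (~58K)
    max_src_len=128,
    max_tgt_len=128,
    tie_embeddings=True,  # input embeddings tied to output projection
)
```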
Each training example is packed as `[source | target]`. A timestep t ~ Uniform(0, 1) is sampled per example, each target token is masked independently with probability t, and the loss on masked positions is weighted by 1/t (the continuous-time ELBO). At inference, decoding starts from `[source | MASK MASK ... MASK]`; at each step t → s the model predicts all masked tokens, then randomly re-masks a fraction s/t of them. A sketch of one training step follows the training table below.

| Setting | Value |
|---|---|
| Dataset | WMT14 EN-DE (~4.5M parallel sentence pairs) |
| Optimizer | AdamW (lr=3e-4, β₁=0.9, β₂=0.98, wd=0.01) |
| Schedule | Cosine with 4K linear warmup |
| Effective batch size | 256 (64 × 4 gradient accumulation) |
| Max steps | 200,000 |
| Mixed precision | FP16 |
| Gradient clipping | max_norm=1.0 |
| Evaluation | SacreBLEU on WMT14 test set every 20K steps |
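To make the objective concrete, here is a minimal sketch of one training step. The `model(src_ids, noisy_tgt)` call signature and shapes are assumptions for illustration; the reference implementation is in `train.py`.

```python
import torch
import torch.nn.functional as F

def training_step_sketch(model, src_ids, tgt_ids, mask_id):
    """Illustrative masked-diffusion training step; train.py is the reference."""
    B, L = tgt_ids.shape
    # Sample t ~ Uniform(0, 1) per example (clamped to avoid dividing by ~0).
    t = torch.rand(B, 1, device=tgt_ids.device).clamp(min=1e-3)
    # Corrupt: mask each target token independently with probability t.
    is_masked = torch.rand(B, L, device=tgt_ids.device) < t
    noisy_tgt = torch.where(is_masked, torch.full_like(tgt_ids, mask_id), tgt_ids)
    # Bidirectional denoiser sees [source | noisy target], predicts all positions.
    logits = model(src_ids, noisy_tgt)                      # (B, L, vocab)
    per_tok = F.cross_entropy(logits.transpose(1, 2), tgt_ids,
                              reduction="none")             # (B, L)
    # Continuous-time ELBO: 1/t-weighted cross-entropy on masked positions only.
    loss = (per_tok * is_masked / t).sum() / is_masked.sum().clamp(min=1)
    return loss
```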
```bash
pip install torch transformers datasets trackio sacrebleu sacremoses sentencepiece protobuf
git clone https://huggingface.co/vedkdev/text-diffusion-en-de
cd text-diffusion-en-de
python train.py
```
The script will:

- download and tokenize WMT14 EN-DE via `datasets`,
- train for up to 200K steps with the settings above,
- evaluate SacreBLEU on the WMT14 test set every 20K steps,
- save the best checkpoint to `checkpoints/best/`.
To fit smaller GPUs, edit the `TRAIN_CONFIG` dict in `train.py` (a hypothetical excerpt follows the table below):
| GPU VRAM | Recommended batch_size | gradient_accumulation_steps |
|---|---|---|
| 24GB (A10G/3090/4090) | 64 | 4 |
| 16GB (T4/V100) | 32 | 8 |
| 12GB (3060) | 16 | 16 |
| 8GB (3070) | 8 | 32 |
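For example, on a 16GB card the edit might look like this. The key names are assumptions inferred from the table; check the actual dict in `train.py`.

```python
# Hypothetical TRAIN_CONFIG excerpt; key names are assumptions.
TRAIN_CONFIG = {
    "batch_size": 32,                  # per-step batch on a 16GB T4/V100
    "gradient_accumulation_steps": 8,  # keeps the effective batch at 256
    # ...remaining settings unchanged
}
```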
```python
import json

import torch
from transformers import AutoTokenizer

from train import DiffusionTranslator, DiffusionTranslatorConfig, generate

# Load checkpoint
with open("checkpoints/best/config.json") as f:
    config = DiffusionTranslatorConfig(**json.load(f))
model = DiffusionTranslator(config)
model.load_state_dict(torch.load("checkpoints/best/model.pt", map_location="cpu"))
model.eval()
tokenizer = AutoTokenizer.from_pretrained("checkpoints/best/")

# Translate
text = "The weather is nice today."
src = tokenizer(f"translate English to German: {text}",
                max_length=128, truncation=True, padding="max_length",
                return_tensors="pt")
gen_ids = generate(model, src["input_ids"], torch.zeros_like(src["input_ids"]),
                   config, num_steps=50, device="cpu")
print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))
```
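Under the hood, `generate` implements the re-masking sampler described earlier. Here is a minimal sketch of the idea, not the repo's actual code; the `model(src_ids, tgt)` signature and greedy argmax decoding are simplifying assumptions.

```python
import torch

@torch.no_grad()
def sample_sketch(model, src_ids, mask_id, tgt_len=128, num_steps=50):
    """Illustrative re-masking sampler; the repo's generate() is the reference."""
    B = src_ids.size(0)
    tgt = torch.full((B, tgt_len), mask_id, dtype=torch.long)
    for i in range(num_steps):
        t = 1.0 - i / num_steps        # current mask fraction
        s = 1.0 - (i + 1) / num_steps  # target mask fraction after this step
        logits = model(src_ids, tgt)   # predict every position at once
        pred = logits.argmax(-1)       # greedy fill-in (assumption)
        still_masked = tgt == mask_id
        # Fill all masked positions, then randomly re-mask a fraction s/t of
        # them so roughly s of the target stays masked for the next step.
        tgt = torch.where(still_masked, pred, tgt)
        remask = still_masked & (torch.rand(B, tgt_len) < (s / t))
        tgt[remask] = mask_id
    return tgt
```

More steps unmask fewer tokens per iteration, which typically improves quality at a linear cost in forward passes.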
Expected performance, based on published results for similar architectures on WMT14 EN→DE:
| Model | BLEU | Reference |
|---|---|---|
| Autoregressive Transformer | ~27 | Vaswani et al., 2017 |
| DiNoiSer (continuous diffusion) | 24.6 | Ye et al. 2023 |
| SeqDiffuSeq | 19.8 | Yuan et al. 2022 |
| E2D2 (discrete diffusion) | 24.8 | Kuleshov et al. 2024 |
| This model (target) | 15-20 | ~72M params, no KD |
Note: Text diffusion models typically score 2-5 BLEU below autoregressive transformers of similar size. Knowledge distillation (KD) from an AR teacher can close the gap by ~1-2 BLEU.
If you use this model, please cite the foundational papers:
```bibtex
@inproceedings{sahoo2024mdlm,
  title={Simple and Effective Masked Diffusion Language Models},
  author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Kuleshov, Volodymyr},
  booktitle={NeurIPS},
  year={2024}
}

@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and Zhu, Fengqi and You, Chao and Zhang, Xiaojun and Ou, Zhenguo and Zhu, Jun},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}

@inproceedings{ye2023dinoiser,
  title={DiNoiSer: Diffused Conditional Sequence Learning by Manipulating Noises},
  author={Ye, Jiasheng and Zheng, Zaixiang and Bao, Yu and Qian, Lihua and Gu, Quanquan},
  booktitle={ACL},
  year={2023}
}
```