Paper: [Simple and Effective Masked Diffusion Language Models](https://arxiv.org/abs/2406.07524)
A small Masked Diffusion Language Model (MDLM) trained on TinyStories.
| Property | Value |
|---|---|
| Architecture | DiT + adaLN-zero + RoPE + bidirectional attention |
| Parameters | 29.4M |
| Layers | 4 |
| Hidden dim | 256 |
| Heads | 4 |
| Context length | 128 tokens |
| Tokenizer | GPT-2 (50,257 tokens + 1 [MASK] token) |
| Training | MDLM (Rao-Blackwellized ELBO) |
| Dataset | TinyStories (50k examples subset) |
| Steps | 1500 |
| Best val loss | 7.8963 |
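The table roughly corresponds to a config like the sketch below. The constructor field names are guesses for illustration only; check `model.py` for the actual `MDLMConfig` signature.

```python
from model import MDLMConfig, MDLM

# Hypothetical field names -- see model.py for the real MDLMConfig fields.
config = MDLMConfig(
    vocab_size=50_258,   # 50,257 GPT-2 tokens + 1 [MASK] token
    n_layers=4,
    d_model=256,
    n_heads=4,
    max_seq_len=128,
)
model = MDLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # ~29.4M per the table
```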
Unlike autoregressive LMs, which decode one token at a time left to right, MDLM generates text through iterative denoising: the sequence starts as all [MASK] tokens, and the model progressively unmasks positions over a fixed number of steps until none remain. Based on Simple and Effective Masked Diffusion Language Models (Sahoo et al., NeurIPS 2024).
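The `sample` helper in `model.py` wraps this process. Below is a minimal sketch of the underlying ancestral-unmasking loop, assuming a log-linear schedule and a forward pass `model(x, t)` that returns per-position logits; the names and signatures here are illustrative, not the repo's exact code.

```python
import torch

@torch.no_grad()
def unmask_sample(model, seq_len, num_steps, mask_id, temperature=1.0, device="cpu"):
    # Start from an all-[MASK] canvas and reveal tokens over num_steps.
    x = torch.full((1, seq_len), mask_id, device=device)
    for step in range(num_steps):
        t = 1.0 - step / num_steps            # current noise level, goes 1 -> 0
        s = 1.0 - (step + 1) / num_steps      # next noise level
        logits = model(x, torch.tensor([t], device=device))  # assumed forward signature
        probs = torch.softmax(logits / temperature, dim=-1)
        proposal = torch.distributions.Categorical(probs).sample()
        still_masked = x == mask_id
        # Under the log-linear schedule, each masked position is revealed with
        # probability (t - s) / t; revealed tokens are never re-masked (carry-over).
        reveal = still_masked & (torch.rand_like(x, dtype=torch.float) < (t - s) / t)
        x = torch.where(reveal, proposal, x)
    return x
```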
To generate text with the pretrained checkpoint:

```python
import torch
from transformers import AutoTokenizer

from model import MDLMConfig, MDLM, sample

# Load model and the GPT-2 tokenizer
model = MDLM.from_pretrained("youraveragedev/mdlm-tiny-stories", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("youraveragedev/mdlm-tiny-stories")
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Unconditional generation: 100 denoising steps over a 128-token canvas
generated_ids = sample(model, seq_len=128, batch_size=1, num_steps=100, temperature=0.7, device=device)
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(text)
```
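With `seq_len=128` and `num_steps=100`, the sampler reveals just over one token per step on average; lowering `num_steps` speeds up generation but typically costs sample quality, while `temperature` controls the sharpness of each unmasking draw.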
From the MDLM paper (NeurIPS 2024): under the log-linear noise schedule $\alpha_t = 1 - t$, the continuous-time Rao-Blackwellized ELBO used for training reduces to a cross-entropy over masked positions, weighted by $1/t$ with $t \sim U(0,1)$:

$$\mathcal{L}_\infty = \mathbb{E}_{t \sim U(0,1)}\, \mathbb{E}_q\!\left[ \frac{1}{t} \sum_{\ell \,:\, z_t^\ell = \mathbf{m}} -\log \big\langle \mathbf{x}_\theta^\ell(\mathbf{z}_t),\, \mathbf{x}^\ell \big\rangle \right]$$
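A minimal sketch of a training step implementing this objective, assuming a forward pass `model(zt, t)` that returns per-position logits; the function and argument names are illustrative, not the repo's actual training code.

```python
import torch
import torch.nn.functional as F

def mdlm_loss(model, x0, mask_id, eps=1e-3):
    # x0: (B, L) clean token ids. Sample t ~ U(eps, 1) per example
    # (eps keeps the 1/t weight from blowing up near t = 0).
    B, L = x0.shape
    t = eps + (1 - eps) * torch.rand(B, device=x0.device)
    masked = torch.rand(B, L, device=x0.device) < t[:, None]   # mask each token w.p. t
    zt = torch.where(masked, torch.full_like(x0, mask_id), x0)
    logits = model(zt, t)                                      # assumed forward signature
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    # Rao-Blackwellized ELBO: 1/t-weighted cross-entropy over masked positions only;
    # normalizing by the masked-token count is one practical choice.
    loss = ((ce * masked) / t[:, None]).sum() / masked.sum().clamp(min=1)
    return loss
```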
Citation:

```bibtex
@inproceedings{sahoo2024simple,
title={Simple and Effective Masked Diffusion Language Models},
author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Chiu, Justin T and Rush, Alexander and Kuleshov, Volodymyr},
booktitle={NeurIPS},
year={2024}
}
```