Text Diffusion Model for EN→DE Translation

A masked discrete diffusion model for English-to-German machine translation, trained from scratch on WMT14 EN-DE.

Architecture

| Component | Detail |
|---|---|
| Type | Masked discrete diffusion |
| Backbone | DiT (Diffusion Transformer) with adaLN |
| Parameters | ~72M |
| Blocks | 12 DiT blocks |
| Hidden dim | 512, with 8 attention heads |
| Attention | Bidirectional (no causal mask) with RoPE |
| Conditioning | Timestep via sinusoidal embeddings + adaLN; segment embeddings for src/tgt |
| Weight tying | Input embeddings tied to output projection |
| Tokenizer | Helsinki-NLP/opus-mt-en-de (~58K vocab) |
| Max sequence | 128 src + 128 tgt tokens |
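
To make the adaLN conditioning concrete, here is a minimal PyTorch sketch of one DiT block in the spirit of the table above. It is illustrative rather than an excerpt from train.py: the layer names are invented, and RoPE and the segment embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One bidirectional transformer block with adaLN timestep conditioning (illustrative)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN: the timestep embedding predicts shift/scale/gate for both sub-layers
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, t_emb):  # x: (B, L, dim), t_emb: (B, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.ada_ln(t_emb).unsqueeze(1).chunk(6, dim=-1)
        )
        h = self.norm1(x) * (1 + scale1) + shift1
        h, _ = self.attn(h, h, h, need_weights=False)  # no causal mask: fully bidirectional
        x = x + gate1 * h
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + gate2 * self.mlp(h)
```

The full backbone stacks 12 such blocks over tied token embeddings and projects back to the ~58K-entry vocabulary at the output.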

Inspired by

  • MDLM — DiT backbone architecture, masked diffusion objective
  • LLaDA — Conditional generation via SFT (keep prompt unmasked, mask only target), 1/t ELBO weighting (written out below)
  • DiNoiSer — Noise manipulation for conditional seq2seq diffusion
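
For reference, the 1/t-weighted objective mentioned above can be written as the continuous-time ELBO bound used in MDLM and LLaDA. The notation here is generic rather than copied from train.py: x is the target sequence, c the source, and x_t the target with each token independently replaced by [MASK] with probability t.

```latex
\[
\mathcal{L}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}(0,1)}\,
    \mathbb{E}_{x_t \sim q(x_t \mid x)}
    \Bigg[ \frac{1}{t}
      \sum_{i:\; x_t^{(i)} = \texttt{[MASK]}}
      -\log p_\theta\!\bigl(x^{(i)} \mid x_t,\, c\bigr)
    \Bigg]
\]
```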

How It Works

Training (Forward Diffusion)

  1. Source (EN) and target (DE) tokens are concatenated: [source | target]
  2. A random masking rate t ~ Uniform(0, 1) is sampled per example
  3. Each target token is independently masked with probability t
  4. The bidirectional DiT predicts all masked tokens simultaneously
  5. Loss = cross-entropy on masked positions only, weighted by 1/t (continuous-time ELBO); a code sketch of this step follows the list
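
A minimal sketch of steps 1-5, assuming a `model(input_ids, t)` call that returns per-position logits and a `mask_id` for the tokenizer's mask token (both are assumptions about train.py, not its actual interface):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, src_ids, tgt_ids, mask_id):
    B = tgt_ids.size(0)
    # Steps 1-2: sample a masking rate t ~ Uniform(0, 1) per example (clamped away from 0)
    t = torch.rand(B, 1, device=tgt_ids.device).clamp(min=1e-3)
    # Step 3: independently mask each target token with probability t; the source stays clean
    masked = torch.rand_like(tgt_ids, dtype=torch.float) < t
    noisy_tgt = torch.where(masked, torch.full_like(tgt_ids, mask_id), tgt_ids)
    x = torch.cat([src_ids, noisy_tgt], dim=1)        # [source | noised target]
    # Step 4: the bidirectional model predicts all positions at once
    logits = model(x, t.squeeze(-1))                  # (B, L_src + L_tgt, vocab)
    tgt_logits = logits[:, src_ids.size(1):, :]
    # Step 5: cross-entropy on masked target positions only, weighted by 1/t (ELBO weight);
    # normalizing by the number of masked tokens is one common choice
    ce = F.cross_entropy(tgt_logits.transpose(1, 2), tgt_ids, reduction="none")
    return ((ce * masked.float()) / t).sum() / masked.float().sum().clamp(min=1.0)
```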

Inference (Reverse Diffusion)

  1. Start with source tokens + fully masked target: [source | MASK MASK ... MASK]
  2. Over 50 denoising steps, iteratively predict and unmask tokens
  3. At each step t → s: predict all masked tokens, randomly re-mask a fraction s/t
  4. Final step: all remaining masks are filled with predictions (see the sketch below)
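
A stripped-down version of that loop, again assuming a `model(input_ids, t)` interface and using greedy argmax decoding; the actual generate() in train.py may sample or use confidence-based unmasking instead:

```python
import torch

@torch.no_grad()
def sample(model, src_ids, tgt_len, mask_id, num_steps=50):
    B = src_ids.size(0)
    # Start from a fully masked target
    tgt = torch.full((B, tgt_len), mask_id, dtype=torch.long, device=src_ids.device)
    for i in range(num_steps):
        t = 1.0 - i / num_steps          # current noise level
        s = 1.0 - (i + 1) / num_steps    # next (lower) noise level
        x = torch.cat([src_ids, tgt], dim=1)
        logits = model(x, torch.full((B,), t, device=src_ids.device))
        pred = logits[:, src_ids.size(1):, :].argmax(dim=-1)
        still_masked = tgt.eq(mask_id)
        if s > 0:
            # Re-mask a fraction s/t of the currently masked positions, unmask the rest
            keep_masked = (torch.rand_like(tgt, dtype=torch.float) < (s / t)) & still_masked
        else:
            # Final step: fill every remaining mask with its prediction
            keep_masked = torch.zeros_like(still_masked)
        tgt = torch.where(still_masked & ~keep_masked, pred, tgt)
    return tgt
```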

Training Details

| Setting | Value |
|---|---|
| Dataset | WMT14 EN-DE (~4.5M parallel sentence pairs) |
| Optimizer | AdamW (lr=3e-4, β₁=0.9, β₂=0.98, wd=0.01) |
| Schedule | Cosine decay with 4K-step linear warmup |
| Effective batch size | 256 (64 × 4 gradient accumulation) |
| Max steps | 200,000 |
| Mixed precision | FP16 |
| Gradient clipping | max_norm=1.0 |
| Evaluation | SacreBLEU on the WMT14 test set every 20K steps |
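
The optimizer and schedule rows above correspond roughly to the following setup. This is an illustrative reconstruction, not an excerpt from train.py:

```python
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the DiT; use the real model in practice

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.98), weight_decay=0.01)

def lr_lambda(step, warmup=4_000, max_steps=200_000):
    # 4K-step linear warmup, then cosine decay over the remaining steps
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, max_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())  # FP16 mixed precision

# Per optimizer step (after 4 gradient-accumulation micro-batches of 64):
#   scaler.unscale_(optimizer)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   scaler.step(optimizer); scaler.update(); scheduler.step(); optimizer.zero_grad()
```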

Quick Start

Install dependencies

pip install torch transformers datasets trackio sacrebleu sacremoses sentencepiece protobuf

Train

git clone https://huggingface.co/vedkdev/text-diffusion-en-de
cd text-diffusion-en-de
python train.py

The script will:

  • Download WMT14 EN-DE automatically
  • Train for 200K steps with logging via Trackio
  • Evaluate SacreBLEU periodically
  • Push checkpoints to this repo

Adjusting for your hardware

Edit the TRAIN_CONFIG dict in train.py (a minimal example follows the table):

| GPU VRAM | Recommended batch_size | gradient_accumulation_steps |
|---|---|---|
| 24 GB (A10G/3090/4090) | 64 | 4 |
| 16 GB (T4/V100) | 32 | 8 |
| 12 GB (3060) | 16 | 16 |
| 8 GB (3070) | 8 | 32 |
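
For example, on a 16 GB GPU the two relevant entries would look roughly like this; batch_size and gradient_accumulation_steps come from the table above, while the surrounding dict layout is a guess at train.py's structure:

```python
# Hypothetical excerpt of TRAIN_CONFIG in train.py for a 16 GB GPU.
# Only the two keys from the table are shown; the rest of the dict stays unchanged.
TRAIN_CONFIG = {
    # ...
    "batch_size": 32,                   # per-device micro-batch size
    "gradient_accumulation_steps": 8,   # 32 × 8 keeps the effective batch size at 256
    # ...
}
```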

Inference (after training)

import torch, json
from train import DiffusionTranslator, DiffusionTranslatorConfig, generate
from transformers import AutoTokenizer

# Load checkpoint
config = DiffusionTranslatorConfig(**json.load(open("checkpoints/best/config.json")))
model = DiffusionTranslator(config)
model.load_state_dict(torch.load("checkpoints/best/model.pt", map_location="cpu"))
model.eval()

tokenizer = AutoTokenizer.from_pretrained("checkpoints/best/")

# Translate
text = "The weather is nice today."
src = tokenizer(f"translate English to German: {text}",
                max_length=128, truncation=True, padding="max_length",
                return_tensors="pt")

gen_ids = generate(model, src["input_ids"], torch.zeros_like(src["input_ids"]),
                   config, num_steps=50, device="cpu")
print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))

Expected Results

Based on published literature for similar architectures on WMT14 EN→DE:

| Model | BLEU | Reference |
|---|---|---|
| Autoregressive Transformer | ~27 | Vaswani et al. 2017 |
| DiNoiSer (continuous diffusion) | 24.6 | Ye et al. 2023 |
| SeqDiffuSeq | 19.8 | Yuan et al. 2022 |
| E2D2 (discrete diffusion) | 24.8 | Kuleshov et al. 2024 |
| This model (target) | 15-20 | ~72M params, no KD |

Note: Text diffusion models typically score 2-5 BLEU below autoregressive transformers of similar size. Knowledge distillation (KD) from an AR teacher can close the gap by ~1-2 BLEU.

Citation

If you use this model, please cite the foundational papers:

@article{sahoo2024mdlm,
  title={Simple and Effective Masked Diffusion Language Models},
  author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Kuleshov, Volodymyr},
  journal={NeurIPS},
  year={2024}
}

@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and Zhu, Fengqi and You, Chao and Zhang, Xiaojun and Ou, Zhenguo and Zhu, Jun},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}

@article{ye2023dinoiser,
  title={DiNoiSer: Diffused Conditional Sequence Learning by Manipulating Noises},
  author={Ye, Jiasheng and Zheng, Zaixiang and Bao, Yu and Qian, Lihua and Gu, Quanquan},
  journal={ACL},
  year={2023}
}