Based on the paper *Large Language Diffusion Models* (arXiv:2502.09992).
A masked discrete diffusion model for English-to-German machine translation, trained from scratch on WMT14 EN-DE.
| Component | Detail |
|---|---|
| Type | Masked Discrete Diffusion |
| Backbone | DiT (Diffusion Transformer) with adaLN |
| Parameters | ~72M |
| Blocks | 12 DiT blocks |
| Hidden dim | 512, 8 attention heads |
| Attention | Bidirectional (no causal mask) with RoPE |
| Conditioning | Timestep via sinusoidal embeddings + adaLN; Segment embeddings for src/tgt |
| Weight tying | Input embeddings tied to output projection |
| Tokenizer | Helsinki-NLP/opus-mt-en-de (~58K vocab) |
| Max sequence | 128 src + 128 tgt tokens |
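For concreteness, here is the table above rendered as a config sketch. The field names are illustrative assumptions; the actual schema is `DiffusionTranslatorConfig` in `train.py`.

```python
# Hypothetical config sketch mirroring the architecture table. Field names
# are assumptions for illustration; see DiffusionTranslatorConfig in train.py.
ARCH = dict(
    num_blocks=12,        # DiT blocks
    hidden_dim=512,
    num_heads=8,
    vocab_size=58_000,    # Helsinki-NLP/opus-mt-en-de tokenizer (~58K)
    max_src_len=128,
    max_tgt_len=128,
    tie_embeddings=True,  # input embeddings tied to output projection
)
```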
Each training example is packed as `[source | target]`. A timestep t ~ Uniform(0, 1) is sampled per example, each target token is masked independently with probability t, and the loss on masked positions is weighted by 1/t (the continuous-time ELBO). At inference, decoding starts from `[source | MASK MASK ... MASK]`; at each step t → s the model predicts all masked tokens, then randomly re-masks a fraction s/t of them. A sketch of one training step follows the training table below.

| Setting | Value |
|---|---|
| Dataset | WMT14 EN-DE (~4.5M parallel sentence pairs) |
| Optimizer | AdamW (lr=3e-4, β₁=0.9, β₂=0.98, wd=0.01) |
| Schedule | Cosine with 4K linear warmup |
| Effective batch size | 256 (64 × 4 gradient accumulation) |
| Max steps | 200,000 |
| Mixed precision | FP16 |
| Gradient clipping | max_norm=1.0 |
| Evaluation | SacreBLEU on WMT14 test set every 20K steps |
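To make the objective concrete, here is a minimal sketch of one training step. The `model(src_ids, noisy_tgt)` call signature and shapes are assumptions for illustration; the reference implementation is in `train.py`.

```python
import torch
import torch.nn.functional as F

def training_step_sketch(model, src_ids, tgt_ids, mask_id):
    """Illustrative masked-diffusion training step; train.py is the reference."""
    B, L = tgt_ids.shape
    # Sample t ~ Uniform(0, 1) per example (clamped to avoid dividing by ~0).
    t = torch.rand(B, 1, device=tgt_ids.device).clamp(min=1e-3)
    # Corrupt: mask each target token independently with probability t.
    is_masked = torch.rand(B, L, device=tgt_ids.device) < t
    noisy_tgt = torch.where(is_masked, torch.full_like(tgt_ids, mask_id), tgt_ids)
    # Bidirectional denoiser sees [source | noisy target], predicts all positions.
    logits = model(src_ids, noisy_tgt)                      # (B, L, vocab)
    per_tok = F.cross_entropy(logits.transpose(1, 2), tgt_ids,
                              reduction="none")             # (B, L)
    # Continuous-time ELBO: 1/t-weighted cross-entropy on masked positions only.
    loss = (per_tok * is_masked / t).sum() / is_masked.sum().clamp(min=1)
    return loss
```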
```bash
pip install torch transformers datasets trackio sacrebleu sacremoses sentencepiece protobuf
git clone https://huggingface.co/vedkdev/text-diffusion-en-de
cd text-diffusion-en-de
python train.py
```
The script will:

- download and tokenize WMT14 EN-DE via `datasets`,
- train for up to 200K steps with the settings above,
- evaluate SacreBLEU on the WMT14 test set every 20K steps,
- save the best checkpoint to `checkpoints/best/`.
To fit smaller GPUs, edit the `TRAIN_CONFIG` dict in `train.py` (a hypothetical excerpt follows the table below):
| GPU VRAM | Recommended batch_size | gradient_accumulation_steps |
|---|---|---|
| 24GB (A10G/3090/4090) | 64 | 4 |
| 16GB (T4/V100) | 32 | 8 |
| 12GB (3060) | 16 | 16 |
| 8GB (3070) | 8 | 32 |
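For example, on a 16GB card the edit might look like this. The key names are assumptions inferred from the table; check the actual dict in `train.py`.

```python
# Hypothetical TRAIN_CONFIG excerpt; key names are assumptions.
TRAIN_CONFIG = {
    "batch_size": 32,                  # per-step batch on a 16GB T4/V100
    "gradient_accumulation_steps": 8,  # keeps the effective batch at 256
    # ...remaining settings unchanged
}
```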
```python
import json

import torch
from transformers import AutoTokenizer

from train import DiffusionTranslator, DiffusionTranslatorConfig, generate

# Load checkpoint
with open("checkpoints/best/config.json") as f:
    config = DiffusionTranslatorConfig(**json.load(f))
model = DiffusionTranslator(config)
model.load_state_dict(torch.load("checkpoints/best/model.pt", map_location="cpu"))
model.eval()
tokenizer = AutoTokenizer.from_pretrained("checkpoints/best/")

# Translate
text = "The weather is nice today."
src = tokenizer(f"translate English to German: {text}",
                max_length=128, truncation=True, padding="max_length",
                return_tensors="pt")
gen_ids = generate(model, src["input_ids"], torch.zeros_like(src["input_ids"]),
                   config, num_steps=50, device="cpu")
print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))
```
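Under the hood, `generate` implements the re-masking sampler described earlier. Here is a minimal sketch of the idea, not the repo's actual code; the `model(src_ids, tgt)` signature and greedy argmax decoding are simplifying assumptions.

```python
import torch

@torch.no_grad()
def sample_sketch(model, src_ids, mask_id, tgt_len=128, num_steps=50):
    """Illustrative re-masking sampler; the repo's generate() is the reference."""
    B = src_ids.size(0)
    tgt = torch.full((B, tgt_len), mask_id, dtype=torch.long)
    for i in range(num_steps):
        t = 1.0 - i / num_steps        # current mask fraction
        s = 1.0 - (i + 1) / num_steps  # target mask fraction after this step
        logits = model(src_ids, tgt)   # predict every position at once
        pred = logits.argmax(-1)       # greedy fill-in (assumption)
        still_masked = tgt == mask_id
        # Fill all masked positions, then randomly re-mask a fraction s/t of
        # them so roughly s of the target stays masked for the next step.
        tgt = torch.where(still_masked, pred, tgt)
        remask = still_masked & (torch.rand(B, tgt_len) < (s / t))
        tgt[remask] = mask_id
    return tgt
```

More steps unmask fewer tokens per iteration, which typically improves quality at a linear cost in forward passes.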
Expected performance, based on published results for similar architectures on WMT14 EN→DE:
| Model | BLEU | Reference |
|---|---|---|
| Autoregressive Transformer | ~27 | Vaswani et al., 2017 |
| DiNoiSer (continuous diffusion) | 24.6 | Ye et al. 2023 |
| SeqDiffuSeq | 19.8 | Yuan et al. 2022 |
| E2D2 (discrete diffusion) | 24.8 | Kuleshov et al. 2024 |
| This model (target) | 15-20 | ~72M params, no KD |
Note: Text diffusion models typically score 2-5 BLEU below autoregressive transformers of similar size. Knowledge distillation (KD) from an AR teacher can close the gap by ~1-2 BLEU.
If you use this model, please cite the foundational papers:
```bibtex
@inproceedings{sahoo2024mdlm,
  title={Simple and Effective Masked Diffusion Language Models},
  author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Kuleshov, Volodymyr},
  booktitle={NeurIPS},
  year={2024}
}

@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and Zhu, Fengqi and You, Chao and Zhang, Xiaojun and Ou, Zhenguo and Zhu, Jun},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}

@inproceedings{ye2023dinoiser,
  title={DiNoiSer: Diffused Conditional Sequence Learning by Manipulating Noises},
  author={Ye, Jiasheng and Zheng, Zaixiang and Bao, Yu and Qian, Lihua and Gu, Quanquan},
  booktitle={ACL},
  year={2023}
}
```