Fast-dLLM v2 (1.5B): Efficient Block-Diffusion LLM

📖 Introduction

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency.

We present Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts a pretrained AR model (Qwen2.5-1.5B-Instruct) into a diffusion-style decoder for parallel text generation.

Our approach introduces a novel decoding recipe incorporating a complementary attention mask and block diffusion mechanism, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a sub-block level cache that supports efficient parallel decoding within partially generated blocks.
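To make "blockwise bidirectional context" concrete, here is a minimal sketch of one plausible attention pattern: tokens attend to every position in their own block and to all earlier blocks, but never to later blocks. The function name and the exact pattern are illustrative assumptions, not the model's actual mask implementation.

```python
from __future__ import annotations


def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key j.

    Assumed pattern (illustrative only): bidirectional attention inside a
    block, causal attention across blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        # Key j is visible if its block index is not later than the query's.
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = block_attention_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

With block_size=1 this reduces to an ordinary causal mask, which is one way to see how the block pattern generalizes AR attention.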

✨ Key Innovations

  • Block Diffusion Mechanism + Complementary Attention Mask
    Enables blockwise bidirectional context modeling without sacrificing AR objectives.
  • Hierarchical Caching
    • Block-level cache: Stores historical context representations across blocks.
    • Sub-block cache: Supports efficient parallel decoding within partially generated blocks.
  • Token Shift Mechanism
    Retains autoregressive characteristics while supporting bidirectional context within blocks.
  • Parallel Decoding Pipeline
    Achieves up to 2.5× speedup over standard AR decoding without compromising quality.

🚀 Fast-dLLM v2 uses only ~1B tokens for fine-tuning, a 500× reduction vs. full-attention diffusion LLMs (Dream: 580B tokens), while matching or surpassing AR baselines in accuracy.

[Figure: Generation Process]


🛠 Model Overview

  • Type: Block Diffusion Language Model (dLLM)
  • Base Model: Qwen/Qwen2.5-1.5B-Instruct
  • Architecture: Transformer w/ RoPE, SwiGLU, RMSNorm, Attention QKV bias, tied embeddings
  • Params: 1.54B (non-embedding: 1.31B)
  • Layers: 28
  • Attention Heads: 12 (Q), 2 (KV, GQA)
  • Key Feature: Parallel block-wise decoding + hierarchical caching
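The layer and head counts above determine how much memory the KV cache needs per token. The arithmetic below is a rough sketch: the head dimension of 128 and the 2-byte BF16 storage are assumptions taken from the Qwen2.5-1.5B base configuration, not figures stated in this overview, and the helper name is made up for illustration.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int
) -> int:
    """Bytes of KV cache needed for one token across all layers."""
    # Both keys and values are cached, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


LAYERS = 28      # from the overview
Q_HEADS = 12     # from the overview
KV_HEADS = 2     # from the overview (GQA)
HEAD_DIM = 128   # assumption: value from the Qwen2.5-1.5B config
BF16_BYTES = 2   # assumption: weights and cache stored in BF16

gqa_bytes = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, BF16_BYTES)
mha_bytes = kv_cache_bytes_per_token(LAYERS, Q_HEADS, HEAD_DIM, BF16_BYTES)

print(gqa_bytes)               # 28672 bytes (28 KiB) per token with GQA
print(mha_bytes)               # 172032 bytes per token if every Q head had its own KV head
print(mha_bytes // gqa_bytes)  # 6
```

Under these assumptions, GQA with 2 KV heads keeps the cache six times smaller than full multi-head attention would.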

📦 Installation

You will need transformers, torch, and numpy. Our custom generation function is loaded together with the model via trust_remote_code=True, as shown in the Quickstart:

pip install transformers torch numpy

🚀 Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Efficient-Large-Model/Fast_dLLM_v2_1.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

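# Fast-dLLM v2 parallel decoding (custom generate() loaded via trust_remote_code)
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,  # sub-block size used for parallel decoding
    threshold=0.9,       # confidence threshold for committing tokens in parallel
)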

response = tokenizer.decode(
    gen_ids[0][inputs["input_ids"].shape[1]:], 
    skip_special_tokens=True
)
print(response)
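The threshold argument suggests confidence-based parallel decoding: in each step, several masked positions can be finalized at once if the model is confident enough about them. The sketch below illustrates that idea under stated assumptions. The function name, the dictionary input, and the fall-back rule (always commit at least the most confident position so decoding makes progress) are hypothetical and are not taken from the model's generate() implementation.

```python
from __future__ import annotations


def select_positions_to_commit(
    confidences: dict[int, float], threshold: float
) -> list[int]:
    """Pick which still-masked positions to finalize in one parallel step.

    `confidences` maps a masked position to the model's top-1 probability
    for that position (hypothetical input format).
    """
    if not confidences:
        return []
    accepted = sorted(pos for pos, conf in confidences.items() if conf >= threshold)
    if not accepted:
        # Assumed fall-back: commit the single most confident position so
        # every step finalizes at least one token.
        accepted = [max(confidences, key=confidences.get)]
    return accepted


step_1 = {0: 0.97, 1: 0.62, 2: 0.93, 3: 0.40}
print(select_positions_to_commit(step_1, threshold=0.9))  # [0, 2]

step_2 = {1: 0.62, 3: 0.40}
print(select_positions_to_commit(step_2, threshold=0.9))  # [1]
```

In this sketch a lower threshold commits more tokens per step (faster, potentially less accurate), while a threshold of 1.0 degenerates to roughly one token per step.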

📊 Performance & Benchmarks

▶ Real-time Throughput

Fast-dLLM v2 offers up to 2.54× higher throughput than Qwen2.5-7B-Instruct, without loss in quality.

[Figure: Throughput Comparison]
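If you want to check throughput on your own hardware, a simple approach is to time repeated generations and divide the number of new tokens by the elapsed time. The helpers below are a generic sketch, not the benchmarking setup behind the numbers above. `tokens_per_second` and `speedup` are hypothetical names, and the stand-in generator only simulates work.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int], repeats: int = 3) -> float:
    """Time `repeats` calls of `generate`, which must return the number of
    newly generated tokens, and return the average tokens per second."""
    if repeats <= 0:
        raise ValueError("repeats must be positive")
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(repeats):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return candidate_tps / baseline_tps


def fake_generate() -> int:
    # Stand-in for a real model call: pretend 32 tokens took about 10 ms.
    time.sleep(0.01)
    return 32


print(f"{tokens_per_second(fake_generate):.0f} tokens/s (simulated)")
```

When timing a real model on a GPU, wrap the call so that it waits for the device to finish (for example with torch.cuda.synchronize()) before returning, and compare both models with the same prompts and max_new_tokens.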


🏆 Benchmark Results

We compare Fast-dLLM v2 against AR baselines and previous diffusion LLMs on diverse tasks:
HumanEval and MBPP (code), GSM8K and MATH (reasoning), IFEval (instruction following), and MMLU and GPQA (knowledge QA).

  • 1B group: Fast-dLLM v2 (1.5B) achieves the best average score: 45.0.
  • 7B group: Fast-dLLM v2 (7B) achieves the best average score: 60.3, surpassing the LLaDA and Dream models.

[Figure: Benchmark Results]


📜 Citation

If you use Fast-dLLM v2 in your research or products, please cite:

@misc{wu2025fastdllmv2efficientblockdiffusion,
      title={Fast-dLLM v2: Efficient Block-Diffusion LLM}, 
      author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
      year={2025},
      eprint={2509.26328},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.26328}, 
}

📄 License

Released under Apache 2.0, following the base Qwen2.5 license.


🔗 Resources
