πŸš€ dParallel: Learnable Parallel Decoding for dLLMs

Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
xML Lab, National University of Singapore

πŸ’‘ Introduction

We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling. We identify that the key bottleneck to parallel decoding is the sequential convergence of certainty across masked tokens. Building on this insight, we introduce the core of our approach: certainty-forcing distillation, a novel training strategy that distills the model to follow its original sampling trajectories while forcing it to reach high certainty on masked tokens more quickly and in parallel. Extensive experiments across various benchmarks demonstrate that our method dramatically reduces the number of decoding steps while maintaining performance. When applied to the LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP benchmark, it cuts decoding steps from 256 to 24, a 10.5x speedup with no loss in accuracy.


Overview of the proposed certainty-forcing distillation.
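
To make the idea concrete, below is a minimal, hypothetical sketch of what a certainty-forcing objective could look like: the student is supervised with token targets recorded from the model's own pre-distillation sampling trajectory, plus an entropy penalty that pushes still-masked positions toward high certainty early. The names (certainty_forcing_loss, trajectory_targets, masked_positions) are illustrative and this is not the released training code.

import torch.nn.functional as F

def certainty_forcing_loss(student_logits, trajectory_targets, masked_positions, entropy_weight=0.1):
    """Illustrative objective: stay on the original sampling trajectory while
    pushing masked positions toward high certainty (low entropy) early.

    student_logits:     (B, L, V) logits for the current partially masked input
    trajectory_targets: (B, L)    token ids recorded from the original decoding trajectory
    masked_positions:   (B, L)    bool mask of positions that are still [MASK]
    """
    logits = student_logits[masked_positions]        # (N, V)
    targets = trajectory_targets[masked_positions]   # (N,)
    # Trajectory-matching term: keep the distilled model on its original outputs
    ce = F.cross_entropy(logits, targets)
    # Certainty term: penalize predictive entropy so certainty is reached in fewer steps
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    return ce + entropy_weight * entropy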

πŸ’» Models and Datasets

πŸ“„ Paper: ArXiv-Link
πŸ€– LLaDA Model: dParallel-LLaDA-8B-instruct
πŸ€– Dream Model: dParallel-Dream-7B-instruct
πŸ“Š LLaDA Data: dParallel-LLaDA-Distill Dataset
πŸ“Š Dream Data: dParallel-Dream-Distill Dataset

πŸš€ Quick Start

import torch
from transformers import AutoModel, AutoTokenizer
import types

# Load the distilled dParallel-Dream model and its tokenizer
model_path = "Zigeng/dParallel_Dream_7B_Instruct"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.to("cuda").eval()

# Attach the block-wise parallel decoding routines from this repo to the model instance
from model.generation_utils_semiar import DreamGenerationMixin
model.diffusion_generate = types.MethodType(DreamGenerationMixin.diffusion_generate, model)
model._sample = types.MethodType(DreamGenerationMixin._sample, model)


messages = [
    {"role": "user", "content": "Toulouse has twice as many sheep as Charleston. Charleston has 4 times as many sheep as Seattle. How many sheep do Toulouse, Charleston, and Seattle have together if Seattle has 20 sheep? Let's think step by step."}
]

# Build the chat prompt and tokenize it
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", return_dict=True, add_generation_prompt=True
)
input_ids = inputs.input_ids.to(device="cuda")
attention_mask = inputs.attention_mask.to(device="cuda")

# Parallel diffusion decoding: masked tokens in each block whose certainty passes the
# threshold are committed in the same step, so the number of forward passes (nfe)
# can be far smaller than `steps`
output, nfe = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=256,
    output_history=False,
    return_dict_in_generate=True,
    steps=256,                  # upper bound on denoising steps
    temperature=0.,             # greedy decoding
    top_p=None,
    alg="entropy_threshold",    # certainty-based parallel unmasking
    alg_temp=0.1,
    top_k=None,
    block_length=32,            # block size for block-wise decoding
    threshold=0.5,              # certainty threshold for committing tokens in parallel
)

# Decode only the newly generated tokens (skip the prompt) and cut at the first EOS
generations = [
    tokenizer.decode(g[len(p):].tolist())
    for p, g in zip(input_ids, output.sequences)
]

print(generations[0].split(tokenizer.eos_token)[0])
print("NFE:", nfe)  # number of model forward passes actually used

πŸ“– Experimental Results

Results on LLaDA-8B-Instruct:

[Figure: llada-exp]

Results on Dream-7B-Instruct:

[Figure: dream-exp]

Better Speed-Accuracy Trade-off:

[Figure: trade-off]

β˜€οΈ Acknowledgement

Our code builds on LLaDA, Dream, Fast-dLLM, and dKV-Cache; we are grateful to these great works for laying the groundwork that made our approach possible.

Citation

If our research assists your work, please give us a star ⭐ or cite us using:

@article{chen2025dparallel,
  title={dParallel: Learnable Parallel Decoding for dLLMs},
  author={Chen, Zigeng and Fang, Gongfan and Ma, Xinyin and Yu, Ruonan and Wang, Xinchao},
  journal={arXiv preprint arXiv:2509.26488},
  year={2025}
}