arxiv:2510.14973

Attention Is All You Need for KV Cache in Diffusion LLMs

Published on Oct 16
· Submitted by Mukul Ranjan on Oct 17
Authors:

Abstract

Elastic-Cache optimizes key-value cache management in diffusion large language models to reduce decoding latency without sacrificing prediction accuracy.

AI-generated summary

This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: 8.7× on GSM8K (256 tokens), 45.1× on longer sequences, and 4.8× on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput (6.8× on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
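
As a rough illustration of the "when to refresh" test described above, the following Python/PyTorch sketch checks drift only on the most-attended token. The function names, tensor shapes, and the 0.05 threshold are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the attention-aware "when to refresh" test (illustrative only).
import torch


def most_attended_index(attn_weights: torch.Tensor) -> int:
    """Index of the token receiving the most attention mass.

    attn_weights: [num_heads, query_len, key_len] for one layer at one step.
    """
    received = attn_weights.sum(dim=(0, 1))  # attention received per key position
    return int(received.argmax().item())


def kv_drift(cached_kv: torch.Tensor, fresh_kv: torch.Tensor) -> float:
    """Relative L2 change of a single token's KV state between steps."""
    return ((fresh_kv - cached_kv).norm() / (cached_kv.norm() + 1e-8)).item()


def should_refresh(cached_kv_layer: torch.Tensor,
                   fresh_kv_layer: torch.Tensor,
                   attn_weights: torch.Tensor,
                   drift_threshold: float = 0.05) -> bool:
    """Refresh only when even the most-attended (most stable) token has drifted.

    Observation (3) above: this token's drift lower-bounds the drift of the
    other tokens, so it gives a conservative trigger for a cache update.
    """
    idx = most_attended_index(attn_weights)
    return kv_drift(cached_kv_layer[idx], fresh_kv_layer[idx]) > drift_threshold
```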

Community


🚀 Attention Is All You Need for KV Cache in Diffusion LLMs 🚀

Making Diffusion LLMs Practical! We introduce Elastic-Cache, the first adaptive, layer-aware KV caching strategy for diffusion language models, achieving massive speedups without sacrificing generation quality.

🚀 Intelligent Cache Updates: Adaptively decides when to refresh (attention-aware drift detection) and where to refresh (depth-selective updates), eliminating redundant computation across denoising steps (see the sketch after this list).

🚀🚀 Exceptional Speedups: Achieves 8.7× faster inference on GSM8K, 45.1× on longer sequences, and 4.8× on HumanEval, while maintaining or even improving accuracy compared to baselines.

🚀🚀🔥 Training-Free & Universal: Works out-of-the-box with any diffusion LLM architecture. No retraining needed, just plug and play!
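
The "where to refresh" side can be pictured as below. This is a hypothetical sketch: the layer interface (`forward_with_cache`, `forward_recompute`) is assumed for illustration and is not a real diffusion-LLM API.

```python
# Hypothetical "where to refresh" schedule: reuse shallow-layer caches,
# recompute from `refresh_from` onward. The layer methods below are
# assumed for illustration, not a real API.
def denoise_step(model, hidden, kv_cache, refresh_from: int):
    for depth, layer in enumerate(model.layers):
        if depth < refresh_from:
            # Shallow layers: KV states drift little, so reuse the cached KV.
            hidden = layer.forward_with_cache(hidden, kv_cache[depth])
        else:
            # Deeper layers: recompute QKV and overwrite the stale cache.
            hidden, kv_cache[depth] = layer.forward_recompute(hidden)
    return hidden, kv_cache
```

In practice, `refresh_from` would be chosen per denoising step from the attention-aware drift test sketched above, so shallow layers keep their caches until even the most stable token has drifted.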

🔗 Paper: https://arxiv.org/abs/2510.14973
🔗 Project page: https://vila-lab.github.io/elastic-cache-webpage/

