---
title: LLM KV Cache Calculator
emoji: 💻
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.45.0
app_file: app.py
pinned: false
short_description: Calculate KV cache memory requirements for LLMs
---

# KV Cache Calculator

Calculate KV cache memory requirements for transformer models.
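For standard MHA/GQA models, the cache stores one key and one value vector per token, per layer, per KV head. Below is a minimal sketch of the arithmetic behind the estimate; the actual logic lives in `app.py` and may differ in detail:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, num_users: int,
                   bytes_per_elem: float = 2.0) -> float:
    """Estimate KV cache size in bytes for an MHA/GQA model.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_elem is 2 for fp16/bf16, 1 for fp8, and 0.5 for fp4.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * num_users * bytes_per_elem


# Example: a Llama-3-8B-style config (32 layers, 8 KV heads, head_dim 128)
# at 8192 context for a single user in bf16 comes to about 1.07 GB:
print(kv_cache_bytes(32, 8, 128, 8192, 1) / 1e9, "GB")
```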

## Credits

This implementation builds on the excellent original work by gaunernst. Special thanks!

## Features

- **Multi-attention support**: MHA (Multi-Head Attention), GQA (Grouped Query Attention), and MLA (Multi-head Latent Attention); see the sketch after this list
- **Multiple data types**: fp16/bf16, fp8, and fp4 quantization
- **Real-time calculation**: Instant memory requirement estimates
- **Model analysis**: Detailed breakdown of the model configuration
- **Universal compatibility**: Works with any HuggingFace transformer model that publishes a standard `config.json`
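The attention type changes what gets cached per token. A sketch assuming common HuggingFace `config.json` field names; the MLA branch follows DeepSeek-style configs (`kv_lora_rank`, `qk_rope_head_dim`):

```python
def cache_width_per_token_per_layer(cfg: dict) -> int:
    """Number of cached elements per token, per layer, by attention type."""
    if "kv_lora_rank" in cfg:
        # MLA caches one compressed latent plus a shared RoPE key per token,
        # instead of full per-head keys and values.
        return cfg["kv_lora_rank"] + cfg["qk_rope_head_dim"]
    # GQA caches K and V for each KV head; MHA is the special case
    # where num_key_value_heads == num_attention_heads.
    num_kv_heads = cfg.get("num_key_value_heads", cfg["num_attention_heads"])
    head_dim = cfg.get("head_dim") or cfg["hidden_size"] // cfg["num_attention_heads"]
    return 2 * num_kv_heads * head_dim


# Bytes per cached element for the supported precisions:
BYTES_PER_ELEM = {"fp16/bf16": 2.0, "fp8": 1.0, "fp4": 0.5}
```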

## Usage

1. Enter your model ID (e.g., "Qwen/Qwen3-30B-A3B")
2. Set the context length and number of users
3. Choose the data type precision
4. Add a HuggingFace token if needed for gated models
5. Click **Calculate** to get the memory requirements (a programmatic sketch of the same computation follows below)
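For reference, the same estimate can be reproduced outside the UI. A hedged sketch that mirrors the steps above but is not necessarily the exact code in `app.py`; it assumes an MHA/GQA model with standard config fields:

```python
from transformers import AutoConfig

model_id = "Qwen/Qwen3-30B-A3B"
cfg = AutoConfig.from_pretrained(model_id)  # add token="hf_..." for gated models

num_kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads

context_len, num_users, bytes_per_elem = 32768, 1, 2.0  # bf16
gb = (2 * cfg.num_hidden_layers * num_kv_heads * head_dim
      * context_len * num_users * bytes_per_elem) / 1e9
print(f"{model_id}: ~{gb:.2f} GB of KV cache")
```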