---
title: LLM KV Cache Calculator
emoji: 💻
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.45.0
app_file: app.py
pinned: false
short_description: Calculate KV cache memory requirements for LLMs
---

# KV Cache Calculator

Calculate KV cache memory requirements for transformer models.
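For standard MHA/GQA models, the cache stores one key and one value vector per token, per layer, per KV head. Below is a minimal sketch of the arithmetic behind the estimate; the actual logic lives in `app.py` and may differ in detail:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, num_users: int,
                   bytes_per_elem: float = 2.0) -> float:
    """Estimate KV cache size in bytes for an MHA/GQA model.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_elem is 2 for fp16/bf16, 1 for fp8, and 0.5 for fp4.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * num_users * bytes_per_elem


# Example: a Llama-3-8B-style config (32 layers, 8 KV heads, head_dim 128)
# at 8192 context for a single user in bf16 comes to about 1.07 GB:
print(kv_cache_bytes(32, 8, 128, 8192, 1) / 1e9, "GB")
```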

## Credits

This implementation builds on the excellent original work by gaunernst. Special thanks!

## Features

- **Multi-attention support**: MHA (Multi-Head Attention), GQA (Grouped Query Attention), and MLA (Multi-head Latent Attention); see the sketch after this list
- **Multiple data types**: fp16/bf16, fp8, and fp4 quantization
- **Real-time calculation**: Instant memory requirement estimates
- **Model analysis**: Detailed breakdown of the model configuration
- **Universal compatibility**: Works with any HuggingFace transformer model that publishes a standard `config.json`
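The attention type changes what gets cached per token. A sketch assuming common HuggingFace `config.json` field names; the MLA branch follows DeepSeek-style configs (`kv_lora_rank`, `qk_rope_head_dim`):

```python
def cache_width_per_token_per_layer(cfg: dict) -> int:
    """Number of cached elements per token, per layer, by attention type."""
    if "kv_lora_rank" in cfg:
        # MLA caches one compressed latent plus a shared RoPE key per token,
        # instead of full per-head keys and values.
        return cfg["kv_lora_rank"] + cfg["qk_rope_head_dim"]
    # GQA caches K and V for each KV head; MHA is the special case
    # where num_key_value_heads == num_attention_heads.
    num_kv_heads = cfg.get("num_key_value_heads", cfg["num_attention_heads"])
    head_dim = cfg.get("head_dim") or cfg["hidden_size"] // cfg["num_attention_heads"]
    return 2 * num_kv_heads * head_dim


# Bytes per cached element for the supported precisions:
BYTES_PER_ELEM = {"fp16/bf16": 2.0, "fp8": 1.0, "fp4": 0.5}
```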

## Usage

1. Enter your model ID (e.g., "Qwen/Qwen3-30B-A3B")
2. Set the context length and number of users
3. Choose the data type precision
4. Add a HuggingFace token if needed for gated models
5. Click **Calculate** to get the memory requirements (a programmatic sketch of the same computation follows below)
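For reference, the same estimate can be reproduced outside the UI. A hedged sketch that mirrors the steps above but is not necessarily the exact code in `app.py`; it assumes an MHA/GQA model with standard config fields:

```python
from transformers import AutoConfig

model_id = "Qwen/Qwen3-30B-A3B"
cfg = AutoConfig.from_pretrained(model_id)  # add token="hf_..." for gated models

num_kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads

context_len, num_users, bytes_per_elem = 32768, 1, 2.0  # bf16
gb = (2 * cfg.num_hidden_layers * num_kv_heads * head_dim
      * context_len * num_users * bytes_per_elem) / 1e9
print(f"{model_id}: ~{gb:.2f} GB of KV cache")
```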