---
title: Multi-GGUF LLM Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Run GGUF models with llama.cpp
---

This Streamlit app enables **chat-based inference** on various GGUF models using `llama.cpp` and `llama-cpp-python`.

### 🔄 Supported Models:
- `Qwen/Qwen2.5-7B-Instruct-GGUF` → `qwen2.5-7b-instruct-q2_k.gguf`
- `unsloth/gemma-3-4b-it-GGUF` → `gemma-3-4b-it-Q4_K_M.gguf`
- `unsloth/Phi-4-mini-instruct-GGUF` → `Phi-4-mini-instruct-Q4_K_M.gguf`
- `MaziyarPanahi/Meta-Llama-3.1-8B-Instruct-GGUF` → `Meta-Llama-3.1-8B-Instruct.Q2_K.gguf`
- `unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF` → `DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf`
- `MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF` → `Mistral-7B-Instruct-v0.3.IQ3_XS.gguf`
- `Qwen/Qwen2.5-Coder-7B-Instruct-GGUF` → `qwen2.5-coder-7b-instruct-q2_k.gguf`

### ⚙️ Features:
- Model selection in the sidebar
- Customizable system prompt and generation parameters
- Chat-style UI with streaming responses
- **Markdown output rendering** for readable, styled output
- **DeepSeek-compatible `<think>` tag handling** — shows model reasoning in a collapsible expander

### 🧠 Memory-Safe Design (for HuggingFace Spaces):
- Loads only **one model at a time** to prevent memory bloat
- Unloads the previous model manually and calls `gc.collect()` to free memory when switching models
- Keeps the `n_ctx` context length modest to stay within a 16 GB RAM limit
- Automatically downloads models as needed
- Limits history to the **last 8 user-assistant turns** to prevent context overflow

Minimal implementation sketches for these patterns appear at the end of this README.

Ideal for deploying multiple GGUF chat models on **free-tier HuggingFace Spaces**!

Refer to the configuration guide at https://huggingface.co/docs/hub/spaces-config-reference
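
### 🛠️ Implementation Sketches

Below is a minimal sketch of the one-model-at-a-time loading pattern described above, not the exact code in `app.py`. The `_current_llm` handle, the `load_model` helper, and the `n_ctx=4096` value are illustrative assumptions; `hf_hub_download`, `Llama`, and `gc.collect()` are the real APIs involved.

```python
import gc

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Hypothetical module-level handle; app.py may track the loaded model
# differently (e.g. in st.session_state).
_current_llm = None

def load_model(repo_id: str, filename: str) -> Llama:
    """Download (if needed) and load a GGUF model, unloading the old one first."""
    global _current_llm
    if _current_llm is not None:
        _current_llm = None   # drop the reference to the previous model
        gc.collect()          # reclaim its memory before loading the next one
    model_path = hf_hub_download(repo_id=repo_id, filename=filename)
    _current_llm = Llama(
        model_path=model_path,
        n_ctx=4096,           # modest context length to fit a 16 GB RAM budget
    )
    return _current_llm
```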
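The chat loop keeps only the last 8 user-assistant turns and streams tokens into the UI as they arrive. This sketch assumes messages are stored as OpenAI-style role/content dicts and reuses the hypothetical `load_model` helper from the previous sketch; the `trim_history` name and the `"messages"` session-state key are also assumptions, while `create_chat_completion(stream=True)` is llama-cpp-python's actual chat API.

```python
import streamlit as st

def trim_history(messages: list[dict]) -> list[dict]:
    """Keep the system prompt plus only the last 8 user/assistant turns."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-16:]      # 8 turns = 16 user/assistant messages

llm = load_model("Qwen/Qwen2.5-7B-Instruct-GGUF", "qwen2.5-7b-instruct-q2_k.gguf")
placeholder, reply = st.empty(), ""
for chunk in llm.create_chat_completion(
    messages=trim_history(st.session_state.get("messages", [])),
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        reply += delta["content"]
        placeholder.markdown(reply)  # re-render the partial Markdown reply
```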
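For DeepSeek-R1-style output, the reasoning wrapped in `<think>...</think>` is split off into a collapsible expander while the visible answer is rendered as Markdown. A sketch of that behavior (the `render_response` function name is an assumption):

```python
import re

import streamlit as st

def render_response(text: str) -> None:
    """Show <think>...</think> reasoning in an expander, the rest as Markdown."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        with st.expander("🧠 Model reasoning"):
            st.markdown(match.group(1).strip())
        text = text[:match.start()] + text[match.end():]  # strip the think block
    st.markdown(text.strip())
```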