# Yuan3.0-Flash-GGUF

⚠️ **Bleeding edge:** this GGUF requires a custom llama.cpp build with Yuan3.0 support, which has not yet been merged into mainstream llama.cpp.

See Links below for the custom branch and Docker images.

GGUF quantized versions of YuanLabAI/Yuan3.0-Flash, a 40B parameter multimodal MoE model (~3.7B activated).

## Model Overview

| Attribute | Value |
|---|---|
| Base Model | YuanLabAI/Yuan3.0-Flash |
| Architecture | MoE (256 experts, 8 activated + 1 shared) |
| Total Parameters | 40B |
| Activated Parameters | ~3.7B |
| Context Length | 128K |
| Input Modality | Text + Images |
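As a rough sanity check on the MoE numbers above (the breakdown below is an illustration from the table's figures, not the model's actual config):

```python
# Cross-check the table: 8 routed + 1 shared experts active out of 256,
# ~3.7B of 40B parameters active per token. The gap between the two
# fractions is the dense share (attention, embeddings, shared expert)
# that runs for every token.
total_params = 40e9        # total parameters (from the table)
active_params = 3.7e9      # activated parameters (from the table)
experts_total = 256
experts_active = 8 + 1     # 8 routed + 1 shared

active_fraction = active_params / total_params   # ~9.2% of weights
expert_fraction = experts_active / experts_total # ~3.5% of experts

print(f"active weight fraction: {active_fraction:.1%}")
print(f"active expert fraction: {expert_fraction:.1%}")
```

This is why the model decodes at roughly the speed of a ~4B dense model despite its 40B memory footprint.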

## Available Quantizations

| Quantization | Size | Use Case |
|---|---|---|
| F16 (3 shards) | ~77GB | Full precision |
| Q4_K_M | ~23GB | Good balance of speed/quality |
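The sizes follow from the average bits per weight of each format. A back-of-the-envelope sketch (the ~4.8 bpw figure for Q4_K_M is an approximate average; the exact file size depends on llama.cpp's per-tensor quant mix):

```python
# Estimate GGUF file size from parameter count and average bits/weight.
def estimate_gguf_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(estimate_gguf_gb(40e9, 16))    # F16    -> 80.0 GB
print(estimate_gguf_gb(40e9, 4.8))   # Q4_K_M -> 24.0 GB
```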

## Quickstart

### Option 1: Pre-built Docker Image (Recommended)

```bash
# Pull the latest image
docker pull ghcr.io/qades/llama.cpp:latest

# Run with GPU (adjust the host path to wherever you downloaded the GGUF files)
docker run --gpus all -v /path/to/Yuan3.0-Flash-GGUF:/model \
  ghcr.io/qades/llama.cpp:latest \
  ./llama-cli -m /model/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj /model/mmproj-Yuan3.0-Flash-f16.gguf \
  -c 131072 -n 4096 --temp 0.7

# Or use the OpenAI-compatible server
docker run --gpus all -p 8080:8080 -v /path/to/Yuan3.0-Flash-GGUF:/model \
  ghcr.io/qades/llama.cpp:latest \
  ./llama-server -m /model/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj /model/mmproj-Yuan3.0-Flash-f16.gguf -c 131072
```

Available tags:

- `latest`: main branch
- `yuan3_0`: Yuan3.0-specific branch
- `sha-XXXXXX`: specific commits
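Once `llama-server` is up, it speaks the OpenAI chat-completions protocol. A minimal sketch of a request payload, assuming the host and port from the command above (send it with curl or any OpenAI-compatible client):

```python
import json

# Build a chat request for llama-server's OpenAI-compatible endpoint.
# Host/port follow the docker command above; tune max_tokens/temperature
# to taste.
payload = {
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 512,
}
body = json.dumps(payload)

# Equivalent curl invocation:
print("curl http://localhost:8080/v1/chat/completions \\")
print("  -H 'Content-Type: application/json' \\")
print(f"  -d '{body}'")
```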

### Option 2: Build from Source

```bash
# Clone the custom branch
git clone -b yuan3_0 https://github.com/QaDeS/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```

```bash
# Run
./build/bin/llama-cli -m ../Yuan3.0-Flash-GGUF/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj ../Yuan3.0-Flash-GGUF/mmproj-Yuan3.0-Flash-f16.gguf \
  -c 131072 -n 4096 --temp 0.7
```

### Option 3: llama-cpp-python

Build from the custom branch, then:

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Text-only
llm = Llama(
    model_path="Yuan3.0-Flash-Q4_K_M.gguf",
    n_ctx=131072,
    n_gpu_layers=-1,
)
output = llm("Explain quantum computing in simple terms")

# With image: the vision projector is loaded through a chat handler.
# (Llava15ChatHandler is llama-cpp-python's generic multimodal handler;
# the custom branch may ship a Yuan-specific one.)
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-Yuan3.0-Flash-f16.gguf")
llm = Llama(
    model_path="Yuan3.0-Flash-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=131072,
    n_gpu_layers=-1,
)
output = llm.create_chat_completion(messages=[
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "file://photo.jpg"}},
        {"type": "text", "text": "What do you see?"},
    ]},
])
```

### Option 4: Ollama

Create a `Modelfile` (the second `FROM` line attaches the vision projector):

```
FROM ./Yuan3.0-Flash-Q4_K_M.gguf
FROM ./mmproj-Yuan3.0-Flash-f16.gguf
PARAMETER num_ctx 131072
PARAMETER temperature 0.7
```

Then create and run the model:

```bash
ollama create yuan3.0-flash -f Modelfile
ollama run yuan3.0-flash
```

### vLLM (FP16 recommended for GPU)

Note that vLLM serves the original HuggingFace weights rather than these GGUF files:

```bash
vllm serve YuanLabAI/Yuan3.0-Flash \
    --dtype half \
    --max-model-len 131072
```

## Memory Requirements

| Quantization | RAM/VRAM |
|---|---|
| F16 | ~80GB |
| Q4_K_M | ~24GB |
| Q4_K_M + CPU offload | ~8GB VRAM |
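The CPU-offload figure comes from keeping only part of the model on the GPU via `--n-gpu-layers`. A rough sketch of the arithmetic (the layer count below is a placeholder, not the real Yuan3.0-Flash config, and the fixed overhead term is an assumption):

```python
# Estimate VRAM when offloading only some layers to the GPU; layers not
# on the GPU stay in system RAM. Per-layer size ~= model size / n_layers.
def vram_for_offload(model_gb: float, n_layers: int, gpu_layers: int,
                     overhead_gb: float = 1.0) -> float:
    per_layer = model_gb / n_layers
    return gpu_layers * per_layer + overhead_gb

# e.g. the ~24 GB Q4_K_M split across a hypothetical 48 layers:
print(vram_for_offload(24.0, 48, 14))  # ~8 GB VRAM, rest in RAM
```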

## Performance Notes

- Context length tested up to 131K tokens
- Vision encoding adds ~0.5s per image
- Recommended: `--threads 8` for CPU inference

## Citation

If you use this GGUF conversion:

```bibtex
@software{yuan3.0flash_gguf,
  title = {Yuan3.0-Flash-GGUF},
  author = {Michael Klaus},
  year = {2025},
  url = {https://huggingface.co/YuanLabAI/Yuan3.0-Flash}
}
```

Original model:

```bibtex
@misc{yuan3.0flash,
  title = {Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications},
  author = {YuanLab AI},
  year = {2025},
  eprint = {2601.01718},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```

## Files

```
Yuan3.0-Flash-GGUF/
├── mmproj-Yuan3.0-Flash-f16.gguf          # Vision projector (~744MB)
├── Yuan3.0-Flash-f16-00001-of-00003.gguf  # F16 shard 1 (~21GB)
├── Yuan3.0-Flash-f16-00002-of-00003.gguf  # F16 shard 2 (~27GB)
├── Yuan3.0-Flash-f16-00003-of-00003.gguf  # F16 shard 3 (~29GB)
└── Yuan3.0-Flash-Q4_K_M.gguf              # Q4_K_M quantization (~23GB)
```
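llama.cpp only needs the first shard on the command line and reads the rest automatically, so a missing shard surfaces as a load error. A small hypothetical helper to verify a sharded set is complete before loading:

```python
import re

# Report which shards of an "-NNNNN-of-NNNNN.gguf" set are missing.
def missing_shards(filenames):
    pat = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")
    found, total = set(), None
    for name in filenames:
        m = pat.search(name)
        if m:
            found.add(int(m.group(1)))
            total = int(m.group(2))
    if total is None:          # no sharded files at all
        return []
    return sorted(set(range(1, total + 1)) - found)

files = ["Yuan3.0-Flash-f16-00001-of-00003.gguf",
         "Yuan3.0-Flash-f16-00003-of-00003.gguf"]
print(missing_shards(files))  # [2] -> shard 2 is missing
```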

Converted using llama.cpp (QaDeS branch)

## Links

| Resource | URL |
|---|---|
| Custom llama.cpp branch | github.com/QaDeS/llama.cpp/tree/yuan3_0 |
| Docker images | ghcr.io/qades/llama.cpp |
| Base model (HuggingFace) | YuanLabAI/Yuan3.0-Flash |
| Local llama.cpp build | ~/llama.cpp |
| Original model paper | arXiv:2601.01718 |