# Yuan3.0-Flash-GGUF

⚠️ **Bleeding edge:** this GGUF requires a custom llama.cpp build with Yuan3.0 support, which has not yet been merged into mainstream llama.cpp.

See Links below for the custom branch and Docker images.

GGUF quantized versions of YuanLabAI/Yuan3.0-Flash, a 40B parameter multimodal MoE model (~3.7B activated).

## Model Overview

| Attribute | Value |
|---|---|
| Base Model | YuanLabAI/Yuan3.0-Flash |
| Architecture | MoE (256 experts, 8 activated + 1 shared) |
| Total Parameters | 40B |
| Activated Parameters | ~3.7B |
| Context Length | 128K |
| Input Modality | Text + Images |
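As a rough sanity check on the MoE numbers above (the breakdown below is an illustration from the table's figures, not the model's actual config):

```python
# Cross-check the table: 8 routed + 1 shared experts active out of 256,
# ~3.7B of 40B parameters active per token. The gap between the two
# fractions is the dense share (attention, embeddings, shared expert)
# that runs for every token.
total_params = 40e9        # total parameters (from the table)
active_params = 3.7e9      # activated parameters (from the table)
experts_total = 256
experts_active = 8 + 1     # 8 routed + 1 shared

active_fraction = active_params / total_params   # ~9.2% of weights
expert_fraction = experts_active / experts_total # ~3.5% of experts

print(f"active weight fraction: {active_fraction:.1%}")
print(f"active expert fraction: {expert_fraction:.1%}")
```

This is why the model decodes at roughly the speed of a ~4B dense model despite its 40B memory footprint.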

## Available Quantizations

| Quantization | Size | Use Case |
|---|---|---|
| F16 (3 shards) | ~77GB | Full precision |
| Q4_K_M | ~23GB | Good balance of speed/quality |
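The sizes follow from the average bits per weight of each format. A back-of-the-envelope sketch (the ~4.8 bpw figure for Q4_K_M is an approximate average; the exact file size depends on llama.cpp's per-tensor quant mix):

```python
# Estimate GGUF file size from parameter count and average bits/weight.
def estimate_gguf_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(estimate_gguf_gb(40e9, 16))    # F16    -> 80.0 GB
print(estimate_gguf_gb(40e9, 4.8))   # Q4_K_M -> 24.0 GB
```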

## Quickstart

### Option 1: Pre-built Docker Image (Recommended)

```bash
# Pull the latest image
docker pull ghcr.io/qades/llama.cpp:latest

# Run with GPU (adjust the host path to wherever you downloaded the GGUF files)
docker run --gpus all -v /path/to/Yuan3.0-Flash-GGUF:/model \
  ghcr.io/qades/llama.cpp:latest \
  ./llama-cli -m /model/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj /model/mmproj-Yuan3.0-Flash-f16.gguf \
  -c 131072 -n 4096 --temp 0.7

# Or use the OpenAI-compatible server
docker run --gpus all -p 8080:8080 -v /path/to/Yuan3.0-Flash-GGUF:/model \
  ghcr.io/qades/llama.cpp:latest \
  ./llama-server -m /model/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj /model/mmproj-Yuan3.0-Flash-f16.gguf -c 131072
```

Available tags:

- `latest`: main branch
- `yuan3_0`: Yuan3.0-specific branch
- `sha-XXXXXX`: specific commits
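Once `llama-server` is up, it speaks the OpenAI chat-completions protocol. A minimal sketch of a request payload, assuming the host and port from the command above (send it with curl or any OpenAI-compatible client):

```python
import json

# Build a chat request for llama-server's OpenAI-compatible endpoint.
# Host/port follow the docker command above; tune max_tokens/temperature
# to taste.
payload = {
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 512,
}
body = json.dumps(payload)

# Equivalent curl invocation:
print("curl http://localhost:8080/v1/chat/completions \\")
print("  -H 'Content-Type: application/json' \\")
print(f"  -d '{body}'")
```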

### Option 2: Build from Source

```bash
# Clone the custom branch
git clone -b yuan3_0 https://github.com/QaDeS/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```

```bash
# Run
./build/bin/llama-cli -m ../Yuan3.0-Flash-GGUF/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj ../Yuan3.0-Flash-GGUF/mmproj-Yuan3.0-Flash-f16.gguf \
  -c 131072 -n 4096 --temp 0.7
```

### Option 3: llama-cpp-python

Build from the custom branch, then:

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Text-only
llm = Llama(
    model_path="Yuan3.0-Flash-Q4_K_M.gguf",
    n_ctx=131072,
    n_gpu_layers=-1,
)
output = llm("Explain quantum computing in simple terms")

# With image: the vision projector is loaded through a chat handler.
# (Llava15ChatHandler is llama-cpp-python's generic multimodal handler;
# the custom branch may ship a Yuan-specific one.)
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-Yuan3.0-Flash-f16.gguf")
llm = Llama(
    model_path="Yuan3.0-Flash-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=131072,
    n_gpu_layers=-1,
)
output = llm.create_chat_completion(messages=[
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "file://photo.jpg"}},
        {"type": "text", "text": "What do you see?"},
    ]},
])
```

### Option 4: Ollama

Create a `Modelfile` (the second `FROM` line attaches the vision projector):

```
FROM ./Yuan3.0-Flash-Q4_K_M.gguf
FROM ./mmproj-Yuan3.0-Flash-f16.gguf
PARAMETER num_ctx 131072
PARAMETER temperature 0.7
```

Then create and run the model:

```bash
ollama create yuan3.0-flash -f Modelfile
ollama run yuan3.0-flash
```

### vLLM (FP16 recommended for GPU)

Note that vLLM serves the original HuggingFace weights rather than these GGUF files:

```bash
vllm serve YuanLabAI/Yuan3.0-Flash \
    --dtype half \
    --max-model-len 131072
```

## Memory Requirements

| Quantization | RAM/VRAM |
|---|---|
| F16 | ~80GB |
| Q4_K_M | ~24GB |
| Q4_K_M + CPU offload | ~8GB VRAM |
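The CPU-offload figure comes from keeping only part of the model on the GPU via `--n-gpu-layers`. A rough sketch of the arithmetic (the layer count below is a placeholder, not the real Yuan3.0-Flash config, and the fixed overhead term is an assumption):

```python
# Estimate VRAM when offloading only some layers to the GPU; layers not
# on the GPU stay in system RAM. Per-layer size ~= model size / n_layers.
def vram_for_offload(model_gb: float, n_layers: int, gpu_layers: int,
                     overhead_gb: float = 1.0) -> float:
    per_layer = model_gb / n_layers
    return gpu_layers * per_layer + overhead_gb

# e.g. the ~24 GB Q4_K_M split across a hypothetical 48 layers:
print(vram_for_offload(24.0, 48, 14))  # ~8 GB VRAM, rest in RAM
```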

## Performance Notes

- Context length tested up to 131K tokens
- Vision encoding adds ~0.5s per image
- Recommended: `--threads 8` for CPU inference

## Citation

If you use this GGUF conversion:

```bibtex
@software{yuan3.0flash_gguf,
  title = {Yuan3.0-Flash-GGUF},
  author = {Michael Klaus},
  year = {2025},
  url = {https://huggingface.co/YuanLabAI/Yuan3.0-Flash}
}
```

Original model:

```bibtex
@misc{yuan3.0flash,
  title = {Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications},
  author = {YuanLab AI},
  year = {2025},
  eprint = {2601.01718},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```

## Files

```
Yuan3.0-Flash-GGUF/
├── mmproj-Yuan3.0-Flash-f16.gguf          # Vision projector (~744MB)
├── Yuan3.0-Flash-f16-00001-of-00003.gguf  # F16 shard 1 (~21GB)
├── Yuan3.0-Flash-f16-00002-of-00003.gguf  # F16 shard 2 (~27GB)
├── Yuan3.0-Flash-f16-00003-of-00003.gguf  # F16 shard 3 (~29GB)
└── Yuan3.0-Flash-Q4_K_M.gguf              # Q4_K_M quantization (~23GB)
```
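llama.cpp only needs the first shard on the command line and reads the rest automatically, so a missing shard surfaces as a load error. A small hypothetical helper to verify a sharded set is complete before loading:

```python
import re

# Report which shards of an "-NNNNN-of-NNNNN.gguf" set are missing.
def missing_shards(filenames):
    pat = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")
    found, total = set(), None
    for name in filenames:
        m = pat.search(name)
        if m:
            found.add(int(m.group(1)))
            total = int(m.group(2))
    if total is None:          # no sharded files at all
        return []
    return sorted(set(range(1, total + 1)) - found)

files = ["Yuan3.0-Flash-f16-00001-of-00003.gguf",
         "Yuan3.0-Flash-f16-00003-of-00003.gguf"]
print(missing_shards(files))  # [2] -> shard 2 is missing
```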

Converted using llama.cpp (QaDeS branch)

## Links

| Resource | URL |
|---|---|
| Custom llama.cpp branch | github.com/QaDeS/llama.cpp/tree/yuan3_0 |
| Docker images | ghcr.io/qades/llama.cpp |
| Base model (HuggingFace) | YuanLabAI/Yuan3.0-Flash |
| Local llama.cpp build | ~/llama.cpp |
| Original model paper | arXiv:2601.01718 |