Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications
Paper: arXiv:2601.01718
⚠️ Bleeding Edge: This GGUF requires a custom llama.cpp build with Yuan3.0 support. Not yet in mainstream llama.cpp.
See Links below for the custom branch and Docker images.
GGUF quantized versions of YuanLabAI/Yuan3.0-Flash, a 40B parameter multimodal MoE model (~3.7B activated).
| Attribute | Value |
|---|---|
| Base Model | YuanLabAI/Yuan3.0-Flash |
| Architecture | MoE (256 experts, 8 activated + 1 shared) |
| Total Parameters | 40B |
| Activated Parameters | ~3.7B |
| Context Length | 128K |
| Input Modality | Text + Images |
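As a back-of-envelope check, the quoted totals let you solve for how the 40B parameters split between the always-active dense part (attention, embeddings, router) and the expert pool. This is illustrative only and assumes the shared expert is the same size as a routed one:

```python
# Back-of-envelope: what the quoted figures imply about the parameter split.
# With 256 experts and 8 routed + 1 shared active per token, only a 9/256
# slice of the expert pool is used per token; the rest is always on.
total = 40e9
activated = 3.7e9
active_frac = (8 + 1) / 256  # fraction of expert weights used per token

# activated = dense + (total - dense) * active_frac  ->  solve for dense
dense = (activated - total * active_frac) / (1 - active_frac)
experts = total - dense
print(f"implied dense (always-on) params: ~{dense / 1e9:.1f}B")
print(f"implied expert-pool params:       ~{experts / 1e9:.1f}B")
```

So roughly 2.4B parameters sit outside the experts, consistent with a very sparse routing ratio.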
| Quantization | Size | Use Case |
|---|---|---|
| F16 (3 shards) | ~77GB | Full precision |
| Q4_K_M | ~23GB | Good balance of speed/quality |
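The file sizes roughly follow from bits per weight. Q4_K_M averages about 4.8 bpw across its mixed 4/6-bit blocks; the exact on-disk number varies with metadata and per-tensor quantization choices, so this is only a sanity check:

```python
# Quick check that the table's file sizes line up with bits-per-weight.
params = 40e9

def gguf_size_gb(bits_per_weight: float) -> float:
    """Approximate GGUF size in GB, ignoring metadata overhead."""
    return params * bits_per_weight / 8 / 1e9

print(f"F16:    ~{gguf_size_gb(16):.0f} GB")   # vs ~77 GB on disk
print(f"Q4_K_M: ~{gguf_size_gb(4.8):.0f} GB")  # vs ~23 GB on disk
```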
```bash
# Pull the latest image
docker pull ghcr.io/qades/llama.cpp:latest

# Run with GPU
docker run --gpus all -v /home/mk/yuan/Yuan3.0-Flash-GGUF:/model \
  ghcr.io/qades/llama.cpp:latest \
  ./llama-cli -m /model/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj /model/mmproj-Yuan3.0-Flash-f16.gguf \
  -c 131072 -n 4096 --temp 0.7

# Or use the OpenAI-compatible server
docker run --gpus all -p 8080:8080 -v /home/mk/yuan/Yuan3.0-Flash-GGUF:/model \
  ghcr.io/qades/llama.cpp:latest \
  ./llama-server -m /model/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj /model/mmproj-Yuan3.0-Flash-f16.gguf -c 131072
```
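Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using only the Python standard library (host/port taken from the docker command above; llama-server typically ignores the `model` field and serves whatever model it was started with):

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Payload for POST /v1/chat/completions."""
    return {
        "model": "Yuan3.0-Flash-Q4_K_M",  # informational; server uses its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Send one chat turn and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# usage (requires the server from the docker example to be running):
# print(chat("Explain quantum computing in simple terms"))
```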
Available tags:

- `latest` - Main branch
- `yuan3_0` - Yuan3.0 specific branch
- `sha-XXXXXX` - Specific commits

```bash
# Clone the custom branch
git clone https://github.com/QaDeS/llama.cpp.git ~/llama.cpp
cd ~/llama.cpp
git checkout yuan3_0
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

# Run
./build/bin/llama-cli -m ../Yuan3.0-Flash-GGUF/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj ../Yuan3.0-Flash-GGUF/mmproj-Yuan3.0-Flash-f16.gguf \
  -c 131072 -n 4096 --temp 0.7
```
Build from the custom branch, then:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Yuan3.0-Flash-Q4_K_M.gguf",
    mmproj_path="mmproj-Yuan3.0-Flash-f16.gguf",
    n_ctx=131072,
    n_gpu_layers=-1,  # offload all layers to GPU
)

# Text-only
output = llm("Explain quantum computing in simple terms")

# With image
from llama_cpp import LlamaVision

llm = LlamaVision(
    model_path="Yuan3.0-Flash-Q4_K_M.gguf",
    mmproj_path="mmproj-Yuan3.0-Flash-f16.gguf",
)
output = llm([
    {"type": "image", "image": "photo.jpg"},
    {"type": "text", "text": "What do you see?"},
])
```
Create a `Modelfile` (Ollama's context option is `num_ctx`):

```
FROM ./Yuan3.0-Flash-Q4_K_M.gguf
PARAMETER mmproj ./mmproj-Yuan3.0-Flash-f16.gguf
PARAMETER num_ctx 131072
PARAMETER temperature 0.7
```

```bash
ollama create yuan3.0-flash -f Modelfile
ollama run yuan3.0-flash
```
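With the model created, Ollama's HTTP API can also be queried directly. A minimal sketch assuming Ollama's default port 11434 and the model name created above:

```python
import json
import urllib.request

def build_generate_request(prompt: str) -> dict:
    """Payload for Ollama's POST /api/generate (non-streaming)."""
    return {
        "model": "yuan3.0-flash",
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": 131072, "temperature": 0.7},
    }

def generate(prompt: str) -> str:
    """Send one prompt and return the completed response text."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_generate_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# usage (requires `ollama serve` to be running):
# print(generate("Explain quantum computing in simple terms"))
```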
vLLM serves the original HuggingFace weights rather than the GGUF files:

```bash
vllm serve YuanLabAI/Yuan3.0-Flash \
  --dtype half \
  --max-model-len 131072
```
| Quantization | RAM/VRAM |
|---|---|
| F16 | ~80GB |
| Q4_K_M | ~24GB |
| Q4_K_M + CPU offload | ~8GB VRAM |
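For partial offload, a rough heuristic is to size `-ngl` (number of GPU layers) from the per-layer weight cost. The layer count below is a placeholder assumption, not read from this model — check the GGUF metadata for the real value:

```python
# Rough heuristic for picking `-ngl` when partially offloading Q4_K_M.
# n_layers is a placeholder; read the true count from the GGUF metadata.
def suggest_ngl(vram_gb: float, model_gb: float = 24.0,
                n_layers: int = 48, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, keeping headroom
    for the KV cache and compute buffers."""
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. on an 8 GB card (under these assumed sizes):
print(suggest_ngl(8.0))  # -> 13 layers on GPU, rest on CPU
```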
Add `--threads 8` for CPU inference.

If you use this GGUF conversion:
```bibtex
@software{yuan3.0flash_gguf,
  title  = {Yuan3.0-Flash-GGUF},
  author = {Michael Klaus},
  year   = {2025},
  url    = {https://huggingface.co/YuanLabAI/Yuan3.0-Flash}
}
```
Original model:
```bibtex
@misc{yuan3.0flash,
  title         = {Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications},
  author        = {YuanLab AI},
  year          = {2025},
  eprint        = {2601.01718},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```
```
Yuan3.0-Flash-GGUF/
├── mmproj-Yuan3.0-Flash-f16.gguf           # Vision projector (~744MB)
├── Yuan3.0-Flash-f16-00001-of-00003.gguf   # F16 shard 1 (~21GB)
├── Yuan3.0-Flash-f16-00002-of-00003.gguf   # F16 shard 2 (~27GB)
├── Yuan3.0-Flash-f16-00003-of-00003.gguf   # F16 shard 3 (~29GB)
└── Yuan3.0-Flash-Q4_K_M.gguf               # Q4_K_M quantization (~23GB)
```
Converted using llama.cpp (QaDeS branch)
| Resource | URL |
|---|---|
| Custom llama.cpp branch | github.com/QaDeS/llama.cpp/tree/yuan3_0 |
| Docker images | ghcr.io/qades/llama.cpp |
| Base model (HuggingFace) | YuanLabAI/Yuan3.0-Flash |
| Local llama.cpp build | ~/llama.cpp |
| Original model paper | arXiv:2601.01718 |