
Ling-2.6-flash-MXFP4
~103B-A8B hybrid MoE — 63 GB on disk (down from the 200 GB bf16 source) —
stock 4-bit affine quantization on inclusionAI's Bailing-V2.5 hybrid
architecture. Loads via mlx_lm.load() with the bailing_hybrid model
class — no TurboQuant runtime, no sidecar required.
- Source: inclusionAI/Ling-2.6-flash (Ant Group's Bailing-V2.5 hybrid: 32 layers MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
- Quantization: MXFP4 — every weight (routed experts, attention, shared experts, dense MLP, embed, lm_head) at 4-bit affine group_size=32. Norms, router gates, expert biases, and slopes stay fp16/fp32 passthrough (a conversion sketch follows this list).
- Bundle size: 63 GB on-disk across 51 shards
- Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
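The recipe above can be approximated with stock mlx_lm conversion. A minimal sketch, assuming a recent mlx_lm that already knows the bailing_hybrid model class; it does not reproduce the custom passthrough rules for router gates and expert biases used in the published bundle:

```python
from mlx_lm import convert

# Approximate re-quantization of the bf16 source: 4-bit affine, group_size=32.
# MLX never quantizes norms; matching the released bundle's passthrough of
# router gates and expert biases would need a custom quantization predicate (not shown).
convert(
    "inclusionAI/Ling-2.6-flash",
    mlx_path="Ling-2.6-flash-MXFP4",  # output path, illustrative
    quantize=True,
    q_bits=4,
    q_group_size=32,
)
```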
Why two variants?
| | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | jang_tools.load_jangtq (TurboQuant kernel) | stock mlx_lm.load() |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |
JANGTQ2 accepts the cheap-but-slow MXTQ codec on the routed experts in exchange for a tighter overall bit budget. MXFP4 is the simpler "just works with stock MLX" option for users who don't want the TurboQuant runtime in their stack.
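As a rough sanity check on the 63 GB figure: a 4-bit payload plus an fp16 scale and fp16 bias per 32-weight group works out to about 5 effective bits per weight (approximate, ignoring the small fp16/fp32 passthrough tensors):

```python
# Back-of-the-envelope bundle-size estimate for the MXFP4 variant.
params = 103e9                        # ~103B total parameters
group_size = 32
payload_bits = 4                      # 4-bit affine payload
overhead_bits = 2 * 16 / group_size   # fp16 scale + fp16 bias per group ~= 1 bit/weight
effective_bits = payload_bits + overhead_bits   # ~= 5 bits per weight
print(f"{params * effective_bits / 8 / 1e9:.0f} GB")  # ~= 64 GB, close to the 63 GB on disk
```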
Architecture (bailing_hybrid)
Hybrid attention — every 8th layer is full softmax MLA, the other 28 of 32 are Lightning-Linear-Attention. Plus a Multi-Token Prediction head.
| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1–6, 8–14, 16–22, 24–30 | 27 | Linear (GLA) | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | MLA (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |
See the JANGTQ variant card for the deeper architecture writeup.
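For quick reference, the full-attention layer indices in the table follow a simple rule (a small illustration, not code shipped with the bundle):

```python
# Every 8th decoder layer (index % 8 == 7) is full-softmax MLA; the rest use linear attention.
num_layers = 32
full_attn_layers = [i for i in range(num_layers) if i % 8 == 7]
print(full_attn_layers)  # [7, 15, 23, 31]
```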
Loading (Python)
```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```
Stock mlx_lm.load() works once mlx_lm/models/bailing_hybrid.py is
present (shipped with jang-tools >= TBD). The bundle's
configuration_bailing_moe_v2_5.py and modeling_bailing_moe_v2_5.py
provide HF compatibility for tooling that goes through transformers.
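With the model and tokenizer loaded, generation goes through the tokenizer's chat template. A minimal sketch (argument names assume a recent mlx_lm; the prompt text is illustrative):

```python
# Build a chat-formatted prompt and generate; `model` / `tokenizer` come from load() above.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the Bailing-V2.5 hybrid attention layout."}],
    add_generation_prompt=True,
    tokenize=False,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```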
Reasoning + tools
Default is detailed thinking off. To enable:
```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```
The model emits `<think>...</think>` reasoning blocks before the final answer when thinking is on. Tool calls follow the DeepSeek-style format.
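An end-to-end sketch with thinking enabled, reusing the model, tokenizer, and messages from above (argument names assume a recent mlx_lm):

```python
# Apply the chat template to the thinking-on conversation and generate.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, tokenizer, prompt=prompt, max_tokens=512)
# With thinking on, `out` begins with a <think>...</think> block, followed by the answer.
print(out)
```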
Credits
- Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
- Base model: inclusionAI — Ant Group's Bailing team
- Architecture references: Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- Osaurus: osaurus.ai — Apple-Silicon-first inference for open-weight LLMs.