# Qwen3-Next-80B-A3B-Thinking — MLX 4-bit (mxfp4)
This repository provides an Apple MLX-optimized 4-bit (mxfp4) quantized checkpoint of the base model `Qwen/Qwen3-Next-80B-A3B-Thinking` for fast, memory-efficient local inference on Apple Silicon.
## Key details

- Format: MLX runtime, sharded `safetensors` weights
- Quantization: mxfp4, `group_size=32`, with selective 8-bit gates
- Task: text generation / chat
- Tokenizer: `tokenizer.json` (BPE) with `chat_template.jinja`
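The mxfp4 scheme can be sketched in plain Python: each group of weights shares one power-of-two scale, and each weight is snapped to a 4-bit (E2M1) floating-point grid. This is an illustrative sketch of the idea, not the actual MLX kernel; the 8-element block below is just to keep the example short (the real checkpoint uses `group_size=32`).

```python
# Illustrative sketch of mxfp4-style block quantization (not the MLX kernel).
# Assumptions: each block shares one power-of-two scale; each element is
# stored as an FP4 (E2M1) value. FP4_GRID lists the representable FP4
# magnitudes from the OCP Microscaling (MX) spec.
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Pick a shared power-of-two scale so the largest magnitude fits
    within the FP4 range (max 6.0), then snap each value to the grid."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    exp = math.ceil(math.log2(amax / 6.0))  # smallest scale with amax/scale <= 6
    scale = 2.0 ** exp
    codes = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(math.copysign(mag, x))
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

block = [0.11, -0.04, 0.25, 0.0, -0.18, 0.07, 0.31, -0.02]
scale, codes = quantize_block(block)
recon = dequantize_block(scale, codes)
err = max(abs(a - b) for a, b in zip(block, recon))
```

Storing one tiny scale per group is what keeps the average cost close to 4 bits per weight while preserving dynamic range across groups.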
## Files

- `model-0000X-of-00009.safetensors` and `model.safetensors.index.json` — weights (LFS)
- `config.json` — architecture and quantization keys for MLX loaders
- `tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, `added_tokens.json`, `special_tokens_map.json` — tokenizer files
- `chat_template.jinja` — chat formatting for multi-turn prompts
- `generation_config.json` — sensible default generation params
- `LICENSE` — Apache-2.0
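The sharded layout above is tied together by `model.safetensors.index.json`, which maps every tensor name to the shard file that stores it. A minimal sketch of reading such an index follows; the tensor names and sizes below are made up for illustration, not taken from this checkpoint:

```python
# Toy model.safetensors.index.json: same shape as the real file,
# but with invented tensor names and sizes.
import json

index_json = json.dumps({
    "metadata": {"total_size": 123456789},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00009.safetensors",
        "model.layers.0.mlp.gate.weight": "model-00001-of-00009.safetensors",
        "lm_head.weight": "model-00009-of-00009.safetensors",
    },
})

index = json.loads(index_json)
# Collect the distinct shard files referenced by the weight map --
# a quick sanity check that a download pulled every shard it needs.
shards = sorted(set(index["weight_map"].values()))
```

Loaders resolve each tensor through `weight_map`, so a missing shard shows up immediately as an unresolvable entry rather than a silent gap.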
## Usage (MLX)

Install MLX-LM:

```shell
pip install mlx-lm
```

Then run generation from Python:

```python
from mlx_lm import load, generate

repo_id = "abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx"
model, tokenizer = load(repo_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what mxfp4 quantization is in one paragraph."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

out = generate(
    model,
    tokenizer,
    prompt,
    max_tokens=512,
    temp=0.7,
    top_p=0.95,
)
print(out)
```

Note: on recent `mlx-lm` releases, sampling parameters such as `temp` and `top_p` are passed through a sampler (see `mlx_lm.sample_utils.make_sampler`) rather than directly to `generate`; check the API of your installed version.
CLI example:

```shell
mlx_lm.generate --model "abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx" \
  --prompt "List 5 creative dinner ideas." --max-tokens 200
```
## Hardware Notes

- Apple Silicon recommended (M2/M3 or later with ample unified memory). The 4-bit checkpoint reduces memory pressure significantly, but an 80B-class model still requires substantial unified memory (roughly 42 GB active during generation). For best results, ensure macOS swap is available and limit `max_tokens` accordingly.
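The footprint can be estimated with back-of-envelope arithmetic, assuming 4 bits per weight plus one shared 1-byte scale per 32-weight group (as in the MX formats); the selective 8-bit gates push the real number slightly higher:

```python
# Rough memory estimate for an 80B-parameter mxfp4 checkpoint (illustrative).
params = 80e9
weights_gb = params * 4 / 8 / 1e9    # 4-bit weights: 40.0 GB
scales_gb = (params / 32) / 1e9      # one 1-byte scale per 32-weight group: 2.5 GB
total_gb = weights_gb + scales_gb    # ~42.5 GB before activations and KV cache
```

This lines up with the ≈42 GB active memory observed in the benchmarks below; budget extra headroom for activations and the KV cache on top of the weights.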
## Benchmarks

- Environment: Apple Silicon (isolated runs; one model in memory at a time).
- Script: `scripts/bench/qwen_mxfp4_vs_int4.py` with `--runs 1 --max-new 256`.
- Full JSON: `bench_results.json`

Representative single-pass results:

| Model | gen_tok_s | ttft | mem_active |
| --- | --- | --- | --- |
| `abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx` | ≈ 37.5 tok/s | ≈ 2.58 s | ≈ 42.36 GB |
| `mlx-community/Qwen3-Next-80B-A3B-Thinking-4bit` | ≈ 33.8 tok/s | ≈ 3.10 s | ≈ 44.84 GB |
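The two headline metrics relate through simple arithmetic. The timings below are invented to illustrate the usual convention for computing them, not values taken from `bench_results.json`:

```python
# Illustrative derivation of benchmark metrics from raw timings.
# Assumption: generation throughput counts only the decode phase,
# i.e. time after the first token -- a common bench convention.
ttft = 2.58          # seconds until the first generated token
total_time = 9.41    # seconds for the whole generation call (invented)
new_tokens = 256     # matches --max-new from the bench script

gen_tok_s = new_tokens / (total_time - ttft)
speedup = 37.5 / 33.8  # mxfp4 vs. int4 gen_tok_s from the results above
```

On these invented timings the decode throughput comes out near the ≈37.5 tok/s reported above, and the relative speedup of mxfp4 over the int4 build is about 11%.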
## Notes

- Bench numbers vary with hardware, system load, and MLX version; treat them as directional.
- The bench enforces single-residency by default. Use `--no-isolate` only if you explicitly need concurrent in-process loads.
## License

- License: Apache-2.0 (this quantized packaging). See `LICENSE`.
- Base model: `Qwen/Qwen3-Next-80B-A3B-Thinking` (see its model card for upstream license and usage terms). Ensure your use complies with the base model's license.
## Attribution
If you use this model, please credit the upstream Qwen team and note that this is an MLX 4‑bit mxfp4 quantized derivative for Apple Silicon.