Qwen3-Next-80B-A3B-Thinking — MLX 4-bit (mxfp4)

This repository provides an Apple MLX-optimized 4-bit mxfp4 quantized checkpoint of the base model Qwen/Qwen3-Next-80B-A3B-Thinking for fast, memory‑efficient local inference on Apple Silicon.

Key details

  • Format: MLX runtime, safetensors sharded weights
  • Quantization: mxfp4, group_size=32, with selective 8‑bit gates (a schematic sketch of group‑wise 4‑bit quantization follows this list)
  • Task: text generation / chat
  • Tokenizer: provided via tokenizer.json (BPE) with chat_template.jinja
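
For intuition, here is a minimal sketch of group‑wise 4‑bit quantization on a toy tensor. It is a schematic illustration only (integer codes with one shared scale per group of 32), not the exact mxfp4 encoding, which stores FP4 values with per‑group scales:

# Schematic illustration of group-wise 4-bit quantization (NOT the exact
# mxfp4 bit layout): each group of 32 weights shares one scale, and each
# weight becomes a small 4-bit code that is rescaled on dequantize.
import mlx.core as mx

GROUP_SIZE = 32
w = mx.random.normal([4, 128])                               # toy weight matrix
groups = w.reshape(-1, GROUP_SIZE)                           # one row per group
scale = mx.max(mx.abs(groups), axis=1, keepdims=True) / 7.0  # map into the signed 4-bit range [-7, 7]
codes = mx.clip(mx.round(groups / scale), -7, 7)             # per-weight 4-bit codes
w_hat = (codes * scale).reshape(w.shape)                     # dequantized approximation
print("max abs reconstruction error:", mx.max(mx.abs(w - w_hat)).item())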

Files

  • model-0000X-of-00009.safetensors and model.safetensors.index.json — weights (LFS)
  • config.json — architecture and quantization keys for MLX loaders (see the inspection snippet after this list)
  • tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, added_tokens.json, special_tokens_map.json
  • chat_template.jinja — chat formatting for multi‑turn prompts
  • generation_config.json — sensible default generation params
  • LICENSE — Apache-2.0
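
To check which quantization settings a loader will pick up, you can read config.json from the repo. The key names used below ("quantization", "group_size", "bits") follow common MLX export conventions and are an assumption; the file itself is authoritative:

# Download config.json and print its quantization section.
# Key names follow common MLX export conventions; verify against the actual file.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx", "config.json"
)
with open(path) as f:
    config = json.load(f)
print(config.get("quantization"))  # expected to include group-size / bit-width entries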

Usage (MLX)

Install MLX-LM:

pip install mlx-lm

Then run generation from Python:

from mlx_lm import load, generate

repo_id = "abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx"
model, tokenizer = load(repo_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what mxfp4 quantization is in one paragraph."},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

out = generate(
    model,
    tokenizer,
    prompt,
    max_tokens=512,
    # Note: recent mlx-lm releases pass sampling options via a sampler
    # (mlx_lm.sample_utils.make_sampler) instead of temp/top_p keywords.
    temp=0.7,
    top_p=0.95,
)
print(out)

CLI example:

mlx_lm.generate --model "abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx" \
  --prompt "List 5 creative dinner ideas." --max-tokens 200

Hardware Notes

  • Apple Silicon recommended (M-series, e.g. M2/M3). The 4‑bit checkpoint reduces memory pressure significantly, but 80B‑class models still require substantial unified memory (the benchmark below reports ≈42 GB active). For best results, ensure macOS swap is available, limit max_tokens accordingly, and monitor memory during generation (see the snippet below).
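
As a quick check, you can ask MLX how much memory it is actively holding. The helper has moved between MLX releases (mx.metal.get_active_memory in older versions, mx.get_active_memory in newer ones), so this sketch tries both:

# Report MLX's active memory in GB; run after load() or a generation pass.
import mlx.core as mx

def active_memory_gb() -> float:
    # Newer MLX exposes mx.get_active_memory; older releases use mx.metal.
    get_active = getattr(mx, "get_active_memory", None) or mx.metal.get_active_memory
    return get_active() / 1e9

print(f"active memory: {active_memory_gb():.2f} GB")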

Benchmarks

  • Environment: Apple Silicon (isolated runs; one model in memory at a time).
  • Script: scripts/bench/qwen_mxfp4_vs_int4.py with --runs 1 --max-new 256 (a minimal timing sketch follows this list).
  • Full JSON: bench_results.json
  • Results (representative, single pass):
    • abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx
      • gen_tok_s: ≈ 37.5 tok/s; ttft: ≈ 2.58 s; mem_active: ≈ 42.36 GB
    • mlx-community/Qwen3-Next-80B-A3B-Thinking-4bit
      • gen_tok_s: ≈ 33.8 tok/s; ttft: ≈ 3.10 s; mem_active: ≈ 44.84 GB
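
If you want rough numbers without the repo's bench script, the sketch below times time‑to‑first‑token and generation throughput with stream_generate. It is a simplified stand‑in, not the script used above, and per‑version differences in mlx-lm's streaming API may require small adjustments:

# Rough timing of ttft and generation throughput (not the repo's bench script).
import time
from mlx_lm import load, stream_generate

model, tokenizer = load("abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize mxfp4 quantization."}],
    tokenize=False,
    add_generation_prompt=True,
)

start = time.perf_counter()
first_token_at = None
n_tokens = 0
for _response in stream_generate(model, tokenizer, prompt, max_tokens=256):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # time to first token
    n_tokens += 1
end = time.perf_counter()

print(f"ttft: {first_token_at - start:.2f} s")
print(f"gen_tok_s: {n_tokens / (end - first_token_at):.1f} tok/s")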

Notes

  • Bench numbers vary with hardware, system load, and MLX version; treat as directional.
  • The bench enforces single‑residency by default. Use --no-isolate only if you explicitly need concurrent in‑process loads.

License

  • License: Apache-2.0 (this quantized packaging). See LICENSE.
  • Base model: Qwen/Qwen3-Next-80B-A3B-Thinking (see its model card for upstream license and usage terms). Ensure your use complies with the base model’s license.

Attribution

If you use this model, please credit the upstream Qwen team and note that this is an MLX 4‑bit mxfp4 quantized derivative for Apple Silicon.
