Qwen3-Next-80B-A3B-Thinking — MLX 4-bit (mxfp4)

This repository provides an Apple MLX-optimized 4-bit mxfp4 quantized checkpoint of the base model Qwen/Qwen3-Next-80B-A3B-Thinking for fast, memory‑efficient local inference on Apple Silicon.

Key details

  • Format: MLX runtime, safetensors sharded weights
  • Quantization: mxfp4, group_size=32, with selective 8‑bit gates (a schematic sketch of group‑wise 4‑bit quantization follows this list)
  • Task: text generation / chat
  • Tokenizer: provided via tokenizer.json (BPE) with chat_template.jinja
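
For intuition, here is a minimal sketch of group‑wise 4‑bit quantization on a toy tensor. It is a schematic illustration only (integer codes with one shared scale per group of 32), not the exact mxfp4 encoding, which stores FP4 values with per‑group scales:

# Schematic illustration of group-wise 4-bit quantization (NOT the exact
# mxfp4 bit layout): each group of 32 weights shares one scale, and each
# weight becomes a small 4-bit code that is rescaled on dequantize.
import mlx.core as mx

GROUP_SIZE = 32
w = mx.random.normal([4, 128])                               # toy weight matrix
groups = w.reshape(-1, GROUP_SIZE)                           # one row per group
scale = mx.max(mx.abs(groups), axis=1, keepdims=True) / 7.0  # map into the signed 4-bit range [-7, 7]
codes = mx.clip(mx.round(groups / scale), -7, 7)             # per-weight 4-bit codes
w_hat = (codes * scale).reshape(w.shape)                     # dequantized approximation
print("max abs reconstruction error:", mx.max(mx.abs(w - w_hat)).item())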

Files

  • model-0000X-of-00009.safetensors and model.safetensors.index.json — weights (LFS)
  • config.json — architecture and quantization keys for MLX loaders (see the inspection snippet after this list)
  • tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, added_tokens.json, special_tokens_map.json
  • chat_template.jinja — chat formatting for multi‑turn prompts
  • generation_config.json — sensible default generation params
  • LICENSE — Apache-2.0
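
To check which quantization settings a loader will pick up, you can read config.json from the repo. The key names used below ("quantization", "group_size", "bits") follow common MLX export conventions and are an assumption; the file itself is authoritative:

# Download config.json and print its quantization section.
# Key names follow common MLX export conventions; verify against the actual file.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx", "config.json"
)
with open(path) as f:
    config = json.load(f)
print(config.get("quantization"))  # expected to include group-size / bit-width entries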

Usage (MLX)

Install MLX-LM:

pip install mlx-lm

Then run generation from Python:

from mlx_lm import load, generate

repo_id = "abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx"
model, tokenizer = load(repo_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what mxfp4 quantization is in one paragraph."},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

out = generate(
    model,
    tokenizer,
    prompt,
    max_tokens=512,
    # Note: recent mlx-lm releases pass sampling options via a sampler
    # (mlx_lm.sample_utils.make_sampler) instead of temp/top_p keywords.
    temp=0.7,
    top_p=0.95,
)
print(out)

CLI example:

mlx_lm.generate --model "abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx" \
  --prompt "List 5 creative dinner ideas." --max-tokens 200

Hardware Notes

  • Apple Silicon recommended (M-series, e.g. M2/M3). The 4‑bit checkpoint reduces memory pressure significantly, but 80B‑class models still require substantial unified memory (the benchmark below reports ≈42 GB active). For best results, ensure macOS swap is available, limit max_tokens accordingly, and monitor memory during generation (see the snippet below).
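
As a quick check, you can ask MLX how much memory it is actively holding. The helper has moved between MLX releases (mx.metal.get_active_memory in older versions, mx.get_active_memory in newer ones), so this sketch tries both:

# Report MLX's active memory in GB; run after load() or a generation pass.
import mlx.core as mx

def active_memory_gb() -> float:
    # Newer MLX exposes mx.get_active_memory; older releases use mx.metal.
    get_active = getattr(mx, "get_active_memory", None) or mx.metal.get_active_memory
    return get_active() / 1e9

print(f"active memory: {active_memory_gb():.2f} GB")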

Benchmarks

  • Environment: Apple Silicon (isolated runs; one model in memory at a time).
  • Script: scripts/bench/qwen_mxfp4_vs_int4.py with --runs 1 --max-new 256 (a minimal timing sketch follows this list).
  • Full JSON: bench_results.json
  • Results (representative, single pass):
    • abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx
      • gen_tok_s: ≈ 37.5 tok/s; ttft: ≈ 2.58 s; mem_active: ≈ 42.36 GB
    • mlx-community/Qwen3-Next-80B-A3B-Thinking-4bit
      • gen_tok_s: ≈ 33.8 tok/s; ttft: ≈ 3.10 s; mem_active: ≈ 44.84 GB
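
If you want rough numbers without the repo's bench script, the sketch below times time‑to‑first‑token and generation throughput with stream_generate. It is a simplified stand‑in, not the script used above, and per‑version differences in mlx-lm's streaming API may require small adjustments:

# Rough timing of ttft and generation throughput (not the repo's bench script).
import time
from mlx_lm import load, stream_generate

model, tokenizer = load("abnormalmapstudio/Qwen3-Next-80B-A3B-Thinking-mxfp4-mlx")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize mxfp4 quantization."}],
    tokenize=False,
    add_generation_prompt=True,
)

start = time.perf_counter()
first_token_at = None
n_tokens = 0
for _response in stream_generate(model, tokenizer, prompt, max_tokens=256):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # time to first token
    n_tokens += 1
end = time.perf_counter()

print(f"ttft: {first_token_at - start:.2f} s")
print(f"gen_tok_s: {n_tokens / (end - first_token_at):.1f} tok/s")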

Notes

  • Bench numbers vary with hardware, system load, and MLX version; treat as directional.
  • The bench enforces single‑residency by default. Use --no-isolate only if you explicitly need concurrent in‑process loads.

License

  • License: Apache-2.0 (this quantized packaging). See LICENSE.
  • Base model: Qwen/Qwen3-Next-80B-A3B-Thinking (see its model card for upstream license and usage terms). Ensure your use complies with the base model’s license.

Attribution

If you use this model, please credit the upstream Qwen team and note that this is an MLX 4‑bit mxfp4 quantized derivative for Apple Silicon.
