Qwen3-Next-80B-A3B-Instruct – MLX 6-bit (group size 64)

Summary. This is a 6-bit (int6) MLX quantization of Qwen3-Next-80B-A3B-Instruct with group size 64. Built for Apple Silicon with Metal acceleration.

  • Base model: Qwen/Qwen3-Next-80B-A3B-Instruct (apache-2.0)
  • Quantization: MLX int6, q_group_size=64 (some tensors may remain 16-bit for stability; see the sketch after this list)
  • Files: MLX weight shards + config.json; tokenizer files included for drop-in use
  • Intended use: local inference / research on M-series Macs
  • Not intended for: safety-critical decisions; outputs may be inaccurate or biased
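
For intuition, the sketch below shows what group-wise affine quantization at 6 bits with a group size of 64 looks like in plain NumPy. It is an illustration of the idea only; MLX's packed storage layout and Metal kernels differ, and the function names are made up for the example.

import numpy as np

def quantize_groupwise(w_row, bits=6, group_size=64):
    """Affine-quantize a 1-D weight row in contiguous groups of `group_size`.

    Each group gets its own scale and zero-point, so the 6-bit codes (0..63)
    only have to cover that group's dynamic range. Illustration only.
    """
    levels = (1 << bits) - 1                       # 63 for 6-bit
    groups = w_row.reshape(-1, group_size)         # assumes len(w_row) % group_size == 0
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / levels
    codes = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    return codes, scale, w_min                     # codes plus per-group scale/zero-point

def dequantize_groupwise(codes, scale, w_min):
    return (codes.astype(np.float32) * scale + w_min).reshape(-1)

# Round-trip error for one 4096-wide row at 6 bits, group size 64
row = np.random.randn(4096).astype(np.float32)
codes, scale, zero = quantize_groupwise(row)
err = np.abs(dequantize_groupwise(codes, scale, zero) - row).max()
print(f"max abs reconstruction error: {err:.4f}")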

Requirements

Runs on Apple Silicon (M1 or newer) with macOS ≥ 13.5 via MLX (Metal).

  • Not supported: Intel macOS / Linux / Windows (consider a GGUF build + llama.cpp instead).
  • Memory guidance: large unified memory is recommended (96 GB gives comfortable headroom). The effective GPU working set is capped by Metal's budget; keep 5–10% headroom (a quick check is sketched below).
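
As a quick sanity check of the Metal budget on a given machine, recent MLX builds expose a device-info call; a minimal sketch (field names may vary between MLX versions):

import mlx.core as mx

# Recent MLX builds report the Metal budget; keep the model's working set
# 5-10% below the recommended figure when sizing context length and batch.
info = mx.metal.device_info()
print(f"Recommended Metal working set: {info['max_recommended_working_set_size'] / 1024**3:.1f} GB")
print(f"Physical unified memory:       {info['memory_size'] / 1024**3:.1f} GB")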

How to use (MLX)

# Install the MLX runtime
pip install mlx-lm

# Python API: use the uploaded HF repo or a local path to the MLX export
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64")
print(generate(
    model, tokenizer,
    prompt="Explain the Chudnovsky algorithm to compute π.",
    max_tokens=256, max_kv_size=512
))

# CLI equivalent
python -m mlx_lm generate --model halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64 \
  --prompt "Explain the Chudnovsky algorithm to compute pi." \
  --max-kv-size 512 --max-tokens 256
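
For instruction-style or multi-turn prompts, apply the tokenizer's chat template before calling generate; a minimal sketch (same repo id as above):

from mlx_lm import load, generate

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64")

messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Summarize the Chudnovsky algorithm in three sentences."},
]
# Render the conversation with the model's chat template, then generate as usual.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))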

Evaluation

Perplexity (PPL) streaming evaluation on WikiText-2 (raw, test); fast preset with window=stride=4096, ~100k tokens, EOS inserted between docs.

Variant                 PPL (ctx=4096, fast)
MLX bf16 (reference)    5.14
MLX 6-bit (gs=64)       5.14 (≈0.0% vs bf16)
MLX 5-bit (gs=32)       5.20 (+1.2% vs bf16, +1.2% vs 6-bit/gs64)
MLX 4-bit (gs=64)       5.43 (+5.6% vs bf16, +5.6% vs 6-bit/gs64)

Notes:

  • Numbers are from local MLX runs on Apple Silicon; small variations are expected depending on tokenizer details, logits dtype, and the sampled token subset.
  • For more sensitive comparisons, use overlapping windows (for example, --stride 512) and evaluate the full split.

Interpretation

  • 6-bit gs64 matches the bf16 reference on this corpus, making it the quality pick.
  • 5-bit gs32 is near-par in PPL, strong on deterministic math probes, and has a smaller footprint.
  • 4-bit gs64 shows a modest drop; choose it when footprint/throughput matter most.

Reproduce locally:

python python/scripts/test_perplexity-mlx.py \
  --model_path "/path/to/Qwen3-Next-80B-A3B-Instruct-6bit-gs64" \
  --fast --progress
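
If the helper script above is not at hand, the fast preset boils down to the loop sketched below, assuming mlx-lm plus the Hugging Face datasets package; treat it as an approximation of the evaluation, not the exact script.

import math
import mlx.core as mx
from datasets import load_dataset
from mlx_lm import load

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64")

# WikiText-2 (raw, test): join documents with EOS, then score non-overlapping
# windows (window = stride = 4096) over roughly the first 100k tokens.
docs = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
eos = tokenizer.eos_token or ""
ids = tokenizer.encode(eos.join(d for d in docs if d.strip()))[:100_000]

window = 4096  # shrink this if the float32 logits do not fit in memory
nll, count = 0.0, 0
for start in range(0, len(ids) - window, window):
    chunk = mx.array(ids[start : start + window])[None]   # shape (1, window)
    logits = model(chunk[:, :-1]).astype(mx.float32)       # next-token logits
    logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    targets = chunk[:, 1:]
    token_lp = mx.take_along_axis(logprobs, targets[..., None], axis=-1)
    nll -= token_lp.sum().item()
    count += targets.size

print(f"PPL ~ {math.exp(nll / count):.2f}")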

Conversion details (provenance)

python -m mlx_lm convert \
  --hf-path Qwen3-Next-80B-A3B-Instruct \
  --mlx-path /path/to/Qwen3-Next-80B-A3B-Instruct-6bit-gs64 \
  -q --q-bits 6 --q-group-size 64
  • Some tensors (for example, embeddings/norms/router) may remain 16-bit for numerical stability; the check below shows how to confirm which ones.
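
To see which tensors actually stayed in 16-bit, the safetensors shard headers can be read directly with the standard library; the path below is a placeholder for your local export.

import glob, json, struct
from collections import Counter

dtypes = Counter()
for shard in glob.glob("/path/to/Qwen3-Next-80B-A3B-Instruct-6bit-gs64/*.safetensors"):
    with open(shard, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # 8-byte little-endian header size
        header = json.loads(f.read(header_len))
    for name, meta in header.items():
        if name != "__metadata__":
            dtypes[meta["dtype"]] += 1

# Quantized layers typically appear as packed "U32" codes with 16-bit scales/biases;
# tensors kept fully in bf16 (embeddings, norms, router) show up as "BF16".
print(dtypes)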

Sibling & reference models

  • halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-5bit-gs32
  • halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64

Limitations and biases

Outputs may be factually wrong or unsafe. Do not use for medical, legal, or financial decisions without human review. Large models can be sensitive to prompt wording; prefer explicit, structured prompts.

License and credits

  • License: apache-2.0 (inherits from the base model)
  • Base model: Qwen/Qwen3-Next-80B-A3B-Instruct
  • Quantization: Halley AI Lab (MLX int6, gs=64)
  • Please cite both the base model and this repository when you use the weights.