Qwen3-Next-80B-A3B-Instruct – MLX 6-bit (group size 64)

Summary. This is a 6-bit (int6) MLX quantization of Qwen3-Next-80B-A3B-Instruct with group size 64. Built for Apple Silicon with Metal acceleration.

  • Base model: Qwen/Qwen3-Next-80B-A3B-Instruct (apache-2.0)
  • Quantization: MLX int6, q_group_size=64 (some tensors may remain 16-bit for stability; see the sketch after this list)
  • Files: MLX weight shards + config.json; tokenizer files included for drop-in use
  • Intended use: local inference / research on M-series Macs
  • Not intended for: safety-critical decisions; outputs may be inaccurate or biased
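
For intuition, the sketch below shows what group-wise affine quantization at 6 bits with a group size of 64 looks like in plain NumPy. It is an illustration of the idea only; MLX's packed storage layout and Metal kernels differ, and the function names are made up for the example.

import numpy as np

def quantize_groupwise(w_row, bits=6, group_size=64):
    """Affine-quantize a 1-D weight row in contiguous groups of `group_size`.

    Each group gets its own scale and zero-point, so the 6-bit codes (0..63)
    only have to cover that group's dynamic range. Illustration only.
    """
    levels = (1 << bits) - 1                       # 63 for 6-bit
    groups = w_row.reshape(-1, group_size)         # assumes len(w_row) % group_size == 0
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / levels
    codes = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    return codes, scale, w_min                     # codes plus per-group scale/zero-point

def dequantize_groupwise(codes, scale, w_min):
    return (codes.astype(np.float32) * scale + w_min).reshape(-1)

# Round-trip error for one 4096-wide row at 6 bits, group size 64
row = np.random.randn(4096).astype(np.float32)
codes, scale, zero = quantize_groupwise(row)
err = np.abs(dequantize_groupwise(codes, scale, zero) - row).max()
print(f"max abs reconstruction error: {err:.4f}")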

Requirements

Runs on Apple Silicon (M1 or newer) with macOS ≥ 13.5 via MLX (Metal).

  • Not supported: Intel macOS / Linux / Windows (consider a GGUF build + llama.cpp instead).
  • Memory guidance: large unified memory is recommended (96 GB gives comfortable headroom). The effective GPU working set is capped by Metal's budget; keep 5–10% headroom (a quick check is sketched below).
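
As a quick sanity check of the Metal budget on a given machine, recent MLX builds expose a device-info call; a minimal sketch (field names may vary between MLX versions):

import mlx.core as mx

# Recent MLX builds report the Metal budget; keep the model's working set
# 5-10% below the recommended figure when sizing context length and batch.
info = mx.metal.device_info()
print(f"Recommended Metal working set: {info['max_recommended_working_set_size'] / 1024**3:.1f} GB")
print(f"Physical unified memory:       {info['memory_size'] / 1024**3:.1f} GB")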

How to use (MLX)

# Install the MLX runtime
pip install mlx-lm

# Python API: use the uploaded HF repo or a local path to the MLX export
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64")
print(generate(
    model, tokenizer,
    prompt="Explain the Chudnovsky algorithm to compute π.",
    max_tokens=256, max_kv_size=512
))

# CLI equivalent
python -m mlx_lm generate --model halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64 \
  --prompt "Explain the Chudnovsky algorithm to compute pi." \
  --max-kv-size 512 --max-tokens 256
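
For instruction-style or multi-turn prompts, apply the tokenizer's chat template before calling generate; a minimal sketch (same repo id as above):

from mlx_lm import load, generate

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64")

messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Summarize the Chudnovsky algorithm in three sentences."},
]
# Render the conversation with the model's chat template, then generate as usual.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))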

Evaluation

Perplexity (PPL) streaming evaluation on WikiText-2 (raw, test); fast preset with window=stride=4096, ~100k tokens, EOS inserted between docs.

Variant                 PPL (ctx=4096, fast)
MLX bf16 (reference)    5.14
MLX 6-bit (gs=64)       5.14 (≈0.0% vs bf16)
MLX 5-bit (gs=32)       5.20 (+1.2% vs bf16, +1.2% vs 6-bit/gs64)
MLX 4-bit (gs=64)       5.43 (+5.6% vs bf16, +5.6% vs 6-bit/gs64)

Notes:

  • Numbers are from local MLX runs on Apple Silicon; small variations are expected depending on tokenizer details, logits dtype, and the sampled token subset.
  • For more sensitive comparisons, use overlapping windows (for example, --stride 512) and evaluate the full split.

Interpretation

  • 6-bit gs64 matches the bf16 reference on this corpus, making it the quality pick.
  • 5-bit gs32 is near-par in PPL, strong on deterministic math probes, and has a smaller footprint.
  • 4-bit gs64 shows a modest drop; choose it when footprint/throughput matter most.

Reproduce locally:

python python/scripts/test_perplexity-mlx.py \
  --model_path "/path/to/Qwen3-Next-80B-A3B-Instruct-6bit-gs64" \
  --fast --progress
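
If the helper script above is not at hand, the fast preset boils down to the loop sketched below, assuming mlx-lm plus the Hugging Face datasets package; treat it as an approximation of the evaluation, not the exact script.

import math
import mlx.core as mx
from datasets import load_dataset
from mlx_lm import load

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64")

# WikiText-2 (raw, test): join documents with EOS, then score non-overlapping
# windows (window = stride = 4096) over roughly the first 100k tokens.
docs = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
eos = tokenizer.eos_token or ""
ids = tokenizer.encode(eos.join(d for d in docs if d.strip()))[:100_000]

window = 4096  # shrink this if the float32 logits do not fit in memory
nll, count = 0.0, 0
for start in range(0, len(ids) - window, window):
    chunk = mx.array(ids[start : start + window])[None]   # shape (1, window)
    logits = model(chunk[:, :-1]).astype(mx.float32)       # next-token logits
    logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    targets = chunk[:, 1:]
    token_lp = mx.take_along_axis(logprobs, targets[..., None], axis=-1)
    nll -= token_lp.sum().item()
    count += targets.size

print(f"PPL ~ {math.exp(nll / count):.2f}")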

Conversion details (provenance)

python -m mlx_lm convert \
  --hf-path Qwen3-Next-80B-A3B-Instruct \
  --mlx-path /path/to/Qwen3-Next-80B-A3B-Instruct-6bit-gs64 \
  -q --q-bits 6 --q-group-size 64
  • Some tensors (for example, embeddings/norms/router) may remain 16-bit for numerical stability; the check below shows how to confirm which ones.
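
To see which tensors actually stayed in 16-bit, the safetensors shard headers can be read directly with the standard library; the path below is a placeholder for your local export.

import glob, json, struct
from collections import Counter

dtypes = Counter()
for shard in glob.glob("/path/to/Qwen3-Next-80B-A3B-Instruct-6bit-gs64/*.safetensors"):
    with open(shard, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # 8-byte little-endian header size
        header = json.loads(f.read(header_len))
    for name, meta in header.items():
        if name != "__metadata__":
            dtypes[meta["dtype"]] += 1

# Quantized layers typically appear as packed "U32" codes with 16-bit scales/biases;
# tensors kept fully in bf16 (embeddings, norms, router) show up as "BF16".
print(dtypes)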

Sibling & reference models

  • halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-5bit-gs32
  • halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64

Limitations and biases

Outputs may be factually wrong or unsafe. Do not use for medical, legal, or financial decisions without human review. Large models can be sensitive to prompt wording; prefer explicit, structured prompts.

License and credits

  • License: apache-2.0 (inherits from the base model)
  • Base model: Qwen/Qwen3-Next-80B-A3B-Instruct
  • Quantization: Halley AI Lab (MLX int6, gs=64)
  • Please cite both the base model and this repository when you use the weights.