Qwen3-Next-80B-A3B-Instruct – MLX 6-bit (group size 64)
Summary. This is a 6-bit (int6) MLX quantization of Qwen3-Next-80B-A3B-Instruct with group size 64. Built for Apple Silicon with Metal acceleration.
- Base model: Qwen/Qwen3-Next-80B-A3B-Instruct (apache-2.0)
- Quantization: MLX int6, `q_group_size=64` (some tensors may remain 16-bit for stability; see the quick check below)
- Files: MLX weight shards + `config.json`; tokenizer files included for drop-in use
- Intended use: local inference / research on M-series Macs
- Not intended for: safety-critical decisions; outputs may be inaccurate or biased
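A quick way to confirm the quantization settings of a local export is to read the quantization block that `mlx_lm convert` writes into `config.json`. The field layout below is an assumption based on current mlx-lm output; adjust the path and keys if your copy differs.

```python
# Inspect the quantization block in config.json (layout assumed from mlx-lm convert output).
import json

with open("/path/to/Qwen3-Next-80B-A3B-Instruct-6bit-gs64/config.json") as f:
    cfg = json.load(f)

# Expected something like: {'group_size': 64, 'bits': 6, ...}
print(cfg.get("quantization"))
```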
Requirements
Runs on Apple Silicon (M1 or newer) with macOS ≥ 13.5 via MLX (Metal).
- Not supported: Intel macOS / Linux / Windows (consider a GGUF build + llama.cpp instead).
- Memory guidance: large unified memory recommended (96 GB provides comfortable headroom). The effective GPU working set is capped by Metal's budget; keep 5–10% headroom (see the budget check below).
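To see how much of that budget is available on a given machine, a rough check looks like the sketch below. It assumes a recent mlx build that exposes `mx.metal.device_info()`; the 90–95% cap simply restates the headroom guidance above.

```python
# Rough headroom check; assumes mx.metal.device_info() is available (recent mlx releases).
import mlx.core as mx

info = mx.metal.device_info()
budget_gb = info["max_recommended_working_set_size"] / 1024**3
total_gb = info["memory_size"] / 1024**3

print(f"Metal working-set budget: {budget_gb:.1f} GB of {total_gb:.1f} GB unified memory")
# Keep model weights + KV cache inside ~90-95% of the budget (5-10% headroom).
print(f"Suggested cap: {0.90 * budget_gb:.1f}-{0.95 * budget_gb:.1f} GB")
```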
How to use (MLX)
pip install mlx-lm
from mlx_lm import load, generate

# Use the uploaded HF repo or a local path to the MLX export
model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64")
print(generate(
    model, tokenizer,
    prompt="Explain the Chudnovsky algorithm to compute π.",
    max_tokens=256, max_kv_size=512
))
Or from the command line:
python -m mlx_lm generate --model halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64 \
  --prompt "Explain the Chudnovsky algorithm to compute pi." \
  --max-kv-size 512 --max-tokens 256
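For instruction-style or multi-turn use, it usually helps to run the request through the model's chat template rather than passing raw text. A minimal sketch, assuming the tokenizer returned by `load` exposes the standard Hugging Face `apply_chat_template` method:

```python
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64")

# Wrap the user message in the model's chat template before generating.
messages = [{"role": "user", "content": "Explain the Chudnovsky algorithm to compute pi."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```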
Evaluation
Perplexity (PPL) streaming evaluation on WikiText-2 (raw, test); fast preset with window=stride=4096, ~100k tokens, EOS inserted between docs.
| Variant | PPL (ctx=4096, fast) |
|---|---|
| MLX bf16 (reference) | 5.14 |
| MLX 6-bit (gs=64) | 5.14 (-0.0% vs bf16) |
| MLX 5-bit (gs=32) | 5.20 (+1.2% vs bf16, +1.2% vs 6b/gs64) |
| MLX 4-bit (gs=64) | 5.43 (+5.6% vs bf16, +5.6% vs 6b/gs64) |
Notes:
- Numbers from local MLX runs on Apple Silicon; small variations are expected with tokenizer details, logits dtype, and token subset.
- For more sensitive comparisons, use overlapping windows (for example, `--stride 512`) and evaluate the full split.
Interpretation
- 6-bit gs64 matches the bf16 reference on this corpus, making it the quality pick.
- 5-bit gs32 is near-par in PPL and strong on deterministic math probes (smaller footprint).
- 4-bit gs64 shows a modest drop; choose it when footprint/throughput matter most.
Reproduce locally:
python python/scripts/test_perplexity-mlx.py \
  --model_path "/path/to/Qwen3-Next-80B-A3B-Instruct-6bit-gs64" \
  --fast --progress
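If you don't have the repository script at hand, the sketch below approximates the same measurement: non-overlapping 4096-token windows over WikiText-2 (raw, test), EOS inserted between documents, truncated to roughly 100k tokens. Loading the corpus via `datasets` and calling the loaded model directly for logits are assumptions about your local setup; this is not the exact script used for the table above.

```python
# Minimal windowed-PPL sketch (window == stride == 4096, ~100k tokens, EOS between docs).
import math
import mlx.core as mx
from datasets import load_dataset
from mlx_lm import load

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64")

# Concatenate documents, inserting EOS between them.
docs = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
ids = []
for doc in docs:
    if doc.strip():
        ids.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])
ids = ids[:100_000]  # fast preset: ~100k-token subset

window = 4096
nll, count = 0.0, 0
for start in range(0, len(ids) - 1, window):
    chunk = ids[start:start + window + 1]
    if len(chunk) < 2:
        break
    x = mx.array(chunk[:-1])[None]   # inputs  (1, T)
    y = mx.array(chunk[1:])[None]    # targets (1, T)
    logits = model(x)
    logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    tok_lp = mx.take_along_axis(logprobs, y[..., None], axis=-1)[..., 0]
    nll -= tok_lp.astype(mx.float32).sum().item()
    count += y.shape[1]

print(f"PPL over {count} tokens: {math.exp(nll / count):.2f}")
```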
Conversion details (provenance)
python -m mlx_lm convert \
  --hf-path Qwen3-Next-80B-A3B-Instruct \
  --mlx-path /path/to/Qwen3-Next-80B-A3B-Instruct-6bit-gs64 \
  -q --q-bits 6 --q-group-size 64
- Some tensors (for example, embeddings/norms/router) may remain 16-bit for numerical stability.
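To see which tensors were actually left in 16-bit in a given export, one option is to list the floating-point weights after loading; quantized layers store their packed weights as integer arrays. This is a sketch against the current mlx / mlx-lm APIs, not part of the conversion pipeline.

```python
# List tensors that stayed in a 16-bit floating dtype after quantization.
from mlx.utils import tree_flatten
from mlx_lm import load

model, _ = load("/path/to/Qwen3-Next-80B-A3B-Instruct-6bit-gs64")

for name, p in tree_flatten(model.parameters()):
    # Quantized weights are packed integers; this skips their float scales/biases.
    if "float" in str(p.dtype) and name.endswith(".weight"):
        print(name, p.dtype, p.shape)
```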
Sibling & reference models
- halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-5bit-gs32
- halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64
Limitations and biases
Outputs may be factually wrong or unsafe. Do not use for medical, legal, or financial decisions without human review. Large models can be sensitive to prompt wording; prefer explicit, structured prompts.
License and credits
- License: apache-2.0 (inherits from the base model)
- Base model: Qwen/Qwen3-Next-80B-A3B-Instruct
- Quantization: Halley AI Lab (MLX int6, gs=64)
- Please cite both the base model and this repository when you use the weights.