Update README.md
    	
        README.md
    CHANGED
    
    | @@ -17,13 +17,24 @@ tags: | |
| 17 | 
             
            - halley-ai
         | 
| 18 | 
             
            ---
         | 
| 19 |  | 
| 20 | 
            -
            #  | 
| 21 |  | 
| 22 | 
            -
            This  | 
| 23 | 
            -
            converted to MLX format from [Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct)
         | 
| 24 | 
            -
            using mlx-lm version **0.28.0**.
         | 
| 25 |  | 
| 26 | 
            -
             | 
| 27 |  | 
| 28 | 
             
            ```bash
         | 
| 29 | 
             
            pip install mlx-lm
         | 
| @@ -33,14 +44,73 @@ pip install mlx-lm | |
| 33 | 
             
            from mlx_lm import load, generate
         | 
| 34 |  | 
| 35 | 
             
            model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64")
         | 
| 36 |  | 
| 37 | 
            -
             | 
| 38 |  | 
| 39 | 
            -
             | 
| 40 | 
            -
                messages = [{"role": "user", "content": prompt}]
         | 
| 41 | 
            -
                prompt = tokenizer.apply_chat_template(
         | 
| 42 | 
            -
                    messages, add_generation_prompt=True
         | 
| 43 | 
            -
                )
         | 
| 44 |  | 
| 45 | 
            -
             | 
| 46 | 
             
            ```
         | 
|  | 
|  | |
| 17 | 
             
            - halley-ai
         | 
| 18 | 
             
            ---
         | 
| 19 |  | 
| 20 | 
            +
            # Qwen3-Next-80B-A3B-Instruct — MLX 4-bit (group size 64)
         | 
| 21 |  | 
| 22 | 
            +
            **Summary.** This is a 4-bit (Q4) MLX quantization of Qwen3-Next-80B-A3B-Instruct with group size 64. It is built for Apple Silicon and runs with Metal acceleration.
         | 
| 23 |  | 
| 24 | 
            +
            - Base model: `Qwen/Qwen3-Next-80B-A3B-Instruct` (apache-2.0)
         | 
| 25 | 
            +
            - Quantization: MLX Q4, `q_group_size=64` (some tensors may remain 16-bit for stability)
         | 
| 26 | 
            +
            - Files: MLX weight shards + `config.json`; tokenizer files included for drop-in use
         | 
| 27 | 
            +
            - Intended use: lightweight local inference on M-series Macs
         | 
| 28 | 
            +
            - Not intended for: safety-critical decisions; outputs may be inaccurate or biased
         | 
| 29 | 
            +
             | 
| 30 | 
            +
            ## Requirements
         | 
| 31 | 
            +
             | 
| 32 | 
            +
            Runs on Apple Silicon (M1 or newer) with macOS ≥ 13.5 via MLX (Metal).
         | 
| 33 | 
            +
             | 
| 34 | 
            +
            - Not supported: Intel macOS / Linux / Windows (consider a GGUF build + llama.cpp instead).
         | 
| 35 | 
            +
            - Memory guidance: a large unified-memory configuration is recommended (for example, 64 GB or more; 96 GB provides comfortable headroom). The GPU working set is capped by Metal’s memory budget, which is typically smaller than total unified memory, so keep 5–10% headroom.
         | 
| 36 | 
            +
             | 
| 37 | 
            +
            ## How to use (MLX)
         | 
| 38 |  | 
| 39 | 
             
            ```bash
         | 
| 40 | 
             
            pip install mlx-lm
         | 
|  | |
| 44 | 
             
            from mlx_lm import load, generate
         | 
| 45 |  | 
| 46 | 
             
            model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64")
         | 
| 47 | 
            +
            print(generate(
         | 
| 48 | 
            +
                model, tokenizer,
         | 
| 49 | 
            +
                prompt="Explain the Chudnovsky algorithm to compute π.",
         | 
| 50 | 
            +
                max_tokens=256, max_kv_size=512
         | 
| 51 | 
            +
            ))
         | 
| 52 | 
            +
            ```
         | 
| 53 | 
            +
             | 
| 54 | 
            +
            ```bash
         | 
| 55 | 
            +
            python -m mlx_lm generate --model halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64 \
         | 
| 56 | 
            +
              --prompt "Explain the Chudnovsky algorithm to compute pi." \
         | 
| 57 | 
            +
              --max-kv-size 512 --max-tokens 256
         | 
| 58 | 
            +
            ```
         | 
| 59 | 
            +
             | 
| 60 | 
            +
            ## Evaluation
         | 
| 61 | 
            +
             | 
| 62 | 
            +
            Streaming perplexity (PPL) on the WikiText-2 raw test split, using the fast preset: non-overlapping 4096-token windows (`window=stride=4096`), about 100k tokens, with EOS inserted between documents.
         | 
| 63 | 
            +
             | 
| 64 | 
            +
            | Variant                 | PPL (ctx=4096, fast)                   |
         | 
| 65 | 
            +
            |-------------------------|----------------------------------------|
         | 
| 66 | 
            +
            | MLX bf16 (reference)    | 5.14                                   |
         | 
| 67 | 
            +
            | MLX 6-bit (gs=64)       | 5.14 (≈0.0% vs bf16)                   |
         | 
| 68 | 
            +
            | MLX 5-bit (gs=32)       | 5.20 (+1.2% vs bf16, +1.2% vs 6b/gs64) |
         | 
| 69 | 
            +
            | MLX 4-bit (gs=64)       | 5.43 (+5.6% vs bf16, +5.6% vs 6b/gs64) |
         | 
| 70 | 
            +
             | 
| 71 | 
            +
            ### Interpretation
         | 
| 72 | 
            +
             | 
| 73 | 
            +
            - 4-bit gs64 has the smallest footprint, at the cost of a modest PPL increase (+5.6% versus both bf16 and 6-bit gs64 on this corpus).
         | 
| 74 | 
            +
            - 5-bit gs32 is a strong “quality‑light” option if you can spare ~15 GB more.
         | 
| 75 | 
            +
            - 6-bit gs64 matches bf16 on this corpus and is the quality pick.
         | 
| 76 | 
            +
             | 
| 77 | 
            +
            Reproduce locally:
         | 
| 78 |  | 
| 79 | 
            +
            ```bash
         | 
| 80 | 
            +
            python python/scripts/test_perplexity-mlx.py \
         | 
| 81 | 
            +
              --model_path "/path/to/Qwen3-Next-80B-A3B-Instruct-4bit-gs64" \
         | 
| 82 | 
            +
              --fast --progress
         | 
| 83 | 
            +
            ```
         | 
| 84 | 
            +
             | 
| 85 | 
            +
            ## Conversion details (provenance)
         | 
| 86 | 
            +
             | 
| 87 | 
            +
            ```bash
         | 
| 88 | 
            +
            python -m mlx_lm convert \
         | 
| 89 | 
            +
              --hf-path Qwen3-Next-80B-A3B-Instruct \
         | 
| 90 | 
            +
              --mlx-path /path/to/Qwen3-Next-80B-A3B-Instruct-4bit-gs64 \
         | 
| 91 | 
            +
              -q --q-bits 4 --q-group-size 64
         | 
| 92 | 
            +
            ```
         | 
| 93 | 
            +
             | 
| 94 | 
            +
            - Some tensors (for example, embeddings/norms/router) may remain 16-bit for numerical stability.
         | 
| 95 |  | 
| 96 | 
            +
            ## Sibling & reference models
         | 
| 97 |  | 
| 98 | 
            +
            - halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64
         | 
| 99 | 
            +
            - halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-5bit-gs32
         | 
| 100 | 
            +
             | 
| 101 | 
            +
            ## Verify quantization
         | 
| 102 | 
            +
             | 
| 103 | 
            +
            ```bash
         | 
| 104 | 
            +
            jq '.quantization | {bits, group_size}' /path/to/export/config.json
         | 
| 105 | 
             
            ```
         | 
| 106 | 
            +
             | 
| 107 | 
            +
            ## Limitations and biases
         | 
| 108 | 
            +
             | 
| 109 | 
            +
            Compared to the 5-bit and 6-bit builds, Q4 may show small but noticeable quality drops (for example, higher perplexity or weaker instruction following on some tasks). Choose this build when footprint and throughput matter more than maximum accuracy.
         | 
| 110 | 
            +
             | 
| 111 | 
            +
            ## License and credits
         | 
| 112 | 
            +
             | 
| 113 | 
            +
            - License: apache-2.0 (inherits from the base model)
         | 
| 114 | 
            +
            - Base model: Qwen/Qwen3-Next-80B-A3B-Instruct
         | 
| 115 | 
            +
            - Quantization: Halley AI Lab (MLX Q4, gs=64)
         | 
| 116 | 
            +
            - Please cite both the base model and this repository when you use the weights.
         | 
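
The percentage columns in the Evaluation table can be rechecked from the reported PPL values. The following is a minimal sketch, not part of the repository's evaluation script; the helper name `rel_change` and the dictionary keys are illustrative.

```python
# Perplexities as reported in the Evaluation table (WikiText-2, ctx=4096, fast preset).
ppl = {
    "bf16": 5.14,
    "6bit-gs64": 5.14,
    "5bit-gs32": 5.20,
    "4bit-gs64": 5.43,
}


def rel_change(value: float, baseline: float) -> float:
    """Relative change of `value` versus `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Print each variant with its change relative to the bf16 reference.
for name, value in ppl.items():
    delta = rel_change(value, ppl["bf16"])
    print(f"{name}: {value:.2f} ({delta:+.1f}% vs bf16)")
```

Because the 6-bit and bf16 values are identical at two decimals, the change versus 6-bit gs64 equals the change versus bf16, which is why the table shows the same figure twice.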

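
Where `jq` is not available, the quantization check can be done with the Python standard library instead. This sketch assumes, as the `jq` query in the README does, that `config.json` has a top-level `quantization` object with `bits` and `group_size` keys; the function name `read_quantization` is hypothetical.

```python
import json
from pathlib import Path


def read_quantization(config_path):
    """Return (bits, group_size) from an MLX export's config.json.

    Assumes a top-level "quantization" object, mirroring the jq query.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    quant = config.get("quantization")
    if quant is None:
        raise KeyError("no 'quantization' section in config.json")
    return quant["bits"], quant["group_size"]


# Example usage (the path is a placeholder, as in the jq command):
# bits, group_size = read_quantization("/path/to/export/config.json")
# For this build the expected result would be bits=4, group_size=64.
```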
