---
license: apache-2.0
language:
- en
library_name: mlx
pipeline_tag: text-generation
base_model:
- Qwen/QwQ-32B
tags:
- quantization
- mlx-q5
- mlx==0.26.2
- q5
- qwq
- reasoning
- m3-ultra
---

# QwQ-32B MLX Q5 Quantization

This is a **Q5 (5-bit) quantized** version of the QwQ-32B reasoning model, optimized for MLX on Apple Silicon. This quantization offers an excellent balance between model quality and size, specifically designed for high-memory Apple Silicon systems like the M3 Ultra.

## Model Details

- **Base Model**: Qwen/QwQ-32B
- **Quantization**: Q5 (5-bit) with group size 64
- **Format**: MLX (Apple Silicon optimized)
- **Size**: 21GB (from original 61GB bfloat16)
- **Compression**: 66% size reduction
- **Architecture**: Qwen2 with reasoning capabilities
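
The ~21GB figure follows from the per-weight storage cost. MLX's affine quantization stores, for each group of 64 weights, a 16-bit scale and a 16-bit bias alongside the 5-bit values, so each weight costs roughly 5.5 bits. A rough back-of-the-envelope check (the parameter count is approximate and unquantized layers are ignored):

```python
# Approximate on-disk size of QwQ-32B at different precisions.
params = 32.5e9                      # ~32.5B parameters (approximate)

bf16_gib = params * 16 / 8 / 2**30   # 16 bits per weight
# Q5, group size 64: 5 bits per weight + (16-bit scale + 16-bit bias) per 64 weights
q5_bits_per_weight = 5 + 32 / 64
q5_gib = params * q5_bits_per_weight / 8 / 2**30

print(f"bfloat16: ~{bf16_gib:.0f} GiB")  # ~61 GiB
print(f"Q5/gs64:  ~{q5_gib:.0f} GiB")    # ~21 GiB
```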

## Why Q5?

Q5 quantization provides:
- **Superior quality** compared to Q4 while being smaller than Q6/Q8
- **Optimal size** for 128GB+ Apple Silicon systems
- **Minimal quality loss** - retains ~98% of original model capabilities
- **Fast inference** with MLX's unified memory architecture

## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- macOS 13.0+
- Python 3.11+
- MLX 0.26.0+
- mlx-lm 0.22.5+
- 32GB+ RAM recommended (64GB+ for full 128k context)

## Installation

```bash
# Using uv (recommended)
uv add "mlx>=0.26.0" mlx-lm transformers

# Or with pip (untested; uv is recommended)
pip install "mlx>=0.26.0" mlx-lm transformers
```
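
To confirm the environment meets the version requirements, a quick check (field names in `device_info()` may differ between MLX releases):

```python
import mlx.core as mx
import mlx_lm

# Report the installed versions and the unified memory visible to Metal.
print("mlx:   ", mx.__version__)
print("mlx-lm:", mlx_lm.__version__)
print("device:", mx.metal.device_info())
```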

## Usage

### Direct Generation

```bash
uv run mlx_lm.generate \
  --model LibraxisAI/QwQ-32B-MLX-Q5 \
  --prompt "Solve this step by step: If a train travels 120 km in 2 hours, what is its speed?" \
  --max-tokens 500
```

### Python API

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model
model, tokenizer = load("LibraxisAI/QwQ-32B-MLX-Q5")

# Generate text with reasoning
prompt = "Think step by step: What are the implications of Q5 quantization for LLM deployment?"
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=1000,
    # Recent mlx-lm takes sampling settings via a sampler rather than temp=
    sampler=make_sampler(temp=0.7),
)
print(response)
```
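
QwQ is a chat-tuned reasoning model, so for conversational use the prompt should generally be wrapped in the model's chat template before generation. A minimal sketch (the question is illustrative):

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/QwQ-32B-MLX-Q5")

# Apply the chat template so the model sees its expected role formatting.
messages = [{"role": "user", "content": "How many prime numbers are there below 50?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model=model, tokenizer=tokenizer, prompt=prompt, max_tokens=800)
print(response)
```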

### HTTP Server

```bash
uv run mlx_lm.server \
  --model LibraxisAI/QwQ-32B-MLX-Q5 \
  --host 0.0.0.0 \
  --port 8080
```
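
The server exposes an OpenAI-compatible chat completions endpoint, so any OpenAI-style client can talk to it. A minimal sketch using only the Python standard library (host, port, and payload values match the server command above):

```python
import json
import urllib.request

# Query the local mlx_lm.server instance via its OpenAI-compatible endpoint.
payload = {
    "model": "LibraxisAI/QwQ-32B-MLX-Q5",
    "messages": [{"role": "user", "content": "Explain Q5 quantization in two sentences."}],
    "max_tokens": 300,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```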

## Performance Benchmarks

Tested on Mac Studio M3 Ultra (512GB):

| Metric | Value |
|--------|-------|
| Model Size | 21GB |
| Peak Memory Usage | ~25GB |
| Generation Speed | ~12-15 tokens/sec |
| Max Context Length | 131,072 tokens (128k) |
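
These figures can be reproduced approximately with the Python API: `verbose=True` prints prompt and generation throughput after the run, and peak memory can be read back from MLX (a sketch; absolute numbers will vary with prompt length and hardware):

```python
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/QwQ-32B-MLX-Q5")

# verbose=True prints prompt/generation tokens-per-sec after the run.
generate(
    model=model,
    tokenizer=tokenizer,
    prompt="Summarize the benefits of 5-bit quantization.",
    max_tokens=256,
    verbose=True,
)

print(f"Peak memory: {mx.get_peak_memory() / 2**30:.1f} GiB")
```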

## Special Features

QwQ (Qwen with Questions) is designed for:
- **Deep reasoning** and step-by-step problem solving
- **Mathematical reasoning** and logical deduction
- **Code generation** with explanations
- **Self-reflection** and error correction

## Limitations

⚠️ **Important**: As of this quant's release date, this Q5 model **is NOT compatible** with LM Studio (**yet**), which only supports 2-, 3-, 4-, 6-, and 8-bit quantizations, and we have not tested it with Ollama or any other inference client. **Use MLX directly or via the MLX server** - we provide a command-generation script (`mlx-serve.sh`, see below) to launch the server with the right parameters.

## Conversion Details

This model was quantized using:
```bash
uv run mlx_lm.convert \
  --hf-path Qwen/QwQ-32B \
  --mlx-path QwQ-32B-MLX-Q5 \
  --dtype bfloat16 \
  -q --q-bits 5 --q-group-size 64
```
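
The converted directory records these settings in its `config.json`, which is a quick way to verify the quantization parameters of the released weights (a sketch; the path is illustrative):

```python
import json
from pathlib import Path

# Inspect the converted model's config to confirm the quantization settings.
config = json.loads(Path("QwQ-32B-MLX-Q5/config.json").read_text())
print(config.get("quantization"))  # expected: {"group_size": 64, "bits": 5}
```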

## Frontier M3 Ultra Optimization

This model is specifically optimized for the Mac Studio M3 Ultra setup with 512GB unified memory. For best performance:

```python
import mlx.core as mx

# Raise the memory limits for large models
# (recent MLX exposes these at the top level rather than under mx.metal)
mx.set_memory_limit(100 * 1024**3)  # 100GB
mx.set_cache_limit(20 * 1024**3)    # 20GB cache
```

## Tools Included

We provide utility scripts for easy model management:

1. **convert-to-mlx.sh** - Command-generation tool for converting any model to MLX format, with extensive customization options and Q5 quantization support on mlx>=0.26.0
2. **mlx-serve.sh** - Launch MLX server with custom parameters

## Historical Note

The LibraxisAI Q5 models were among the **first Q5 quantized MLX models** available on Hugging Face, pioneering the use of 5-bit quantization for Apple Silicon optimization.

## Citation

If you use this model, please cite:

```bibtex
@misc{qwq-32b-q5-mlx,
  author = {LibraxisAI},
  title = {QwQ-32B Q5 MLX - Reasoning Model for Apple Silicon},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/LibraxisAI/QwQ-32B-MLX-Q5}
}
```

## License

This model follows the original QwQ license (Apache-2.0). See the [base model card](https://huggingface.co/Qwen/QwQ-32B) for full details.

## Authors of the repository

- [Monika Szymanska](https://github.com/m-szymanska)
- [Maciej Gad, DVM](https://div0.space)

## Acknowledgments

- Apple MLX team and community for the amazing 0.26.0+ framework
- Qwen team for the innovative QwQ reasoning model
- Klaudiusz-AI 🐉