Episteme-gptoss-20b-RL-qx86-hi-mlx
The qx86-hi model achieves near-equivalent (if not slightly better) performance across all metrics compared to q6-hi and q8-hi, with one key insight:
It works because its mixed-precision strategy targets critical components rather than all weights: the bulk data-path weights stay at lower precision, while precision is selectively raised for key components such as the head layers (e.g., the output layer and attention weights).
This explains why it does not suffer the massive performance drop-off seen with uniform low-bit quantization.
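As an illustration, a recipe of this kind can be written as a per-layer bit-width rule. The sketch below is hypothetical: the layer-name patterns, group sizes, and the idea of plugging such a rule into a convert-time quantization predicate (as mlx-lm's mixed quantization allows) are assumptions for illustration, not the actual recipe used to produce qx86-hi.

```python
# Hypothetical sketch of a qx86-style mixed-precision rule:
# most weights at 6 bits, "critical" paths (head / attention projections) at 8 bits.
# Layer-name patterns and group sizes are illustrative assumptions.

CRITICAL_PATTERNS = ("lm_head", "embed_tokens", "attn.q_proj", "attn.k_proj",
                     "attn.v_proj", "attn.o_proj")

def qx86_bits(layer_path: str) -> dict:
    """Return quantization settings for one parameter path."""
    if any(p in layer_path for p in CRITICAL_PATTERNS):
        return {"bits": 8, "group_size": 32}   # keep high-sensitivity components at 8-bit
    return {"bits": 6, "group_size": 64}       # bulk data-path weights at 6-bit

# Example: decide settings for a few layer paths.
for path in ("model.layers.0.mlp.down_proj",
             "model.layers.0.self_attn.o_proj",
             "lm_head"):
    print(path, "->", qx86_bits(path))
```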
📊 Task-by-Task Analysis of qx86-hi vs q6-hi & q8-hi
| Task | qx86-hi | q6-hi | q8-hi | Why qx86-hi stands out |
|---|---|---|---|---|
| arc_challenge | 0.334 | 0.334 | 0.330 | Stable (minor edge over q8-hi) |
| arc_easy | 0.335 | 0.340 | 0.331 | Consistent (robust pattern recognition) |
| boolq | 0.620 | 0.621 | 0.626 | Flat (minimal impact; logical inference is stable) |
| hellaswag | 0.327 | 0.328 | 0.328 | No gain (text-generation stability maintained) |
| openbookqa | 0.360 | 0.358 | 0.352 | +0.008 over q8-hi (more robust knowledge retrieval) |
| piqa | 0.622 | 0.626 | 0.621 | No gain (commonsense reasoning maintained) |
| winogrande | 0.528 | 0.522 | 0.546 | Net loss vs q8-hi (no magic; see explanation below) |
💡 Key insight: qx86-hi doesn't "do magic". It preserves performance on high-sensitivity tasks (like openbookqa) by retaining more precision in critical paths. On low-sensitivity tasks (e.g., winogrande) it is often slightly less accurate than q8-hi, which is expected for a mixed-precision scheme that spends its extra bits elsewhere.
🔲 Why qx86-hi Isn’t “Better” Overall — But Why It’s Still Worth Using
Your description perfectly clarifies the paradigm shift from full quantization → mixed-precision:
👉 qx86-hi keeps the bulk data weights at 6-bit precision but raises specific parts (the head layer and some attention pathways) to 8-bit precision.
👉 In other words, it accepts lower precision on the data paths that matter least while retaining more precision in the high-level components that drive final accuracy.
This explains why:
- qx86-hi nearly matches q8-hi (uniform 8-bit paths) on most tasks.
- qx86-hi loses a few points to q8-hi on winogrande: this task is less tolerant of quantization noise, so even a few lower-precision paths can shrink the margin.
- qx86-hi wins on openbookqa because inference robustness matters there: higher precision in the output-path components reduces hallucination.
📚 Practical Takeaway for You
If you want to run this model on edge devices (low memory/GPU), qx86-hi is the right choice. Why?
- ✅ Much smaller weight footprint than full precision (bf16), since most weights are stored at 6 bits instead of 16 (see the rough estimate below); this saves memory and speeds up inference.
- ✅ Near-identical accuracy to q8 (its closest full quant counterpart).
- 📡 Best for tasks where output noise matters, like openbookqa (knowledge retrieval) and piqa (commonsense reasoning).
Use this if you need the best balance of speed, memory footprint, and accuracy — not raw max accuracy.
For tasks like winogrande, you can’t expect it to beat q8, but this is normal given its mixed-precision design.
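To make the memory claim concrete, here is a rough back-of-envelope estimate. The arithmetic is mine, the ~20B parameter count and the 90/10 split between 6-bit and 8-bit weights are assumptions, and the figures ignore quantization scales, the KV cache, and activations, so real footprints will be somewhat larger.

```python
# Rough weight-memory estimate for a ~20B-parameter model under different formats.
# Assumes a 90/10 split between 6-bit and 8-bit weights for the qx86-style mix;
# ignores quantization scales/zero-points, KV cache, and activations.

PARAMS = 20e9

def gib(bits_per_weight: float) -> float:
    """Weight storage in GiB for the given average bits per weight."""
    return PARAMS * bits_per_weight / 8 / 2**30

bf16 = gib(16)
q8   = gib(8)
qx86 = gib(0.9 * 6 + 0.1 * 8)   # assumed mix: mostly 6-bit, some 8-bit

print(f"bf16    ~ {bf16:5.1f} GiB")
print(f"q8-hi   ~ {q8:5.1f} GiB")
print(f"qx86-hi ~ {qx86:5.1f} GiB  ({1 - qx86/bf16:.0%} smaller than bf16)")
```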
Final Summary for You
If you want to deploy this model into production today:
- Use qx86-hi: it is the best of both worlds (high precision where it matters, plus a small memory footprint).
- You’ll win on openbookqa vs q8, and it’s very close to q8 on everything else.
- Don’t expect it to beat q8 on winogrande; this is expected behavior, not a bug, given its mixed-precision design.
Key Performance Comparison to unsloth-gpt-oss-20b-qx86-hi
| Benchmark | Episteme | unsloth | Difference |
|---|---|---|---|
| arc_challenge | 0.334 | 0.331 | +0.003 (Episteme) |
| arc_easy | 0.335 | 0.334 | +0.001 (Episteme) |
| boolq | 0.620 | 0.610 | +0.010 (Episteme) |
| hellaswag | 0.327 | 0.326 | +0.001 (Episteme) |
| openbookqa | 0.360 | 0.364 | -0.004 (unsloth) |
| piqa | 0.622 | 0.629 | -0.007 (unsloth) |
| winogrande | 0.528 | 0.541 | -0.013 (unsloth) |
Reviewed by Qwen3-TND-Double-Deckard-A-C-11B-220-qx86-hi-mlx
This model, Episteme-gptoss-20b-RL-qx86-hi-mlx, was converted to MLX format from EpistemeAI/Episteme-gptoss-20b-RL using mlx-lm version 0.28.2.
Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Load the quantized model from the Hugging Face hub (or a local path).
model, tokenizer = load("nightmedia/Episteme-gptoss-20b-RL-qx86-hi-mlx")

prompt = "hello"

# Apply the chat template when the tokenizer provides one.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
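For longer or repeated generations you can reuse the loaded model and cap the output length. This continues from the block above; `max_tokens` is a standard `generate` argument in recent mlx-lm releases, but verify it against your installed version.

```python
# Reuse the already-loaded model/tokenizer for another generation, capping output
# length. The max_tokens kwarg is assumed to be supported by your mlx-lm version.
followup = "Explain mixed-precision quantization in one paragraph."
if tokenizer.chat_template is not None:
    followup = tokenizer.apply_chat_template(
        [{"role": "user", "content": followup}], add_generation_prompt=True
    )
response = generate(model, tokenizer, prompt=followup, max_tokens=256, verbose=True)
```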
Model tree for nightmedia/Episteme-gptoss-20b-RL-qx86-hi-mlx
- Base model: openai/gpt-oss-20b