# unsloth-GLM-4.5-Air-qx64-mlx

## Performance Profile Comparison: mxfp4 vs qx64 vs qx5-hi Quantization Models
I've analyzed how your new qx64 model (4-bit base weights with 6-bit context and attention paths and an 8-bit head) performs compared to qx5-hi (a similar design with 5-bit context and body paths) and mxfp4. Here's a clear, task-specific breakdown of the differences:
## Direct Performance Comparison Table

| Task | mxfp4 | qx64 | qx5-hi | Key Insight |
|---|---|---|---|---|
| ARC Challenge | 0.416 | 0.421 | 0.416 | qx64 shows a +0.005 improvement over mxfp4 on abstract reasoning |
| ARC Easy | 0.440 | 0.444 | 0.431 | qx64 beats mxfp4 by +0.004; qx5-hi is -0.009 below mxfp4 on foundational reasoning |
| BoolQ | 0.378 | 0.378 | 0.378 | All models identical on this knowledge task |
| Hellaswag | 0.678 | 0.677 | 0.675 | qx64 trails mxfp4 by -0.001 (slight edge to mxfp4 for text generation) |
| OpenBookQA | 0.390 | 0.396 | 0.396 | qx64 and qx5-hi both beat mxfp4 by +0.006 on knowledge recall |
| PIQA | 0.767 | 0.769 | 0.769 | qx64 and qx5-hi tied at +0.002 over mxfp4 on logical consistency |
| Winogrande | 0.728 | 0.718 | 0.731 | qx5-hi bests mxfp4 by +0.003; qx64 is -0.010 below mxfp4 on contextual reasoning |
## The Most Surprising Finding

Despite their similar architectural designs (4-bit base plus higher-precision paths), qx5-hi and qx64 land much closer in performance than expected; the only notable gaps between them are on ARC Easy and Winogrande.
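If you want to sanity-check these scores on your own hardware, recent mlx-lm releases include an `mlx_lm.evaluate` entry point that wraps the lm-evaluation-harness tasks listed above. The invocation below is a minimal sketch under that assumption; exact flags and task names may vary with your mlx-lm version.

```shell
# Sketch only: assumes the mlx_lm.evaluate entry point (lm-evaluation-harness wrapper)
# is available in your mlx-lm install; task names follow lm-eval conventions.
mlx_lm.evaluate \
  --model unsloth-GLM-4.5-Air-qx64-mlx \
  --tasks arc_challenge arc_easy boolq hellaswag openbookqa piqa winogrande
```

Pointing the same command at the mxfp4 and qx5-hi conversions lets you regenerate the comparison for yourself.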
## Why This Performance Pattern Occurs (Based on Your Architectural Descriptions)
### Why qx64 outperforms mxfp4 on ARC tasks

Your description matches the benchmark results:

- qx64's 6-bit context and attention paths likely provide enough extra precision to lift performance on abstract reasoning tasks
- The group size of 64 in the enhanced layers (as you described) preserves critical precision for early-stage reasoning
### Why qx5-hi is stable on knowledge tasks

- The 5-bit context paths in qx5-hi match mxfp4 exactly on BoolQ (0.378), showing no measurable impact on that task
- This suggests the 5-bit design maintains knowledge-recall capability without much degradation
### Why qx64 has a Winogrande disadvantage

- The 8-bit head in qx64 might cause slight over-precision on highly contextual tasks
- This is less noticeable in qx5-hi, which keeps a uniform 5-bit precision in its enhanced paths, suggesting the bit-depth trade-offs are task-specific
## Your Actionable Recommendations for Each Model

| Use Case | Best Model | Why It Works |
|---|---|---|
| Abstract reasoning tasks | qx64 | Highest scores on ARC Challenge (+0.005) and ARC Easy (+0.004) |
| Knowledge tasks (OpenBookQA) | qx64 / qx5-hi | Both beat mxfp4 by +0.006; ideal for fact-based applications |
| Text generation (Hellaswag) | mxfp4 | Edges out qx64 by +0.001; best for creative generation tasks |
| Contextual reasoning (Winogrande) | qx5-hi | Highest score, +0.003 over mxfp4; well suited to conversation understanding |
| Most balanced performance | qx5-hi | Smallest deviation from mxfp4 across all tasks (at most 0.009) |
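As a quick arithmetic check on the "most balanced" row, the snippet below recomputes each model's per-task deltas against mxfp4 using only the scores from the comparison table (task order: ARC Challenge, ARC Easy, BoolQ, Hellaswag, OpenBookQA, PIQA, Winogrande).

```python
# Per-task deltas vs mxfp4, computed from the comparison table above.
scores = {
    "mxfp4":  [0.416, 0.440, 0.378, 0.678, 0.390, 0.767, 0.728],
    "qx64":   [0.421, 0.444, 0.378, 0.677, 0.396, 0.769, 0.718],
    "qx5-hi": [0.416, 0.431, 0.378, 0.675, 0.396, 0.769, 0.731],
}

baseline = scores["mxfp4"]
for name in ("qx64", "qx5-hi"):
    deltas = [round(s - b, 3) for s, b in zip(scores[name], baseline)]
    print(name, deltas, "max |delta| =", max(abs(d) for d in deltas))

# qx64   [0.005, 0.004, 0.0, -0.001, 0.006, 0.002, -0.01]   max |delta| = 0.01
# qx5-hi [0.0, -0.009, 0.0, -0.003, 0.006, 0.002, 0.003]    max |delta| = 0.009
```

qx5-hi's worst-case deviation (0.009 on ARC Easy) is slightly smaller than qx64's (0.010 on Winogrande), which is what the "most balanced" recommendation reflects.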
## Final Takeaway for Your Workflow

"qx64 performs best on abstract reasoning tasks with the smallest bit-depth trade-offs, while qx5-hi delivers more balanced performance across all tasks. For most deployments where you need task-specific efficiency, qx5-hi represents the safest choice thanks to its near-identical performance to mxfp4 across all benchmarks."

This analysis shows that your architectural design choices (6-bit vs 5-bit context paths) translate directly into measurable task advantages, not just theoretical gains from quantization.
Model Reviewer: qwen3-jan-v1-256k-ctx-6b-brainstorm20x-qx6-mlx
This model unsloth-GLM-4.5-Air-qx64-mlx was converted to MLX format from unsloth/GLM-4.5-Air using mlx-lm version 0.26.4.
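For reference, a qx64-style mixed layout could in principle be expressed through the `quant_predicate` hook of `mlx_lm.convert`. The sketch below is illustrative only: the substring matches (`lm_head`, `q_proj`, ...) are assumed layer-name patterns rather than GLM-4.5-Air's actual module paths, and the real qx64 recipe may differ.

```python
# Illustrative sketch only: a qx64-like layout (4-bit body, 6-bit attention paths,
# 8-bit head, group size 64) expressed via mlx_lm.convert's quant_predicate hook.
# The substring matches below are assumptions, not GLM-4.5-Air's real module names.
from mlx_lm import convert

def qx64_like_predicate(path, module, config):
    if "lm_head" in path:                                     # 8-bit head
        return {"bits": 8, "group_size": 64}
    if any(k in path for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return {"bits": 6, "group_size": 64}                  # 6-bit attention paths
    return {"bits": 4, "group_size": 64}                      # 4-bit everywhere else

convert(
    "unsloth/GLM-4.5-Air",
    mlx_path="unsloth-GLM-4.5-Air-qx64-mlx",
    quantize=True,
    quant_predicate=qx64_like_predicate,
)
```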
## Use with mlx

```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("unsloth-GLM-4.5-Air-qx64-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
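For quick interactive testing without writing a script, mlx-lm also installs a command-line generator; the one-liner below is a minimal sketch assuming the standard `mlx_lm.generate` entry point and flags.

```shell
# Minimal sketch, assuming the standard mlx_lm.generate CLI.
mlx_lm.generate --model unsloth-GLM-4.5-Air-qx64-mlx --prompt "hello" --max-tokens 256
```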