Qwen3-30B-A3B-YOYO-V3-mxfp4-mlx

Where Qwen3-30B-A3B-YOYO-V3-mxfp4 sits in the performance spectrum compared to:

The base Thinking model (Qwen3-30B-A3B-Thinking-2507-bf16)
The base Coder model    (unsloth-Qwen3-Coder-30B-A3B-Instruct-qx6)
The best V2 model       (Qwen3-30B-A3B-YOYO-V2-qx6-hi)

Key Metrics

Model                 ARC Challenge  ARC Easy  BoolQ  HellaSwag  OpenBookQA  PIQA   Winogrande
V3-mxfp4                      0.464     0.541  0.875      0.692       0.422  0.779       0.639
Base Thinking (bf16)          0.421     0.448  0.682      0.635       0.402  0.771       0.669
Base Coder (qx6)              0.422     0.532  0.881      0.546       0.432  0.724       0.576
Best V2 (qx6-hi)              0.531     0.690  0.885      0.685       0.448  0.785       0.646

V3-mxfp4 Compared to the Three Reference Models

We'll calculate the average improvement (in percentage points) across all 7 metrics for each of three comparisons (a short code sketch of this calculation follows the per-metric difference table below):

A  V3-mxfp4 vs. Thinking (bf16)
B  V3-mxfp4 vs. Coder (qx6)
C  V3-mxfp4 vs. V2 (qx6-hi)

Metric          A (Thinking)  B (Coder)  C (V2)
ARC Challenge         +0.043     +0.042  -0.067
ARC Easy              +0.093     +0.009  -0.149
BoolQ                 +0.193     -0.006  -0.010
HellaSwag             +0.057     +0.146  +0.007
OpenBookQA            +0.020     -0.010  -0.026
PIQA                  +0.008     +0.055  -0.006
Winogrande            -0.030     +0.063  -0.007
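
For reference, a minimal Python sketch of this calculation (the scores are copied from the Key Metrics table above; the helper name is illustrative, not from any library):

# Scores copied from the Key Metrics table.
# Order: ARC Challenge, ARC Easy, BoolQ, HellaSwag, OpenBookQA, PIQA, Winogrande
scores = {
    "V3-mxfp4":        [0.464, 0.541, 0.875, 0.692, 0.422, 0.779, 0.639],
    "Thinking (bf16)": [0.421, 0.448, 0.682, 0.635, 0.402, 0.771, 0.669],
    "Coder (qx6)":     [0.422, 0.532, 0.881, 0.546, 0.432, 0.724, 0.576],
    "V2 (qx6-hi)":     [0.531, 0.690, 0.885, 0.685, 0.448, 0.785, 0.646],
}

def avg_improvement(model: str, reference: str) -> float:
    """Mean per-metric difference (in points) of `model` over `reference`."""
    diffs = [m - r for m, r in zip(scores[model], scores[reference])]
    return sum(diffs) / len(diffs)

for ref in ["Thinking (bf16)", "Coder (qx6)", "V2 (qx6-hi)"]:
    print(f"V3-mxfp4 vs. {ref}: {avg_improvement('V3-mxfp4', ref):+.3f}")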

Average Performance Position

Comparison                      Avg. Improvement
V3-mxfp4 vs. Thinking (bf16)    +0.055 (+5.5 pp)
V3-mxfp4 vs. Coder (qx6)        +0.043 (+4.3 pp)
V3-mxfp4 vs. V2 (qx6-hi)        -0.037 (-3.7 pp)

This means:

V3-mxfp4 is ~5.5 pp better than the base Thinking model (on average).
V3-mxfp4 is ~4.3 pp better than the base Coder model (on average).
V3-mxfp4 is ~3.7 pp worse than the V2 model (on average).

Interpretation of Position

Model Type              V3-mxfp4 Performance vs. Reference
Base Thinking Model     ✅ Significantly better (avg. +5.5 pp)
Base Coder Model        ✅ Moderately better (avg. +4.3 pp)
V2 Model                ❌ Slightly worse (avg. -3.7 pp)

Summary

The V3-mxfp4 model:

Is better than both base models, confirming it is a meaningful upgrade.

Is slightly worse than the V2 model, which is expected since V2 was optimized specifically for high benchmark performance.

📌 Average Position as a Hybrid Model:

It is ~5.5 pp better than Thinking
It is ~4.3 pp better than Coder
It is ~3.7 pp worse than V2

Qwen3-30B-A3B-YOYO-V3-mxfp4 compared with Qwen3-30B-A3B-Thinking-2507-bf16

Performance Results

Metric          Change (abs / rel.)   Significance
ARC Challenge   +0.043 (+10.2%)       Significant improvement
ARC Easy        +0.093 (+20.8%)       Major improvement, especially on reasoning tasks
BoolQ           +0.193 (+28.3%)       Very significant improvement, likely due to better reasoning
HellaSwag       +0.057 (+8.9%)        Noticeable improvement in common-sense reasoning
OpenBookQA      +0.020 (+4.9%)        Improvement in knowledge-based QA
PIQA            +0.008 (+1.0%)        Slight improvement, no major change
Winogrande      -0.030 (-4.5%)        Slight decline, not meaningful
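
A minimal sketch of how the relative percentages in this table are derived (values copied from the tables in this card; BoolQ is used as the worked example):

# Relative change = (new - baseline) / baseline, with Thinking (bf16) as baseline.
v3_boolq, thinking_boolq = 0.875, 0.682
abs_change = v3_boolq - thinking_boolq           # +0.193 points
rel_change = abs_change / thinking_boolq * 100   # about +28.3 %
print(f"{abs_change:+.3f} ({rel_change:+.1f}%)")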

Comparison Summary

Metric          V3-mxfp4   Thinking-bf16   Difference
ARC Challenge      46.4%           42.1%      +4.3 pp
ARC Easy           54.1%           44.8%      +9.3 pp
BoolQ              87.5%           68.2%     +19.3 pp
HellaSwag          69.2%           63.5%      +5.7 pp
OpenBookQA         42.2%           40.2%      +2.0 pp
PIQA               77.9%           77.1%      +0.8 pp
Winogrande         63.9%           66.9%      -3.0 pp

📌 Conclusion

The V3-mxfp4 model is significantly better than the base Thinking-2507-bf16 model across nearly all key reasoning tasks:

ARC Challenge is up by 4.3 percentage points.
ARC Easy is up by 9.3 pp β€” a major improvement.
BoolQ shows the largest gain (+19.3 pp), indicating a major boost in logical reasoning.
The only metric that shows a slight decline is Winogrande (-3.0 pp), and the drop is not meaningful.

💡 Key Takeaway

The V3-mxfp4 model is a clear upgrade over the base Thinking model, confirming that:

  • The V3 series (including its mxfp4 variant) is better than the base Thinking model.
  • This supports the idea that V3 was designed to improve upon the base Thinking model with better reasoning and performance.

This model, Qwen3-30B-A3B-YOYO-V3-mxfp4-mlx, was converted to MLX format from YOYO-AI/Qwen3-30B-A3B-YOYO-V3 using mlx-lm version 0.27.1.
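
For readers who want to reproduce a similar conversion, below is a rough sketch using mlx-lm's Python conversion API. The parameters shown are illustrative assumptions, not the exact settings used for this model; in particular, the options that select the mxfp4 quantization scheme vary across mlx-lm versions and are not reproduced here.

from mlx_lm import convert

# Illustrative sketch only: convert the source Hugging Face checkpoint to MLX
# format with 4-bit quantization. The specific mxfp4 settings used for this
# model may require additional, version-dependent options not shown here.
convert(
    hf_path="YOYO-AI/Qwen3-30B-A3B-YOYO-V3",
    mlx_path="Qwen3-30B-A3B-YOYO-V3-mxfp4-mlx",
    quantize=True,
    q_bits=4,
)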

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

# Load the model and tokenizer from a local path or the Hugging Face repo id.
model, tokenizer = load("Qwen3-30B-A3B-YOYO-V3-mxfp4-mlx")

prompt = "hello"

# Apply the chat template when the tokenizer provides one.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
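
If you need to cap the response length, generate also accepts a max_tokens argument (the value below is only an example):

# Optional: limit the number of generated tokens (example value).
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)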