Disappointment in text performance
#1 by BigBlueWhale
The Tragedy of the 32B: How Synthetic Sludge Diluted a Miracle
To understand the magnitude of this disappointment, we must first acknowledge the starting point. The original Qwen3-32B (April 2025) was nothing short of a miracle in the open-source landscape. It defined reliability, model stability, and generalization, effectively holding the title of the #1 open-source model. Its "Thinking" variant was a powerhouse of knowledge and coding ability, setting a baseline so high that any regression feels like a betrayal.
Yet, the Qwen3-VL Technical Report reveals a painful reality: while the engineering team managed to resuscitate the "Instruct" line, they simultaneously broke the "Thinking" line.
- The April Miracle vs. The November Regression
The data in the report paints a stark picture of how the "Thinking" capabilities degraded between the pure-text miracle and the multimodal release. The Qwen3-VL-32B-Thinking model has not just stagnated; in critical reasoning and creative benchmarks, it has actively regressed.
- The LiveBench Collapse: This is the most damning metric. The original Qwen3-32B-Thinking (Text) scored a massive 85.0 on LiveBench (2024-11-25). The new Qwen3-VL-32B-Thinking dropped significantly to 76.8. This is not a margin of error; this is a loss of fundamental generalization ability.
- Creative Writing Decay: The original Qwen3-32B-Thinking delivered a score of 84.4 on Creative Writing v3. The VL variant, despite all its additional training, fell to 83.3. The "soul" of the model—its ability to write with nuance—was dampened.
- Agentic Stagnation: In complex agentic tasks like TAU2-Retail, the text model scored 59.6, while the VL model stagnated at 59.4. The miracle model gained nothing from the new training pipeline.
- The Instruct Anomaly: Repairing the Broken
Conversely, the data confirms that the April 2025 Qwen3-32B-Instruct (non-thinking) was indeed "broken" at release, and the VL training acted as a necessary patch.
- Arena-Hard Resurrection: The original text-only Instruct model scored an abysmal 37.4 on Arena-Hard. The VL training catapulted this to 64.7.
- Math Competency: The text-only Instruct model was barely functional in math (AIME-25 score of 20.2), whereas the VL Instruct model jumped to 66.2.
This creates a frustrating dynamic: The VL training fixed the flawed Instruct model but ruined the perfect Thinking model.
- The Root Cause: Poisoned by 30B-A3B Synthetic Data
The technical report proudly details the use of "Strong-to-Weak Distillation" and extensive pipelines for "synthetic data generation" across math, code, and grounding.
This reliance on synthetic data is the smoking gun. The degradation in the 32B Thinking model's quality strongly suggests that the synthetic training data was generated by the inferior 30B-A3B MoE models.
- The Quality Gap: The report shows the Qwen3-VL-30B-A3B consistently performing lower than the 32B Dense in key reasoning benchmarks (e.g., LiveBench 72.1 vs 76.8).
- The Contamination: By relying on "distillation" and synthetic pipelines that likely leveraged the computationally cheaper 30B-A3B to generate tokens for the dense model, the team polluted the "miracle" 32B weights with 30B-A3B-quality output (roughly the loop sketched below).
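To be concrete about the mechanism being alleged, here is a minimal sketch of what a sequence-level strong-to-weak distillation loop could look like, assuming the cheaper 30B-A3B acts as the teacher that generates synthetic completions and the dense 32B is fine-tuned to imitate them. The report does not publish its actual pipeline; the model names, prompt, hyperparameters, and single-GPU-style training loop here are illustrative only.

```python
# Hypothetical sketch of sequence-level "strong-to-weak" distillation:
# a teacher model generates synthetic completions, and the student is
# fine-tuned on those tokens with plain next-token cross-entropy.
# Model names, prompt, and hyperparameters are assumptions, not the
# pipeline described in the technical report.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "Qwen/Qwen3-30B-A3B"  # the MoE model suspected of generating the data
student_name = "Qwen/Qwen3-32B"      # the dense model being "polluted"

tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(
    teacher_name, torch_dtype=torch.bfloat16, device_map="auto"
)
student = AutoModelForCausalLM.from_pretrained(
    student_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Real training would use a proper distributed setup; this is a sketch.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompts = ["Prove that the sum of two even integers is even."]  # placeholder data

for prompt in prompts:
    inputs = tok(prompt, return_tensors="pt").to(teacher.device)

    # 1) Teacher generates a synthetic completion (the "synthetic data").
    with torch.no_grad():
        generated = teacher.generate(
            **inputs, max_new_tokens=256, do_sample=True, temperature=0.7
        )

    # 2) Student is trained to imitate the teacher's tokens verbatim,
    #    masking the prompt so loss is computed only on the completion.
    labels = generated.clone()
    labels[:, : inputs["input_ids"].shape[1]] = -100
    out = student(
        input_ids=generated.to(student.device), labels=labels.to(student.device)
    )

    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

If this is even approximately what happened, the student's ceiling on reasoning-heavy data is capped by whatever the teacher can produce, which is exactly the failure mode the benchmark drops above would predict.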
There is nothing worse than polluting a dense masterpiece with MoE synthetic sludge. The precipitous drop in LiveBench scores (85.0 → 76.8) is the definitive proof that the data mixture used to train the Qwen3-VL-32B Thinking variant lacked the purity and reliability of the original April 2025 release.