Abstract
A synthetic reasoning dataset called CHIMERA is introduced to overcome data-centric challenges in training large language models for cross-domain reasoning, achieving performance comparable to much larger models.
Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by post-training with supervised fine-tuning (SFT) and reinforcement learning (RL) on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with the detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) narrow domain coverage, as most existing open-source reasoning datasets concentrate on mathematics and touch only lightly on broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset of 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it offers broad, structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline in which strong reasoning models cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
Community
We introduce CHIMERA, a compact but high-difficulty synthetic reasoning dataset with long Chain-of-Thought trajectories and broad multi-disciplinary coverage, designed for reasoning post-training of large language models.
Dataset
CHIMERA contains 9,225 expert-level problems spanning 8 subjects (Mathematics, Computer Science, Chemistry, Physics, Literature, History, Biology, Linguistics) and 1,179 fine-grained topics, all synthesized by GPT-5. Each problem comes with:
- A concise ground-truth answer and an authoritative reference solution (both GPT-5-generated),
- A long-form model solution with thinking traces from Qwen3-235B-A22B-Thinking-2507 or Qwen3.5-397B-A17B,
- Automated correctness labels from a GPT-5 + o4-mini verification panel.
Unlike existing reasoning datasets that are heavily math-focused or limited in solution length, CHIMERA provides structured domain diversity and long-horizon reasoning traces without any human annotation. Try our dataset here: TianHongZXY/CHIMERA 🤗.
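The automated correctness labeling described above can be sketched as a simple agreement rule between verifier models. This is an illustrative sketch only: the `Verdict` structure, the verifier names in the comments, and the unanimous-agreement rule are assumptions, not the paper's exact pipeline.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Judgment from one verifier model on a synthetic sample
    (e.g., one member of a GPT-5 + o4-mini panel)."""
    problem_valid: bool    # is the problem well-posed?
    answer_correct: bool   # does the reference answer check out?

def keep_sample(verdicts: list[Verdict]) -> bool:
    """Assumed cross-validation rule: retain a synthetic sample only if
    every verifier agrees the problem is valid AND the answer is correct."""
    return all(v.problem_valid and v.answer_correct for v in verdicts)

# Illustrative run with mocked verifier outputs (no API calls):
print(keep_sample([Verdict(True, True), Verdict(True, True)]))   # True
print(keep_sample([Verdict(True, True), Verdict(True, False)]))  # False
```

In practice the verdicts would come from prompting the panel models; the point of the rule is that a sample survives only when independent strong reasoners agree on both validity and correctness.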
Models
We train Qwen3-4B-Thinking-2507 on CHIMERA through SFT followed by RL, yielding gains on nearly all major reasoning benchmarks:
| Benchmark | Qwen3-4B-Thinking-2507 | CHIMERA-4B-SFT | CHIMERA-4B-RL |
|---|---|---|---|
| GPQA-Diamond | 65.8 | 68.8 | 70.1 |
| AIME 2024 | 81.6 | 86.5 | 86.9 |
| AIME 2025 | 81.0 | 79.8 | 80.7 |
| AIME 2026 | 80.8 | 80.3 | 82.7 |
| HMMT Feb 2025 | 59.2 | 63.1 | 65.7 |
| HMMT Nov 2025 | 57.3 | 66.3 | 67.0 |
| HLE | 7.3 | 9.0 | 9.0 |
Models: CHIMERA-4B-SFT | CHIMERA-4B-RL
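As a quick sanity check on the table above, the per-benchmark improvement of the RL model over the base model can be computed directly. The numbers are copied from the table; the script is plain arithmetic, not part of the paper's pipeline.

```python
# Benchmark scores from the table above: (base, SFT, RL)
scores = {
    "GPQA-Diamond":  (65.8, 68.8, 70.1),
    "AIME 2024":     (81.6, 86.5, 86.9),
    "AIME 2025":     (81.0, 79.8, 80.7),
    "AIME 2026":     (80.8, 80.3, 82.7),
    "HMMT Feb 2025": (59.2, 63.1, 65.7),
    "HMMT Nov 2025": (57.3, 66.3, 67.0),
    "HLE":           (7.3, 9.0, 9.0),
}

# Improvement of CHIMERA-4B-RL over the untrained base model.
deltas = {name: round(rl - base, 1) for name, (base, _, rl) in scores.items()}
avg_gain = round(sum(deltas.values()) / len(deltas), 2)
print(deltas)    # AIME 2025 is the one benchmark with a small regression
print(avg_gain)
```

The output shows gains on six of seven benchmarks (largest on HMMT Nov 2025, +9.7) and a small regression on AIME 2025 (-0.3), which is why the gains are best described as near-uniform rather than universal.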