Abstract
A synthetic reasoning dataset called CHIMERA is introduced to overcome data-centric challenges in training large language models for cross-domain reasoning, achieving performance comparable to much larger models.
Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by post-training with supervised fine-tuning (SFT) and reinforcement learning (RL) on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with the detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) narrow domain coverage, as most existing open-source reasoning datasets concentrate on mathematics and touch only lightly on broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset of 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it offers broad, structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline in which strong reasoning models cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
Community
We introduce CHIMERA, a compact but high-difficulty synthetic reasoning dataset with long Chain-of-Thought trajectories and broad multi-disciplinary coverage, designed for reasoning post-training of large language models.
Dataset
CHIMERA contains 9,225 expert-level problems spanning 8 subjects (Mathematics, Computer Science, Chemistry, Physics, Literature, History, Biology, Linguistics) and 1,179 fine-grained topics, all synthesized by GPT-5. Each problem comes with:
- A concise ground-truth answer and an authoritative reference solution (both GPT-5-generated),
- A long-form model solution with thinking traces from Qwen3-235B-A22B-Thinking-2507 or Qwen3.5-397B-A17B,
- Automated correctness labels from a GPT-5 + o4-mini verification panel.
Unlike existing reasoning datasets that are heavily math-focused or limited in solution length, CHIMERA provides structured domain diversity and long-horizon reasoning traces without any human annotation. Try our dataset here: TianHongZXY/CHIMERA 🤗.
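The automated correctness labeling described above can be sketched as a simple agreement rule between verifier models. This is an illustrative sketch only: the `Verdict` structure, the verifier names in the comments, and the unanimous-agreement rule are assumptions, not the paper's exact pipeline.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Judgment from one verifier model on a synthetic sample
    (e.g., one member of a GPT-5 + o4-mini panel)."""
    problem_valid: bool    # is the problem well-posed?
    answer_correct: bool   # does the reference answer check out?

def keep_sample(verdicts: list[Verdict]) -> bool:
    """Assumed cross-validation rule: retain a synthetic sample only if
    every verifier agrees the problem is valid AND the answer is correct."""
    return all(v.problem_valid and v.answer_correct for v in verdicts)

# Illustrative run with mocked verifier outputs (no API calls):
print(keep_sample([Verdict(True, True), Verdict(True, True)]))   # True
print(keep_sample([Verdict(True, True), Verdict(True, False)]))  # False
```

In practice the verdicts would come from prompting the panel models; the point of the rule is that a sample survives only when independent strong reasoners agree on both validity and correctness.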
Models
We train Qwen3-4B-Thinking-2507 on CHIMERA through SFT followed by RL, yielding gains on nearly all major reasoning benchmarks:
| Benchmark | Qwen3-4B-Thinking-2507 | CHIMERA-4B-SFT | CHIMERA-4B-RL |
|---|---|---|---|
| GPQA-Diamond | 65.8 | 68.8 | 70.1 |
| AIME 2024 | 81.6 | 86.5 | 86.9 |
| AIME 2025 | 81.0 | 79.8 | 80.7 |
| AIME 2026 | 80.8 | 80.3 | 82.7 |
| HMMT Feb 2025 | 59.2 | 63.1 | 65.7 |
| HMMT Nov 2025 | 57.3 | 66.3 | 67.0 |
| HLE | 7.3 | 9.0 | 9.0 |
Models: CHIMERA-4B-SFT | CHIMERA-4B-RL
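As a quick sanity check on the table above, the per-benchmark improvement of the RL model over the base model can be computed directly. The numbers are copied from the table; the script is plain arithmetic, not part of the paper's pipeline.

```python
# Benchmark scores from the table above: (base, SFT, RL)
scores = {
    "GPQA-Diamond":  (65.8, 68.8, 70.1),
    "AIME 2024":     (81.6, 86.5, 86.9),
    "AIME 2025":     (81.0, 79.8, 80.7),
    "AIME 2026":     (80.8, 80.3, 82.7),
    "HMMT Feb 2025": (59.2, 63.1, 65.7),
    "HMMT Nov 2025": (57.3, 66.3, 67.0),
    "HLE":           (7.3, 9.0, 9.0),
}

# Improvement of CHIMERA-4B-RL over the untrained base model.
deltas = {name: round(rl - base, 1) for name, (base, _, rl) in scores.items()}
avg_gain = round(sum(deltas.values()) / len(deltas), 2)
print(deltas)    # AIME 2025 is the one benchmark with a small regression
print(avg_gain)
```

The output shows gains on six of seven benchmarks (largest on HMMT Nov 2025, +9.7) and a small regression on AIME 2025 (-0.3), which is why the gains are best described as near-uniform rather than universal.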