---
language:
  - en
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# PromptCoT-2.0-SelfPlay-4B

This model is part of **PromptCoT 2.0** ([Scaling Prompt Synthesis for Large Language Model Reasoning](https://arxiv.org/abs/2509.19894)).
It is a 4B model trained via **self-play**, where problems synthesized by PromptCoT 2.0 provide verifiable feedback (unit tests for code, boxed answers for math).
The training loop uses **Direct Preference Optimization (DPO)** to align generations with automatically verified outcomes, removing the dependence on stronger external teachers.
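
For reference, DPO optimizes the standard preference objective over a prompt $x$, a verified-correct response $y_w$ (chosen), and a failed response $y_l$ (rejected):

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is the frozen reference model, and $\beta$ is a scaling hyperparameter (this card does not state the specific values used).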

This model establishes new state-of-the-art performance at the 4B scale, consistently outperforming strong open-source baselines and curated datasets.
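
Since the card declares `library_name: transformers` and `pipeline_tag: text-generation`, here is a minimal generation sketch; the repository id below is an assumption based on the model name and should be replaced with the actual Hub path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id (placeholder); replace with the actual Hub path.
model_id = "PromptCoT-2.0-SelfPlay-4B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# A math prompt; the model is trained to end with a boxed final answer.
messages = [{"role": "user", "content": "Find the sum of all positive divisors of 36."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```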


## ✨ Highlights

- **Self-Play Training:** the model improves autonomously using synthetic math and code problems generated by PromptCoT 2.0. Positive/negative pairs are constructed from verifiable feedback signals (unit-test success / final-answer correctness); see the sketch after this list.
- **Strong Baseline Improvements:** outperforms Qwen3-4B-Thinking-2507 and surpasses curated datasets such as OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3 across all six benchmarks.
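
Below is a minimal sketch of how such preference pairs could be assembled from verifiable signals. All helper names (`extract_boxed_answer`, `run_unit_tests`, `build_dpo_pairs`) and the data layout are illustrative assumptions, not the authors' released pipeline:

```python
import re


def extract_boxed_answer(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a generation.

    Hypothetical helper; a production parser would also handle nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def run_unit_tests(code: str, tests: list[str]) -> bool:
    """Hypothetical stand-in for a sandboxed runner (never exec untrusted code directly)."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        for test in tests:  # each test is a statement such as "assert add(2, 3) == 5"
            exec(test, namespace)
        return True
    except Exception:
        return False


def verify(problem: dict, generation: str) -> bool:
    """Verifiable feedback: unit tests for code, boxed-answer match for math."""
    if problem["kind"] == "code":
        return run_unit_tests(generation, problem["tests"])
    return extract_boxed_answer(generation) == problem["answer"]


def build_dpo_pairs(problem: dict, generations: list[str]) -> list[dict]:
    """Pair verified-correct generations (chosen) with failed ones (rejected)."""
    verdicts = [verify(problem, g) for g in generations]
    chosen = [g for g, ok in zip(generations, verdicts) if ok]
    rejected = [g for g, ok in zip(generations, verdicts) if not ok]
    return [
        {"prompt": problem["prompt"], "chosen": c, "rejected": r}
        for c, r in zip(chosen, rejected)
    ]
```

Pairs in this `prompt`/`chosen`/`rejected` form match what common DPO implementations (e.g., TRL's `DPOTrainer`) expect; the exact construction used for this model is described in the paper.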


## 📊 Results

Evaluation on six benchmarks in the self-play setting at the 4B scale.
**Bold** = best, *italic* = second-best.

| Model | AIME 24 | AIME 25 | HMMT Feb 25 | LiveCodeBench v5 (2408–2502) | LiveCodeBench v6 (2502–2505) | Codeforces (Elo) |
|---|---|---|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 85.2 | 81.3 | 55.5 | 63.8 | 55.2 | 1852 |
| OpenCodeReasoning | 83.1 | 78.5 | 50.4 | 64.4 | *57.1* | 1867 |
| OpenMathReasoning | *85.3* | *83.0* | 56.8 | 59.7 | 48.5 | 1826 |
| OpenThoughts3 | 84.7 | 80.6 | 54.2 | *65.2* | 54.4 | 1846 |
| OpenR1 | 84.6 | 80.9 | 56.7 | 63.0 | 54.6 | 1829 |
| PromptCoT 1.0 | *85.3* | 81.8 | *58.6* | 64.5 | 56.7 | *1878* |
| PromptCoT 2.0 | **87.3** | **85.0** | **66.5** | **67.7** | **61.1** | **1934** |

## 🔮 Key Takeaways

- **Best across all six benchmarks:** PromptCoT 2.0 achieves the top score on AIME 24/25, HMMT Feb 25, LiveCodeBench v5/v6, and Codeforces.
- **Large gains on high-difficulty tasks:** +11.0 points on HMMT, +5.9 on LiveCodeBench v6, and +82 Elo on Codeforces over the Qwen3-4B-Thinking-2507 baseline.
- **Beyond curated baselines:** unlike OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3, which saturate on strong 4B bases, PromptCoT 2.0 continues to deliver significant improvements.

## 📂 Resources

- **Paper:** [PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning](https://arxiv.org/abs/2509.19894)

## 📜 Citation

If you find this model useful, please consider citing:

```bibtex
@article{zhao2025promptcot2,
  title     = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author    = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal   = {arXiv preprint arXiv:2509.19894},
  year      = {2025},
  url       = {https://arxiv.org/abs/2509.19894}
}
```