---
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# PromptCoT-2.0-SelfPlay-4B

This model is part of **PromptCoT 2.0** (*Scaling Prompt Synthesis for LLM Reasoning*). It is a **4B model trained via self-play**, where problems synthesized by PromptCoT 2.0 come with **verifiable feedback** (unit tests for code, boxed answers for math). The training loop uses **Direct Preference Optimization (DPO)** to align generations with automatically verified outcomes, removing the dependence on stronger external teachers.

The model establishes **new state-of-the-art performance at the 4B scale**, consistently outperforming strong open-source baselines and models trained on curated datasets.

---

## ✨ Highlights

- **Self-Play Training**: The model improves autonomously using **synthetic math & code problems** generated by PromptCoT 2.0. Positive/negative preference pairs are constructed from verifiable feedback signals (unit-test success for code, final-answer correctness for math); an illustrative sketch of this idea is given at the end of this card.
- **Strong Baseline Improvements**: Outperforms **Qwen3-4B-Thinking-2507** and models trained on curated datasets such as **OpenMathReasoning**, **OpenCodeReasoning**, and **OpenThoughts3** across all six benchmarks.

---

## 📊 Results

Evaluation on six benchmarks under the **self-play setting with 4B parameters**. **Bold = best**, *Italic = second-best*.

| Model                  | AIME 24  | AIME 25  | HMMT Feb 25 | LiveCodeBench v5 (2408–2502) | LiveCodeBench v6 (2502–2505) | Codeforces |
|------------------------|----------|----------|-------------|------------------------------|------------------------------|------------|
| Qwen3-4B-Thinking-2507 | 85.2     | 81.3     | 55.5        | 63.8                         | 55.2                         | 1852       |
| OpenCodeReasoning      | 83.1     | 78.5     | 50.4        | 64.4                         | *57.1*                       | 1867       |
| OpenMathReasoning      | *85.3*   | *83.0*   | 56.8        | 59.7                         | 48.5                         | 1826       |
| OpenThoughts3          | 84.7     | 80.6     | 54.2        | *65.2*                       | 54.4                         | 1846       |
| OpenR1                 | 84.6     | 80.9     | 56.7        | 63.0                         | 54.6                         | 1829       |
| PromptCoT 1.0          | *85.3*   | 81.8     | *58.6*      | 64.5                         | 56.7                         | *1878*     |
| **PromptCoT 2.0**      | **87.3** | **85.0** | **66.5**    | **67.7**                     | **61.1**                     | **1934**   |

---

## 🔮 Key Takeaways

* **Best across all six benchmarks**: PromptCoT 2.0 achieves the top score on AIME 24/25, HMMT Feb 25, LiveCodeBench v5/v6, and Codeforces.
* **Large gains on high-difficulty tasks**: +11.0 points on HMMT Feb 25, +5.9 points on LiveCodeBench v6, and +82 Elo on Codeforces over the Qwen3-4B-Thinking-2507 baseline.
* **Beyond curated baselines**: Unlike OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3, which saturate on strong 4B bases, PromptCoT 2.0 continues to deliver significant improvements.

---

## 📂 Resources

* 📄 Paper: [PromptCoT 2.0](https://arxiv.org/abs/2509.19894)
* 💻 GitHub: [inclusionAI/PromptCoT](https://github.com/inclusionAI/PromptCoT)
* 📊 Dataset: [PromptCoT-2.0-SelfPlay-4B-48K](https://huggingface.co/datasets/xl-zhao/PromptCoT-2.0-SelfPlay-4B-48K)

---

## 📜 Citation

If you find this model useful, please consider citing:

````bibtex
@article{zhao2025promptcot2,
  title   = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal = {arXiv preprint arXiv:2509.19894},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.19894}
}
````
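
---

## 🚀 Quick Start

A minimal text-generation sketch with Hugging Face Transformers. The repository id below is an assumption inferred from the dataset's naming (`xl-zhao/PromptCoT-2.0-SelfPlay-4B-48K`); replace it with the actual id of this model repository, and treat the sampling settings as illustrative rather than an official recommendation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id (inferred from the dataset naming); adjust if it differs.
model_id = "xl-zhao/PromptCoT-2.0-SelfPlay-4B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# An AIME-style prompt; the model is expected to reason at length and put
# the final answer in \boxed{}.
messages = [
    {
        "role": "user",
        "content": "Find the remainder when 7^2025 is divided by 1000. "
                   "Put the final answer in \\boxed{}.",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=4096,   # long reasoning traces need a generous budget
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

## 🔁 Preference Pairs from Verifiable Feedback (Illustrative Sketch)

The Highlights section describes building positive/negative pairs from verifiable feedback. The sketch below is a hypothetical simplification of that idea, not the released training pipeline: `build_dpo_pairs` and `verify` are names introduced here for illustration, with `verify` standing in for the automatic checker (running unit tests for code, matching the boxed answer for math), and the records following the `prompt`/`chosen`/`rejected` layout commonly used by DPO trainers.

```python
from itertools import islice
from typing import Callable, Dict, List


def build_dpo_pairs(
    prompt: str,
    candidates: List[str],
    verify: Callable[[str], bool],
    max_pairs: int = 4,
) -> List[Dict[str, str]]:
    """Pair verified and failed generations for one synthesized problem.

    `verify` is a placeholder for the automatic checker (unit tests for
    code, boxed-answer matching for math); the real self-play loop is
    more involved.
    """
    passed = [c for c in candidates if verify(c)]
    failed = [c for c in candidates if not verify(c)]
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": rejected}
        for chosen, rejected in islice(zip(passed, failed), max_pairs)
    ]
```

See the paper and the GitHub repository listed under Resources for the authoritative training recipe.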