---
language:
  - en
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# PromptCoT-2.0-SelfPlay-4B

This model is part of **PromptCoT 2.0** ([Scaling Prompt Synthesis for Large Language Model Reasoning](https://arxiv.org/abs/2509.19894)).
It is a 4B model trained via **self-play**, where problems synthesized by PromptCoT 2.0 provide verifiable feedback (unit tests for code, boxed answers for math).
The training loop uses **Direct Preference Optimization (DPO)** to align generations with automatically verified outcomes, removing the dependence on stronger external teachers.
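
For reference, DPO optimizes the standard preference objective over a prompt $x$, a verified-correct response $y_w$ (chosen), and a failed response $y_l$ (rejected):

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is the frozen reference model, and $\beta$ is a scaling hyperparameter (this card does not state the specific values used).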

This model establishes new state-of-the-art performance at the 4B scale, consistently outperforming strong open-source baselines and curated datasets.
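
Since the card declares `library_name: transformers` and `pipeline_tag: text-generation`, here is a minimal generation sketch; the repository id below is an assumption based on the model name and should be replaced with the actual Hub path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id (placeholder); replace with the actual Hub path.
model_id = "PromptCoT-2.0-SelfPlay-4B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# A math prompt; the model is trained to end with a boxed final answer.
messages = [{"role": "user", "content": "Find the sum of all positive divisors of 36."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```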


## ✨ Highlights

- **Self-Play Training:** the model improves autonomously using synthetic math and code problems generated by PromptCoT 2.0. Positive/negative pairs are constructed from verifiable feedback signals (unit-test success / final-answer correctness); see the sketch after this list.
- **Strong Baseline Improvements:** outperforms Qwen3-4B-Thinking-2507 and surpasses curated datasets such as OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3 across all six benchmarks.
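
Below is a minimal sketch of how such preference pairs could be assembled from verifiable signals. All helper names (`extract_boxed_answer`, `run_unit_tests`, `build_dpo_pairs`) and the data layout are illustrative assumptions, not the authors' released pipeline:

```python
import re


def extract_boxed_answer(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a generation.

    Hypothetical helper; a production parser would also handle nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def run_unit_tests(code: str, tests: list[str]) -> bool:
    """Hypothetical stand-in for a sandboxed runner (never exec untrusted code directly)."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        for test in tests:  # each test is a statement such as "assert add(2, 3) == 5"
            exec(test, namespace)
        return True
    except Exception:
        return False


def verify(problem: dict, generation: str) -> bool:
    """Verifiable feedback: unit tests for code, boxed-answer match for math."""
    if problem["kind"] == "code":
        return run_unit_tests(generation, problem["tests"])
    return extract_boxed_answer(generation) == problem["answer"]


def build_dpo_pairs(problem: dict, generations: list[str]) -> list[dict]:
    """Pair verified-correct generations (chosen) with failed ones (rejected)."""
    verdicts = [verify(problem, g) for g in generations]
    chosen = [g for g, ok in zip(generations, verdicts) if ok]
    rejected = [g for g, ok in zip(generations, verdicts) if not ok]
    return [
        {"prompt": problem["prompt"], "chosen": c, "rejected": r}
        for c, r in zip(chosen, rejected)
    ]
```

Pairs in this `prompt`/`chosen`/`rejected` form match what common DPO implementations (e.g., TRL's `DPOTrainer`) expect; the exact construction used for this model is described in the paper.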


## 📊 Results

Evaluation on six benchmarks in the self-play setting at the 4B scale.
**Bold** = best, *italic* = second-best.

| Model | AIME 24 | AIME 25 | HMMT Feb 25 | LiveCodeBench v5 (2408–2502) | LiveCodeBench v6 (2502–2505) | Codeforces (Elo) |
|---|---|---|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 85.2 | 81.3 | 55.5 | 63.8 | 55.2 | 1852 |
| OpenCodeReasoning | 83.1 | 78.5 | 50.4 | 64.4 | *57.1* | 1867 |
| OpenMathReasoning | *85.3* | *83.0* | 56.8 | 59.7 | 48.5 | 1826 |
| OpenThoughts3 | 84.7 | 80.6 | 54.2 | *65.2* | 54.4 | 1846 |
| OpenR1 | 84.6 | 80.9 | 56.7 | 63.0 | 54.6 | 1829 |
| PromptCoT 1.0 | *85.3* | 81.8 | *58.6* | 64.5 | 56.7 | *1878* |
| PromptCoT 2.0 | **87.3** | **85.0** | **66.5** | **67.7** | **61.1** | **1934** |

## 🔮 Key Takeaways

- **Best across all six benchmarks:** PromptCoT 2.0 achieves the top score on AIME 24/25, HMMT Feb 25, LiveCodeBench v5/v6, and Codeforces.
- **Large gains on high-difficulty tasks:** +11.0 points on HMMT, +5.9 on LiveCodeBench v6, and +82 Elo on Codeforces over the Qwen3-4B-Thinking-2507 baseline.
- **Beyond curated baselines:** unlike OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3, which saturate on strong 4B bases, PromptCoT 2.0 continues to deliver significant improvements.

## 📂 Resources

- **Paper:** [PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning](https://arxiv.org/abs/2509.19894)

## 📜 Citation

If you find this model useful, please consider citing:

```bibtex
@article{zhao2025promptcot2,
  title     = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author    = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal   = {arXiv preprint arXiv:2509.19894},
  year      = {2025},
  url       = {https://arxiv.org/abs/2509.19894}
}
```