---
license: mit
language:
- en
---
# PromptCoT-2.0-SelfPlay-4B

This model is part of **PromptCoT 2.0** (*Scaling Prompt Synthesis for LLM Reasoning*).
It is a **4B model trained via self-play**, where synthesized problems from PromptCoT 2.0 provide **verifiable feedback** (unit tests for code, boxed answers for math).
The training loop uses **Direct Preference Optimization (DPO)** to align generations with automatically verified outcomes, removing the dependence on stronger external teachers.

This model establishes **new state-of-the-art performance at the 4B scale**, consistently outperforming strong open-source baselines and curated datasets.

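For quick experimentation, a minimal 🤗 Transformers snippet is sketched below. The repository id and the sampling settings are assumptions inferred from this card (this is not an official quick start); adjust them to the actual checkpoint.

````python
# Minimal inference sketch -- the repo id and sampling settings below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "xl-zhao/PromptCoT-2.0-SelfPlay-4B"  # assumed from this card's title

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Find the remainder when 7^2025 is divided by 100. Put the final answer in \\boxed{}."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
````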
---

## ✨ Highlights

- **Self-Play Training**:
  The model improves autonomously using **synthetic math & code problems** generated by PromptCoT 2.0.
  Positive/negative pairs are constructed from verifiable feedback signals (unit test success / final answer correctness); a minimal sketch of this pairing step follows the list.

- **Strong Baseline Improvements**:
  Outperforms **Qwen3-4B-Thinking-2507** and surpasses curated datasets such as **OpenMathReasoning**, **OpenCodeReasoning**, and **OpenThoughts3** across all six benchmarks.

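As a concrete illustration of the pairing step above, the sketch below scores sampled solutions to a math problem by checking the final boxed answer and arranges them into `{"prompt", "chosen", "rejected"}` records, the usual input format for DPO training. The function names, record layout, and exact-match rule are illustrative assumptions, not the project's actual pipeline.

````python
# Illustrative only: build DPO preference pairs from verifiable feedback on a math problem.
# The record layout and answer-matching rule are assumptions, not the PromptCoT 2.0 pipeline.
import re
from itertools import product

def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def is_correct(completion: str, reference: str) -> bool:
    """Verifiable feedback for math: exact match on the final boxed answer."""
    answer = extract_boxed(completion)
    return answer is not None and answer == reference.strip()

def build_dpo_pairs(prompt: str, completions: list[str], reference: str) -> list[dict]:
    """Pair every verified-correct completion with every verified-incorrect one."""
    chosen = [c for c in completions if is_correct(c, reference)]
    rejected = [c for c in completions if not is_correct(c, reference)]
    return [{"prompt": prompt, "chosen": pos, "rejected": neg} for pos, neg in product(chosen, rejected)]

# Example: two sampled solutions, one verified correct and one not -> one preference pair.
pairs = build_dpo_pairs(
    prompt="Compute 2 + 3. Put the final answer in \\boxed{}.",
    completions=["2 + 3 = 5, so the answer is \\boxed{5}.", "The answer is \\boxed{6}."],
    reference="5",
)
print(len(pairs))  # 1
````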
---

## 📊 Results

Evaluation on six benchmarks under the **self-play setting with 4B parameters**.
**Bold = best**, *Italic = second-best*.

| Model                  | AIME 24  | AIME 25  | HMMT Feb 25 | LiveCodeBench v5 (2408–2502) | LiveCodeBench v6 (2502–2505) | Codeforces |
|------------------------|----------|----------|-------------|------------------------------|------------------------------|------------|
| Qwen3-4B-Thinking-2507 | 85.2     | 81.3     | 55.5        | 63.8                         | 55.2                         | 1852       |
| OpenCodeReasoning      | 83.1     | 78.5     | 50.4        | 64.4                         | *57.1*                       | 1867       |
| OpenMathReasoning      | *85.3*   | *83.0*   | 56.8        | 59.7                         | 48.5                         | 1826       |
| OpenThoughts3          | 84.7     | 80.6     | 54.2        | *65.2*                       | 54.4                         | 1846       |
| OpenR1                 | 84.6     | 80.9     | 56.7        | 63.0                         | 54.6                         | 1829       |
| PromptCoT 1.0          | *85.3*   | 81.8     | *58.6*      | 64.5                         | 56.7                         | *1878*     |
| **PromptCoT 2.0**      | **87.3** | **85.0** | **66.5**    | **67.7**                     | **61.1**                     | **1934**   |

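On the code side, both the unit-test feedback signal described earlier and benchmarks such as LiveCodeBench and Codeforces rely on executing generated programs against test cases. The sketch below checks a candidate Python program against stdin/stdout tests in a subprocess; the test format, time limit, and absence of sandboxing are simplifying assumptions, and it is not the official harness of this project or of either benchmark.

````python
# Illustrative only: execution-based check of a candidate program against stdin/stdout tests.
# No sandboxing here -- never run untrusted model output this way outside an isolated environment.
import subprocess
import sys

def passes_unit_tests(program: str, tests: list[tuple[str, str]], timeout_s: float = 5.0) -> bool:
    """Run `program` once per (stdin, expected_stdout) pair and require every output to match."""
    for stdin_data, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

# Example: a generated solution that reads two integers and prints their sum.
candidate = "a, b = map(int, input().split())\nprint(a + b)"
print(passes_unit_tests(candidate, tests=[("1 2", "3"), ("10 -4", "6")]))  # True
````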
---

## 🔮 Key Takeaways

* **Best across all six benchmarks**: PromptCoT 2.0 achieves top scores on AIME 24/25, HMMT Feb 25, LiveCodeBench v5/v6, and Codeforces.
* **Large gains on high-difficulty tasks**: +11.0 points on HMMT Feb 25, +5.9 on LiveCodeBench v6, and +82 Elo on Codeforces over the Qwen3-4B-Thinking-2507 base model.
* **Beyond curated baselines**: Unlike OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3, which saturate on strong 4B bases, PromptCoT 2.0 continues to deliver significant improvements.

---

## 📂 Resources

* 📄 Paper: [PromptCoT 2.0](https://arxiv.org/abs/2509.19894)
* 💻 GitHub: [inclusionAI/PromptCoT](https://github.com/inclusionAI/PromptCoT)
* 📊 Dataset: [PromptCoT-2.0-SelfPlay-4B-48K](https://huggingface.co/datasets/xl-zhao/PromptCoT-2.0-SelfPlay-4B-48K)

---

## 📜 Citation

If you find this model useful, please consider citing:

````bibtex
@article{zhao2025promptcot2,
  title   = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal = {arXiv preprint arXiv:2509.19894},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.19894}
}
````