# Introducing Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3
This model extends Qwen2.5-Math-1.5B by training in two stages:
- Supervised Fine-Tuning (SFT) on high-quality, self-verified GSM8K chains-of-thought. Checkpoint: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10
- Direct Preference Optimization (DPO) using correct and incorrect answer pairs. (this model)
## Evaluation (GSM8K Pass@1)
| Model | Pass@1 |
|---|---|
| Base Qwen2.5-Math-1.5B | ~54% |
| SFT checkpoint | ~67.5% |
| After SFT + DPO (this model) | ~70% |
Evaluation was run on the GSM8K test split with `temperature=0.1`, `top_p=1.0`, and greedy final-answer extraction; a sketch of this setup follows.
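A minimal sketch of that evaluation loop, assuming the tokenizer's chat template, an `<answer>...</answer>` block in the completion, and a simple regex-based extractor (the exact evaluation script is not published):

```python
# Hedged sketch of the GSM8K pass@1 evaluation described above.
# Assumptions (not from the card): chat-style prompting and regex answer extraction.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def solve(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    prompt_ids = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(
        prompt_ids,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.1,  # settings reported in the card
        top_p=1.0,
    )
    completion = tokenizer.decode(out[0][prompt_ids.shape[-1]:], skip_special_tokens=True)
    # Final-answer extraction: prefer the <answer> block, then take the last number.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    target = m.group(1) if m else completion
    numbers = re.findall(r"-?\d+(?:\.\d+)?", target.replace(",", ""))
    return numbers[-1] if numbers else ""
```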
## Training Summary
### Stage 1: Supervised Fine-Tuning (SFT)
- Dataset: curated GSM8K subset with verified chains-of-thought (Math-Verify)
- Epochs: 10
- Learning rate: 3e-6
- Batch size: 4
- Gradient accumulation: 4
- Data process (see the filtering sketch after this list):
  - For each GSM8K problem, multiple candidate solutions were generated
  - Samples were automatically checked with Math-Verify
  - Only correct, well-formatted answers were kept as SFT targets
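A minimal sketch of this verification step, assuming the open-source `math_verify` package and a hypothetical `generate_samples` helper; the author's exact sampling and formatting checks are not published:

```python
# Hedged sketch: keep only generations whose final answer matches the GSM8K gold
# answer under Math-Verify. `generate_samples` is a hypothetical helper that
# returns candidate chains-of-thought for a question.
from math_verify import parse, verify

def build_sft_targets(problems, generate_samples):
    sft_examples = []
    for problem in problems:
        # GSM8K gold answers end with "#### <number>"; parse just that part.
        gold = parse(problem["answer"].split("####")[-1])
        for completion in generate_samples(problem["question"]):
            # Basic format check (assumption): require an explicit <answer> block.
            if "<answer>" not in completion or "</answer>" not in completion:
                continue
            predicted = parse(completion)
            if verify(gold, predicted):  # only verified-correct answers become SFT targets
                sft_examples.append(
                    {"prompt": problem["question"], "completion": completion}
                )
    return sft_examples
```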
### Stage 2: Direct Preference Optimization (DPO)
- Dataset size: ~1,000 preference pairs
- Composition: mostly hard pairs (a correct CoT as the chosen response paired with an incorrect CoT as the rejected response), plus a small number of soft pairs in which a shorter correct CoT is preferred over a longer one
- Pair construction:
  - From the SFT model, 4 samples were generated per question from the GSM8K train set
  - Chosen: a correct answer
  - Rejected: an incorrect or formatting-invalid answer
- Parameters (a hedged training sketch follows this list):
  - Epochs: 3
  - β (DPO beta): 0.1
  - Learning rate: 3e-6
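A hedged sketch of this stage using TRL's `DPOTrainer`. The beta, epoch count, and learning rate come from the list above; the `pairs.jsonl` file name, the batch settings, and the `prompt`/`chosen`/`rejected` dataset layout are assumptions, and the author's actual training script is not published:

```python
# Hedged sketch of Stage 2 with TRL's DPOTrainer.
# `pairs.jsonl` is a hypothetical file of ~1,000 {"prompt", "chosen", "rejected"} records.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10"
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)
dataset = load_dataset("json", data_files="pairs.jsonl", split="train")

config = DPOConfig(
    output_dir="qwen2.5-math-1.5b-gsm8k-dpo",
    beta=0.1,                        # DPO beta from the card
    num_train_epochs=3,              # epochs from the card
    learning_rate=3e-6,              # LR from the card
    per_device_train_batch_size=4,   # assumed; the card only reports batch settings for SFT
    gradient_accumulation_steps=4,   # assumed, see above
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,      # `tokenizer=` in older TRL releases
)
trainer.train()
```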
## What This Model Improves
- Slightly higher accuracy than arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10 through DPO
- Cleaner and more stable generation of `<think>` and `<answer>` blocks (see the parsing sketch below)
- Slightly shorter chains-of-thought than arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10 due to soft length preferences
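A small, hedged helper for consuming that output format; the assumption is simply that completions wrap reasoning in `<think>...</think>` and the result in `<answer>...</answer>`:

```python
# Hedged sketch: split a completion into its <think> and <answer> blocks.
import re

def split_blocks(completion: str) -> tuple[str, str]:
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else "",
    )

# Example with a hypothetical completion:
reasoning, final = split_blocks("<think>2 + 2 = 4</think><answer>4</answer>")
assert final == "4"
```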
## Model tree for arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3
- Base model: Qwen/Qwen2.5-1.5B
- Finetuned: Qwen/Qwen2.5-Math-1.5B
- Finetuned: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10