# Introducing Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3
This model extends Qwen2.5-Math-1.5B by training in two stages:
- Supervised Fine-Tuning (SFT) on high-quality, self-verified GSM8K chains-of-thought. Checkpoint: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10
- Direct Preference Optimization (DPO) using correct and incorrect answer pairs. (this model)
## Evaluation (GSM8K Pass@1)
| Model | Pass@1 |
|---|---|
| Base Qwen2.5-Math-1.5B | ~54% |
| SFT checkpoint | ~67.5% |
| After SFT + DPO (this model) | ~70% |
Evaluation was run on the GSM8K test split with `temperature=0.1`, `top_p=1.0`, and greedy final-answer extraction; a sketch of this setup follows.
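A minimal sketch of that evaluation loop, assuming the tokenizer's chat template, an `<answer>...</answer>` block in the completion, and a simple regex-based extractor (the exact evaluation script is not published):

```python
# Hedged sketch of the GSM8K pass@1 evaluation described above.
# Assumptions (not from the card): chat-style prompting and regex answer extraction.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def solve(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    prompt_ids = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(
        prompt_ids,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.1,  # settings reported in the card
        top_p=1.0,
    )
    completion = tokenizer.decode(out[0][prompt_ids.shape[-1]:], skip_special_tokens=True)
    # Final-answer extraction: prefer the <answer> block, then take the last number.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    target = m.group(1) if m else completion
    numbers = re.findall(r"-?\d+(?:\.\d+)?", target.replace(",", ""))
    return numbers[-1] if numbers else ""
```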
## Training Summary
### Stage 1: Supervised Fine-Tuning (SFT)
- Dataset: curated GSM8K subset with verified chains-of-thought (Math-Verify)
- Epochs: 10
- Learning rate: 3e-6
- Batch size: 4
- Gradient accumulation: 4
- Data process (see the filtering sketch after this list):
  - For each GSM8K problem, multiple candidate solutions were generated
  - Samples were automatically checked with Math-Verify
  - Only correct, well-formatted answers were kept as SFT targets
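A minimal sketch of this verification step, assuming the open-source `math_verify` package and a hypothetical `generate_samples` helper; the author's exact sampling and formatting checks are not published:

```python
# Hedged sketch: keep only generations whose final answer matches the GSM8K gold
# answer under Math-Verify. `generate_samples` is a hypothetical helper that
# returns candidate chains-of-thought for a question.
from math_verify import parse, verify

def build_sft_targets(problems, generate_samples):
    sft_examples = []
    for problem in problems:
        # GSM8K gold answers end with "#### <number>"; parse just that part.
        gold = parse(problem["answer"].split("####")[-1])
        for completion in generate_samples(problem["question"]):
            # Basic format check (assumption): require an explicit <answer> block.
            if "<answer>" not in completion or "</answer>" not in completion:
                continue
            predicted = parse(completion)
            if verify(gold, predicted):  # only verified-correct answers become SFT targets
                sft_examples.append(
                    {"prompt": problem["question"], "completion": completion}
                )
    return sft_examples
```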
### Stage 2: Direct Preference Optimization (DPO)
- Dataset size: ~1,000 preference pairs
- Composition: mostly hard pairs (a correct CoT as the chosen response paired with an incorrect CoT as the rejected response), plus a small number of soft pairs in which a shorter correct CoT is preferred over a longer one
- Pair construction:
  - From the SFT model, 4 samples were generated per question from the GSM8K train set
  - Chosen: a correct answer
  - Rejected: an incorrect or formatting-invalid answer
- Parameters (a hedged training sketch follows this list):
  - Epochs: 3
  - β (DPO beta): 0.1
  - Learning rate: 3e-6
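A hedged sketch of this stage using TRL's `DPOTrainer`. The beta, epoch count, and learning rate come from the list above; the `pairs.jsonl` file name, the batch settings, and the `prompt`/`chosen`/`rejected` dataset layout are assumptions, and the author's actual training script is not published:

```python
# Hedged sketch of Stage 2 with TRL's DPOTrainer.
# `pairs.jsonl` is a hypothetical file of ~1,000 {"prompt", "chosen", "rejected"} records.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10"
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)
dataset = load_dataset("json", data_files="pairs.jsonl", split="train")

config = DPOConfig(
    output_dir="qwen2.5-math-1.5b-gsm8k-dpo",
    beta=0.1,                        # DPO beta from the card
    num_train_epochs=3,              # epochs from the card
    learning_rate=3e-6,              # LR from the card
    per_device_train_batch_size=4,   # assumed; the card only reports batch settings for SFT
    gradient_accumulation_steps=4,   # assumed, see above
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,      # `tokenizer=` in older TRL releases
)
trainer.train()
```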
## What This Model Improves
- Slightly higher accuracy than arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10 through DPO
- Cleaner and more stable generation of `<think>` and `<answer>` blocks (see the parsing sketch below)
- Slightly shorter chains-of-thought than arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10 due to soft length preferences
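A small, hedged helper for consuming that output format; the assumption is simply that completions wrap reasoning in `<think>...</think>` and the result in `<answer>...</answer>`:

```python
# Hedged sketch: split a completion into its <think> and <answer> blocks.
import re

def split_blocks(completion: str) -> tuple[str, str]:
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else "",
    )

# Example with a hypothetical completion:
reasoning, final = split_blocks("<think>2 + 2 = 4</think><answer>4</answer>")
assert final == "4"
```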
## Model tree for arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3
- Base model: Qwen/Qwen2.5-1.5B
- Finetuned: Qwen/Qwen2.5-Math-1.5B
- Finetuned: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10