Introducing Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3

This model extends Qwen2.5-Math-1.5B by training in two stages:

  1. Supervised Fine-Tuning (SFT) on high-quality, self-verified GSM8K chains-of-thought. Checkpoint: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10
  2. Direct Preference Optimization (DPO) using correct and incorrect answer pairs. (this model)
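For reference, a minimal inference sketch using the transformers library; the sampling settings mirror the evaluation config below, and the prompt is just an illustrative GSM8K-style question, not a prescribed template:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = (
    "Natalia sold clips to 48 of her friends in April, and then she sold half as many "
    "clips in May. How many clips did Natalia sell altogether in April and May? "
    "Let's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.1,
    top_p=1.0,
)
# Print only the newly generated tokens (the chain-of-thought and final answer).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```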

Evaluation (GSM8K Pass@1)

| Model | GSM8K Pass@1 |
|---|---|
| Base Qwen2.5-Math-1.5B | ~54% |
| SFT checkpoint | ~67.5% |
| After SFT + DPO (this model) | ~70% |

Evaluation was run on the GSM8K test split with temperature=0.1, top_p=1.0, and greedy final-answer extraction.
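The exact evaluation harness is not part of this card. The sketch below shows one way the greedy final-answer extraction and Pass@1 scoring could be implemented, assuming GSM8K gold answers in the usual "#### <number>" format:

```python
import re

def extract_final_answer(completion: str) -> str | None:
    # Greedy extraction: take the last number-like token in the completion.
    matches = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_gold(answer_field: str) -> str:
    # GSM8K gold answers follow the reasoning after a "####" marker.
    return answer_field.split("####")[-1].strip().replace(",", "")

def pass_at_1(completions: list[str], gold_answers: list[str]) -> float:
    # One sampled completion per problem; count exact matches on the extracted answer.
    correct = sum(
        extract_final_answer(c) == gsm8k_gold(g)
        for c, g in zip(completions, gold_answers)
    )
    return correct / len(gold_answers)
```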

Training Summary

Stage 1 – Supervised Fine-Tuning (SFT)

  • Dataset: curated GSM8K subset with verified chains-of-thought (Math-Verify)
  • Epochs: 10
  • Learning rate: 3e-6
  • Batch size: 4
  • Gradient accumulation: 4
  • Data process (see the sketch after this list):
    • For each GSM8K problem, samples were generated
    • Samples were automatically checked using Math-Verify
    • Only correct & well-formatted answers were used as SFT targets
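A minimal sketch of that filtering loop, assuming the Math-Verify parse/verify interface; generate_samples is a hypothetical sampling helper, and the exact prompting and formatting checks used for this model are not specified here:

```python
from math_verify import parse, verify

def build_sft_targets(problems, generate_samples, num_samples):
    # problems: iterable of dicts with GSM8K-style "question" and "answer" fields.
    sft_examples = []
    for problem in problems:
        # The gold answer is the text after the "####" marker in GSM8K.
        gold = parse(problem["answer"].split("####")[-1].strip())
        for sample in generate_samples(problem["question"], n=num_samples):
            # Keep only samples whose answer Math-Verify confirms against the gold.
            if verify(gold, parse(sample)):
                sft_examples.append(
                    {"prompt": problem["question"], "completion": sample}
                )
    return sft_examples
```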

Stage 2 – Direct Preference Optimization (DPO)

  • Dataset size: ~1,000 preference pairs
  • Composition: mostly hard pairs (correct CoT as chosen, incorrect CoT as rejected), plus a small number of soft pairs in which the shorter correct CoT is preferred
  • Pair construction:
    • For each question in the GSM8K train set, 4 samples were generated from the SFT model
    • Chosen: correct answer
    • Rejected: incorrect or formatting-invalid answer
  • Parameters (a reproduction sketch follows this list):
    • Epochs: 3
    • β (DPO beta): 0.1
    • LR: 3e-6
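The card does not name a training framework. If reproducing with TRL's DPOTrainer (an assumption), the listed hyperparameters would map roughly as below; the preference-pair file path is hypothetical, and the dataset is expected to have "prompt", "chosen", and "rejected" columns:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the SFT checkpoint released alongside this model.
model_id = "arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference pairs with "prompt", "chosen", "rejected" columns (file name is hypothetical).
pairs = load_dataset("json", data_files="gsm8k_dpo_pairs.json", split="train")

config = DPOConfig(
    output_dir="qwen2.5-math-1.5b-gsm8k-dpo",
    beta=0.1,             # DPO beta from the card
    learning_rate=3e-6,   # LR from the card
    num_train_epochs=3,   # epochs from the card
)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # "tokenizer=" in older TRL releases
)
trainer.train()
```

Here beta controls how strongly the policy is regularized toward the (implicit) SFT reference model; larger values keep the DPO model closer to the SFT checkpoint.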

What This Model Improves

Relative to the SFT checkpoint, the DPO stage lifts GSM8K Pass@1 from roughly 67.5% to roughly 70%, for a total gain of about 16 points over the base Qwen2.5-Math-1.5B (~54%).
