arxiv:2604.14164

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

Published on Mar 23 · Submitted by Zixian Huang on Apr 17
Abstract

A teacher-student cooperation framework synthesizes SFT data that avoids stylistic divergence from the student, improving fine-tuning performance.

AI-generated summary

A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even cause a substantial drop in performance. In this work, we identify the stylistic divergence between teacher-generated data and the student's own distribution as a major factor degrading SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while remaining stylistically consistent with the student's distribution. In experiments on code generation with GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%, respectively.

Community


🚀 Motivation

Training reasoning models (e.g., Qwen3) is highly sensitive to the data distribution. We observe that:

❗ Using off-policy data (e.g., directly from a strong teacher model) for SFT can lead to severe catastrophic forgetting, especially for complex reasoning tasks.


💡 Key Idea

To address this critical issue, we propose TESSY, a novel Teacher–Student Cooperative Data Synthesis framework designed to generate on-policy training data. Instead of relying on a teacher model to fully generate training samples, TESSY decouples the generation process into two distinct parts:

  • 🧠 Teacher model → specializes in generating capability tokens.
  • ✍️ Student model → focuses on generating style tokens (e.g., "Hmm", "Wait").

This cooperative approach ensures:

  • Alignment with student distribution (on-policy): The synthesized data is tailored to the student model's own generation patterns.
  • Preservation of teacher reasoning quality: The teacher's advanced reasoning capabilities are effectively leveraged and maintained.
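The decoupled generation above can be sketched as a toy interleaved decoding loop. This is a hypothetical illustration, not the paper's actual algorithm: the routing rule (here, a fixed marker set and a simple "student opens each thought" schedule), the `STYLE_TOKENS` set, and the stub `student_next`/`teacher_next` functions are all assumptions standing in for real model calls.

```python
# Toy sketch of teacher-student cooperative data synthesis (hypothetical).
# Style tokens (e.g., "Hmm", "Wait") come from the student, so the sequence
# matches the student's distribution; capability tokens come from the teacher.

STYLE_TOKENS = {"Hmm", "Wait", "Okay"}  # assumed marker set


def student_next(prefix: str) -> str:
    """Stand-in for the student model: emits a style token."""
    return "Hmm"


def teacher_next(prefix: str) -> str:
    """Stand-in for the teacher model: emits a reasoning (capability) token."""
    return "step"


def synthesize(prompt: str, max_tokens: int = 8) -> str:
    """Interleave teacher and student to build one synthetic training sequence."""
    tokens = []
    for i in range(max_tokens):
        # Hypothetical routing: the student opens every fourth position with
        # a style token, then hands off to the teacher for reasoning content.
        if i % 4 == 0:
            tok = student_next(prompt + " " + " ".join(tokens))
            assert tok in STYLE_TOKENS
        else:
            tok = teacher_next(prompt + " " + " ".join(tokens))
        tokens.append(tok)
    return " ".join(tokens)


print(synthesize("Solve x + 1 = 2."))
```

In a real implementation, the routing decision would presumably compare model-specific signals (e.g., whether the next token belongs to the student's discourse-marker vocabulary) rather than a fixed schedule.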


