arxiv:2604.14164

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

Published on Mar 23 · Submitted by Zixian Huang on Apr 17
Abstract

A teacher-student cooperation framework synthesizes SFT data that avoids stylistic divergence from the student, improving fine-tuning performance.

AI-generated summary

A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even cause a substantial drop in performance. In this work, we identify the stylistic divergence between teacher-generated data and the student's own distribution as a major factor degrading SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while remaining stylistically consistent with the student's distribution. In experiments on code generation with GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%, respectively.

Community


🚀 Motivation

Training reasoning models (e.g., Qwen3) is highly sensitive to the data distribution. We observe that:

❗ Using off-policy data (e.g., directly from a strong teacher model) for SFT can lead to severe catastrophic forgetting, especially for complex reasoning tasks.


💡 Key Idea

To address this critical issue, we propose TESSY, a novel Teacher–Student Cooperative Data Synthesis framework designed to generate on-policy training data. Instead of relying on a teacher model to fully generate training samples, TESSY decouples the generation process into two distinct parts:

  • 🧠 Teacher model → specializes in generating capability tokens.
  • ✍️ Student model → focuses on generating style tokens (e.g., "Hmm", "Wait").

This cooperative approach ensures:

  • Alignment with student distribution (on-policy): The synthesized data is tailored to the student model's own generation patterns.
  • Preservation of teacher reasoning quality: The teacher's advanced reasoning capabilities are effectively leveraged and maintained.
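The decoupled generation above can be sketched as a toy interleaved decoding loop. This is a hypothetical illustration, not the paper's actual algorithm: the routing rule (here, a fixed marker set and a simple "student opens each thought" schedule), the `STYLE_TOKENS` set, and the stub `student_next`/`teacher_next` functions are all assumptions standing in for real model calls.

```python
# Toy sketch of teacher-student cooperative data synthesis (hypothetical).
# Style tokens (e.g., "Hmm", "Wait") come from the student, so the sequence
# matches the student's distribution; capability tokens come from the teacher.

STYLE_TOKENS = {"Hmm", "Wait", "Okay"}  # assumed marker set


def student_next(prefix: str) -> str:
    """Stand-in for the student model: emits a style token."""
    return "Hmm"


def teacher_next(prefix: str) -> str:
    """Stand-in for the teacher model: emits a reasoning (capability) token."""
    return "step"


def synthesize(prompt: str, max_tokens: int = 8) -> str:
    """Interleave teacher and student to build one synthetic training sequence."""
    tokens = []
    for i in range(max_tokens):
        # Hypothetical routing: the student opens every fourth position with
        # a style token, then hands off to the teacher for reasoning content.
        if i % 4 == 0:
            tok = student_next(prompt + " " + " ".join(tokens))
            assert tok in STYLE_TOKENS
        else:
            tok = teacher_next(prompt + " " + " ".join(tokens))
        tokens.append(tok)
    return " ".join(tokens)


print(synthesize("Solve x + 1 = 2."))
```

In a real implementation, the routing decision would presumably compare model-specific signals (e.g., whether the next token belongs to the student's discourse-marker vocabulary) rather than a fixed schedule.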


