
📦 Model Card: Jackrong/GPT-Distill-Qwen3-4B-Thinking-GGUF

| Key Property | Value |
|---|---|
| Model ID | Jackrong/GPT-Distill-Qwen3-4B-Thinking-GGUF |
| License | apache-2.0 |
| Author(s) | Jackrong, gpt‑oss team, Qwen authors |
| Base Model | gpt-oss-120b-high (complex reasoning dataset distilled) |
| Target Size | ~4B parameters (Qwen3‑4B distilled version) |

🔍 Overview

A deeply distilled and fine-tuned model derived from the large‑language model gpt-oss-120b-high, optimized for human‑friendly, high‑fidelity reasoning. It preserves the source model's multi‑step thinking patterns while compressing them onto a lightweight 4B‑parameter backbone (the “Distill‑Qwen3” architecture). Its signature feature is an explicit point‑by‑point thought chain that makes intricate logic transparent and easy to follow, making it well suited to education, technical support, and analytical tasks.

💡 Think of it as the “thinking mode” you’d expect from a massive model, packed into a compact 4B‑parameter checkpoint.
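
Below is a minimal inference sketch. It assumes the safetensors weights load through the standard transformers `AutoTokenizer` / `AutoModelForCausalLM` workflow and that the tokenizer ships a chat template; the repository id is taken from the card header (the safetensors build may live under a separate repo), and the prompt and sampling settings are illustrative, not recommendations from the card.

```python
# Minimal inference sketch (assumptions noted above); sampling values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Jackrong/GPT-Distill-Qwen3-4B-Thinking-GGUF"  # repo id from the card header

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the published BF16 tensors
    device_map="auto",
)

messages = [{"role": "user", "content": "Why is the sum of two even numbers always even?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.6)
# Print only the newly generated tokens (the bullet-point thought chain plus the answer).
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```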


🛠️ Technical Details

| Aspect | Specification |
|---|---|
| Source Model | gpt-oss‑120b‑high (complex reasoning dataset distilled) |
| Distillation Target | Qwen3‑4B architecture |
| Supervised Fine‑Tuning (SFT) | ~30,000 examples drawn from the source’s high‑fidelity reasoning corpus |
| Training Hardware | Single NVIDIA H100 GPU |
| Max Context Length | 32,768 tokens, enabling multi‑paragraph, long‑form reasoning without truncation |
| Reasoning Style | ✅ Default: bullet‑point “thought chain” output (e.g., `• Step 1 → …`, `• Step 2 → …`) |
| Configuration | Value | Why this choice matters |
|---|---|---|
| Learning rate | 2.0 × 10⁻⁵ | A moderate schedule balances fast convergence on high‑fidelity reasoning tasks against overshooting the distilled architecture’s optimal weights. |
| Per‑device train batch size | 8 | A batch of 8 long‑context sequences per device fits within H100 memory while keeping throughput high. |
| Gradient accumulation steps | 8 (effective batch size = 64) | Compensates for limited device memory while keeping per‑step compute overhead low; an effective batch of 64 examples provides a meaningful gradient signal without overwhelming VRAM. |
| Max context length | 32,768 tokens | Enables multi‑paragraph, long‑form reasoning without truncation, which is essential for complex technical queries and educational walkthroughs. |
| Total training examples (SFT) | ~52,000 high‑fidelity reasoning samples | Focuses on preserving intricate logic patterns rather than broad coverage of generic chat data. |

💡 These settings were tuned specifically for the 4B‑parameter distilled architecture and the NVIDIA H100‑80GB hardware we used during training, yielding a clear improvement in stepwise reasoning quality over vanilla distillation.
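
For reference, the sketch below maps the reported hyperparameters onto transformers `TrainingArguments`. It is an illustration of the table above, not the actual training script: the output directory, epoch count, and logging cadence are placeholders.

```python
# Sketch only: expresses the hyperparameters reported above as TrainingArguments.
# Fields marked "placeholder" are assumptions, not values from the original run.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen3-4b-distill-sft",  # placeholder output path
    learning_rate=2e-5,                  # moderate LR from the table above
    per_device_train_batch_size=8,       # long-context batches on one H100-80GB
    gradient_accumulation_steps=8,       # 8 x 8 -> effective batch of 64 examples
    bf16=True,                           # matches the BF16 tensors of the release
    num_train_epochs=1,                  # placeholder; epoch count is not stated
    logging_steps=10,                    # placeholder
)

# The 32,768-token max context length is enforced during tokenization/packing
# (outside TrainingArguments), matching the limit reported above.
```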


🎯 Recommended Use Cases

| Use case | When to use |
|---|---|
| Technical tutorials | Leverage bullet‑point logic for stepwise code walkthroughs. |
| Complex queries (e.g., math, engineering) | The model’s deep reasoning helps avoid oversimplified answers. |
| User education | Clear, scannable outputs aid learning and reduce confusion. |
| Moderation/analysis | The structured format makes responses easy to parse programmatically (see the sketch after this table). |
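
A hypothetical post‑processing helper (not part of the release) showing one way to pull individual bullet‑point steps out of a response for downstream analysis; the step markers it matches are assumptions based on the output format described above.

```python
# Hypothetical helper: extracts bullet-point "thought chain" steps from a response.
import re

# Lines starting with a bullet (•, -, *) or a numbered marker are treated as steps.
STEP_PATTERN = re.compile(r"^\s*(?:•|-|\*|\d+\.)\s*(.+)$", re.MULTILINE)

def extract_steps(response: str) -> list[str]:
    """Return the bullet-point reasoning steps found in a model response."""
    return [match.group(1).strip() for match in STEP_PATTERN.finditer(response)]

demo = "• Step 1 → Restate the problem\n• Step 2 → Check the definition\nAnswer: yes"
print(extract_steps(demo))
# ['Step 1 → Restate the problem', 'Step 2 → Check the definition']
```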

📚 Credits & Contributors

  • gpt‑oss team: Provided the high‑fidelity complex‑reasoning dataset.
  • Qwen3 authors: Open‑source architecture used as distillation target.
  • Jackrong: Implemented the final SFT and packaging for Hugging Face Hub.
