airesupdated-v6
Model Description
This model is a fine-tuned version of Qwen/Qwen3-4B using a novel combination of:
- Tree-of-Thought (ToT) reasoning
- GRPO (Group Relative Policy Optimization) fine-tuning
- Forced Path Differentiation for robust DPO training
- Post-saturation generalization
The model is optimized for structured reasoning tasks, particularly mathematical problem-solving.
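To illustrate the GRPO component listed above, here is a minimal sketch of the group-relative advantage at the core of GRPO: each completion sampled for a prompt is scored, and its reward is normalized against the other completions in the same group. The function name, the `eps` term, and the sample rewards are assumptions for illustration; the card does not document the training code actually used.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    Population standard deviation is an assumption of this sketch;
    implementations differ on the exact estimator.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four completions sampled from one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(advantages)
```

Completions that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed.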
Training Details
Training Configuration
- Base Model: Qwen/Qwen3-4B
- Method: Tree-of-Thought + GRPO
- Episodes: 5 (budget-optimized)
- Datasets:
  - HuggingFaceH4/MATH-500
  - SAGI-1/reasoningData_200k
- Training Samples: 19 high-quality examples
- Loss Reduction: 97% (2.24 → 0.0695)
- Trainable Parameters: 33M / 4B (0.81% via LoRA)
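The percentages in this list can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures quoted above; the variable names are illustrative. The rounded parameter counts give a slightly different share than the 0.81% reported, which presumably comes from exact counts.

```python
# Figures copied from the training configuration above.
initial_loss = 2.24
final_loss = 0.0695
loss_reduction_pct = (initial_loss - final_loss) / initial_loss * 100

# Rounded parameter counts; the exact counts are not given in the card.
trainable_params = 33e6
total_params = 4e9
trainable_pct = trainable_params / total_params * 100

print(f"Loss reduction: {loss_reduction_pct:.1f}%")
print(f"Trainable share (rounded counts): {trainable_pct:.2f}%")
```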
Key Innovations
- Hybrid Reward System: Combines correctness, format, semantic similarity
- Forced Path Differentiation: Ensures that sampled reasoning paths differ enough to form valid DPO triplets (prompt, chosen, rejected)
- Adaptive Exploration: Dynamic temperature adjustment
- Budget Optimization: 60% cost reduction
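As a rough illustration of the hybrid reward idea, the sketch below combines the three signals named above into one scalar. Everything specific is an assumption: the weights, the function name, the exact-match correctness check, and the use of `difflib.SequenceMatcher` as a lexical stand-in for semantic similarity (a real system would more likely use embeddings). The card does not publish the reward actually used.

```python
import re
from difflib import SequenceMatcher

# Matches the <think>...</think><answer>...</answer> layout described in this card.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def hybrid_reward(completion, reference, weights=(0.6, 0.2, 0.2)):
    """Weighted sum of correctness, format, and similarity (hypothetical weights)."""
    reference = reference.strip()
    match = FORMAT_RE.search(completion)
    format_score = 1.0 if match else 0.0
    # Fall back to the whole completion when the tags are missing.
    answer = match.group(1).strip() if match else completion.strip()
    correctness = 1.0 if answer == reference else 0.0
    # Lexical similarity only; a stand-in for a semantic similarity model.
    similarity = SequenceMatcher(None, answer, reference).ratio()
    w_correct, w_format, w_sim = weights
    return w_correct * correctness + w_format * format_score + w_sim * similarity

good = "<think>3x = 15, so x = 5</think>\n<answer>x = 5</answer>"
bad = "The answer is probably 7"
print(hybrid_reward(good, "x = 5"), hybrid_reward(bad, "x = 5"))
```

A well-formed, correct completion scores close to 1.0, while an unformatted wrong one earns only partial similarity credit.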
Training Results
- DPO Triplet Success Rate: 100%
- Data Parse Success Rate: 100%
- Final Loss: 0.0695
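A DPO triplet pairs a prompt with a preferred ("chosen") and a dispreferred ("rejected") response. The card does not describe how Forced Path Differentiation works, so the sketch below is only one plausible reading: pick the best-scored path as chosen, then search from the worst path upward for a rejected path whose text actually differs. The function name, the `min_gap` parameter, and the sample data are hypothetical.

```python
def build_dpo_triplet(prompt, scored_paths, min_gap=0.0):
    """Build a {'prompt', 'chosen', 'rejected'} dict from (text, score) pairs.

    Returns None when no sufficiently different pair exists.
    """
    if len(scored_paths) < 2:
        return None
    ranked = sorted(scored_paths, key=lambda p: p[1], reverse=True)
    chosen_text, chosen_score = ranked[0]
    # Walk from the worst path upward until one differs from the chosen path.
    for text, score in reversed(ranked[1:]):
        if text != chosen_text and chosen_score - score >= min_gap:
            return {"prompt": prompt, "chosen": chosen_text, "rejected": text}
    return None

# Hypothetical scored reasoning paths for one prompt.
paths = [("x = 5", 0.9), ("x = 5", 0.9), ("x = 7", 0.1)]
triplet = build_dpo_triplet("Solve: 3x + 5 = 20", paths)
print(triplet)
```

Under this reading, the triplet success rate is the fraction of prompts for which such a function returns a triplet rather than `None`.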
Usage
Quick Start (with adapter)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    torch_dtype=torch.float32,
    device_map="auto"
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
# Load trained adapter
model = PeftModel.from_pretrained(base_model, "ziadrone/airesupdated-v6")
# Generate
prompt = "Solve: 3x + 5 = 20"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
print(tokenizer.decode(outputs[0]))
Expected Output Format
<think>
[Step-by-step reasoning]
</think>
<answer>
[Final answer]
</answer>
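Downstream code usually needs the final answer separately from the reasoning. The helper below is a minimal sketch that assumes the model follows the tag layout shown above; `parse_response` is an illustrative name rather than part of any library, and it returns `None` for a section whose tags are missing.

```python
import re

def parse_response(text):
    """Split a response into its <think> reasoning and <answer> parts."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
    }

# Hypothetical model output in the expected format.
sample = "<think>\n3x = 15, so x = 5\n</think>\n<answer>\nx = 5\n</answer>"
parsed = parse_response(sample)
print(parsed["answer"])
```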
Performance
- Training Loss: 2.24 → 0.0695 (97% reduction)
- DPO Success Rate: 100%
- Cost Reduction: 60%
Citation
@misc{airesupdated_v6,
  author = {ziadrone},
  title = {airesupdated-v6: ToT Reasoning with GRPO},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ziadrone/airesupdated-v6}}
}
License
Apache 2.0 (inherited from base model)
Last Updated: 2025-11-05