airesupdated-v6

Model Description

This model is a fine-tuned version of Qwen/Qwen3-4B using a novel combination of:

  • Tree-of-Thought (ToT) reasoning
  • GRPO (Group Relative Policy Optimization) fine-tuning
  • Forced Path Differentiation for robust DPO training
  • Post-saturation generalization

The model is optimized for structured reasoning tasks, particularly mathematical problem-solving.
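GRPO sidesteps a learned value critic by scoring each sampled completion against the other completions drawn for the same prompt. The snippet below is a minimal sketch of that group-relative advantage step, not this model's training code; the reward values and group size are illustrative.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # GRPO-style advantages: each reward is normalized against the mean and
    # standard deviation of its own group of sampled completions.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One group of four completions for a single prompt (illustrative rewards).
rewards = torch.tensor([[0.2, 0.9, 0.4, 0.9]])
print(group_relative_advantages(rewards))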

Training Details

Training Configuration

  • Base Model: Qwen/Qwen3-4B
  • Method: Tree-of-Thought + GRPO
  • Episodes: 5 (budget-optimized)
  • Datasets:
    • HuggingFaceH4/MATH-500
    • SAGI-1/reasoningData_200k
  • Training Samples: 19 high-quality examples
  • Loss Reduction: 97% (2.24 → 0.0695)
  • Trainable Parameters: 33M / 4B (0.81% via LoRA; see the configuration sketch below)
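The adapter hyperparameters are not listed in this card, so the snippet below is only a sketch of a LoRA setup in this size range using peft; the rank, alpha, dropout, and target modules are assumptions, not the values used for this adapter.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")

# Hypothetical LoRA settings; rank/alpha/targets are illustrative only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts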

Key Innovations

  1. Hybrid Reward System: Combines correctness, format, and semantic-similarity rewards (sketched below)
  2. Forced Path Differentiation: Guarantees distinct chosen/rejected reasoning paths so valid DPO triplets can be formed
  3. Adaptive Exploration: Dynamic adjustment of the sampling temperature
  4. Budget Optimization: 60% reduction in training cost
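The exact reward weights and similarity metric behind the hybrid reward are not documented here. The sketch below shows one way such a reward could be composed; the weights and the externally supplied similarity score are hypothetical, and the format/correctness checks simply reuse the <think>/<answer> template described under Usage.

import re

def format_reward(completion: str) -> float:
    # 1.0 if the completion follows the <think>...</think><answer>...</answer> template.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, reference: str) -> float:
    # Exact-match check on the extracted final answer.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

def hybrid_reward(completion: str, reference: str, similarity: float,
                  w_correct: float = 0.6, w_format: float = 0.2, w_sim: float = 0.2) -> float:
    # Weighted sum of the three signals; the weights here are illustrative.
    return (w_correct * correctness_reward(completion, reference)
            + w_format * format_reward(completion)
            + w_sim * similarity)

sample = "<think>3x = 15, so x = 5.</think>\n<answer>x = 5</answer>"
print(hybrid_reward(sample, "x = 5", similarity=0.9))  # 0.98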

Training Results

  • DPO Triplet Success Rate: 100%
  • Data Parse Success Rate: 100%
  • Final Loss: 0.0695

Usage

Quick Start (with adapter)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    torch_dtype=torch.float32,
    device_map="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

# Load trained adapter
model = PeftModel.from_pretrained(base_model, "ziadrone/airesupdated-v6")

# Generate
prompt = "Solve: 3x + 5 = 20"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(outputs[0]))
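If loading the adapter on top of the base model at runtime is inconvenient, the LoRA weights can optionally be folded into the base model with peft and saved as a standalone checkpoint (continuing from the variables above; the output directory name is arbitrary):

# Optional: merge the adapter into the base weights and save a standalone copy.
merged = model.merge_and_unload()
merged.save_pretrained("airesupdated-v6-merged")
tokenizer.save_pretrained("airesupdated-v6-merged")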

Expected Output Format

<think>
[Step-by-step reasoning]
</think>
<answer>
[Final answer]
</answer>
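A small helper can pull the two sections back out of a generated completion. The tag names come from the format above; everything else in this sketch is illustrative.

import re

def parse_response(text: str):
    # Extract the reasoning trace and the final answer from a formatted completion.
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

reasoning, final_answer = parse_response(
    "<think>3x + 5 = 20, so 3x = 15 and x = 5.</think>\n<answer>x = 5</answer>"
)
print(final_answer)  # x = 5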

Performance

  • Training Loss: 2.24 → 0.0695 (97% reduction)
  • DPO Success Rate: 100%
  • Cost Reduction: 60%

Citation

@misc{airesupdated_v6,
  author = {ziadrone},
  title = {airesupdated-v6: ToT Reasoning with GRPO},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ziadrone/airesupdated-v6}}
}

License

Apache 2.0 (inherited from base model)


Last Updated: 2025-11-05
