codelion/gemma-3-1b-it-reasoning-grpo-lora

🧠 Reasoning LoRA with GRPO Training

This LoRA adapter adds structured reasoning capabilities to google/gemma-3-1b-it using <think></think> tags. It was trained with GRPO (Group Relative Policy Optimization) on self-generated preference data.

🎯 Key Features

  • Structured Thinking: Teaches models to use <think></think> tags for chain-of-thought reasoning
  • GRPO Training: Uses preference learning to optimize reasoning quality
  • Self-Generated Data: No external datasets required; training prompts are produced with the Magpie approach
  • Multi-Domain: Effective across mathematics, logic, science, and problem-solving

📊 Performance Metrics

  • Base Model: google/gemma-3-1b-it
  • Training Method: GRPO (Group Relative Policy Optimization)
  • LoRA Rank: 64
  • LoRA Alpha: 128
  • Training Samples: 107
  • Thinking Tag Usage: 60.0%
  • Average Quality Score: 5.60
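
The LoRA hyperparameters above can be expressed as a PEFT configuration. This is a hypothetical reconstruction: only r=64 and lora_alpha=128 come from this card; the target modules, dropout, and other fields are assumptions (typical attention projections for Gemma-style models), not the published training config.

```python
from peft import LoraConfig

# Sketch of the adapter configuration; only r and lora_alpha are taken
# from this card. target_modules and lora_dropout are assumptions.
config = LoraConfig(
    r=64,                # LoRA rank (from the card)
    lora_alpha=128,      # LoRA alpha (from the card)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,   # assumed
    bias="none",
    task_type="CAUSAL_LM",
)
```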

🔧 Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Load reasoning LoRA adapter
model = PeftModel.from_pretrained(model, "codelion/gemma-3-1b-it-reasoning-grpo-lora")

# Use with thinking prompt
prompt = '''Think step by step and use <think></think> tags to show your reasoning process.

Problem: If a train travels 120 miles in 2 hours, then increases its speed by 30 mph for the next hour, how many total miles does it travel?

Response:'''

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
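
Since the reasoning trace is wrapped in <think></think> tags, it is easy to separate from the final answer in post-processing. The helper below is not part of the adapter; it is a small sketch assuming the output format described in this card.

```python
import re

def split_thinking(text: str):
    """Split a generated response into (thinking trace, final answer).

    Returns (None, text) when no <think></think> block is present.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    thinking = match.group(1).strip()
    answer = text[match.end():].strip()
    return thinking, answer

# Toy example, not actual model output:
thinking, answer = split_thinking("<think>2 + 2 = 4</think>\nThe answer is 4.")
```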

📈 Expected Output Format

The model will generate responses with structured thinking:

<think>
First, I need to find the train's initial speed.
Speed = Distance / Time = 120 miles / 2 hours = 60 mph

For the first 2 hours: 120 miles
For the next hour, speed increases by 30 mph: 60 + 30 = 90 mph
Distance in third hour: 90 mph × 1 hour = 90 miles

Total distance = 120 + 90 = 210 miles
</think>

To solve this step by step:

First, I'll find the train's initial speed:
- Distance = 120 miles, Time = 2 hours
- Speed = 120 ÷ 2 = 60 mph

Next, I'll calculate the distance for each segment:
- First 2 hours: 120 miles (given)
- Third hour: speed increases by 30 mph → 60 + 30 = 90 mph
- Distance in third hour: 90 Γ— 1 = 90 miles

Total distance = 120 + 90 = 210 miles

🧪 Training Details

  • Method: GRPO (Group Relative Policy Optimization)
  • Data Generation: Magpie approach with reasoning-focused prompts
  • Preference Learning: Multiple responses ranked by reasoning quality
  • Domains: Mathematics, logic puzzles, science, programming, philosophy
  • Quality Scoring: Based on thinking tag usage, reasoning markers, and structure
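
The quality-scoring idea above (thinking tag usage, reasoning markers, structure) could be implemented as a simple heuristic. The function below is a hypothetical reimplementation; the card does not publish the exact rubric, so the marker list and weights are assumptions for illustration only.

```python
import re

# Assumed marker list; the actual training rubric is not published.
REASONING_MARKERS = ("because", "therefore", "since", "first", "next", "then", "finally")

def score_response(text: str) -> float:
    """Heuristic reasoning-quality score: tags + markers + structure."""
    lower = text.lower()
    score = 0.0
    if re.search(r"<think>.*?</think>", text, flags=re.DOTALL):
        score += 3.0  # reward structured thinking tags
    score += sum(1.0 for m in REASONING_MARKERS if m in lower)  # reasoning markers
    if "\n" in text.strip():
        score += 1.0  # multi-line, structured output
    return score

s = score_response("<think>First, 2 + 2 = 4.</think>\nTherefore the answer is 4.")
```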

📚 Training Data

The model was trained on self-generated reasoning problems across multiple domains:

  • Mathematical problem-solving
  • Logic puzzles and riddles
  • Scientific analysis
  • Programming challenges
  • Philosophical reasoning
  • Decision-making scenarios

🎭 Reasoning Patterns Learned

  • Step-by-step analysis: Breaking complex problems into smaller parts
  • Causal reasoning: Using "because", "therefore", "since" connections
  • Sequential thinking: "First", "next", "then", "finally" progression
  • Structured output: Clear separation of thinking and final response

🔬 Evaluation

The adapter was evaluated on diverse reasoning tasks:

  • Thinking tag usage rate: 60.0%
  • Average reasoning quality score: 5.60
  • Response comprehensiveness: 454 words average
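
A metric like the thinking-tag usage rate above can be computed over a batch of generated responses. This is an illustrative sketch with toy data, not the evaluation harness used for the card.

```python
def thinking_tag_rate(responses):
    """Fraction of responses containing a complete <think></think> block."""
    used = sum(1 for r in responses if "<think>" in r and "</think>" in r)
    return used / len(responses)

# Toy sample: one response uses the tags, one does not.
responses = [
    "<think>60 mph for 2 hours.</think>\nTotal: 210 miles.",
    "The answer is 42.",
]
rate = thinking_tag_rate(responses)  # 0.5 for this toy sample
```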

🏷️ Related


This adapter is part of the Ellora project, a collection of standardized recipes for enhancing LLM capabilities.
