# Qwen3-VL-8B-GRPO

This model is a fine-tuned version of unsloth/Qwen3-VL-8B-Instruct using GRPO (Group Relative Policy Optimization).
## Training Process
- SFT (Supervised Fine-Tuning): Initial fine-tuning on instruction-following dataset
- GRPO (Group Relative Policy Optimization): Reinforcement learning from preference data
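At its core, the GRPO step samples several completions for each prompt and scores each one relative to the group's mean reward, so no separate value network is needed. A minimal sketch of that group-relative advantage computation (illustrative only; the function name is hypothetical and not from this model's training code):

```python
# Sketch of GRPO's group-relative advantage: normalize each completion's
# reward against the statistics of its own sampling group.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """advantage_i = (r_i - mean(rewards)) / (std(rewards) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Completions above the group mean get positive advantages (reinforced);
# those below get negative advantages (discouraged).
advantages = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

Advantages always sum to zero within a group, which is what makes the signal "relative": only out-performing sibling completions is rewarded.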
## Model Details
- Base Model: Qwen3-VL-8B-Instruct
- Training Method: GRPO (Group Relative Policy Optimization; custom implementation, not TRL's `DPOTrainer`)
- Parameter Count: 8B
- Quantization: Trained with 4-bit quantization, merged to full precision
- Format: Safetensors
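Since the merged weights are full precision, memory-constrained users can re-quantize at load time. A configuration sketch assuming transformers' `BitsAndBytesConfig` (requires the `bitsandbytes` package and a CUDA GPU; the model ID is a placeholder):

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# Hypothetical 4-bit re-quantization of the merged full-precision
# checkpoint, mirroring the 4-bit setup used during training.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForImageTextToText.from_pretrained(
    "your-username/qwen3-vl-8b-grpo",
    quantization_config=bnb_config,
    device_map="auto",
)
```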
## Usage

```python
from transformers import AutoModelForImageTextToText, AutoTokenizer

# Qwen3-VL is a vision-language model, so it loads through the
# image-text-to-text auto class (recent transformers release required)
# rather than AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained("your-username/qwen3-vl-8b-grpo")
tokenizer = AutoTokenizer.from_pretrained("your-username/qwen3-vl-8b-grpo")

# For text-only inference
prompt = "Human: Explain photosynthesis in simple terms.\n\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## Training Details
- Loss: Cross-entropy with reward modeling
- Final Loss: 0.6478
- Final Reward Diff: 0.9294
- Training Framework: Unsloth
- Hardware: NVIDIA RTX A6000
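The reward-diff metric above presumably tracks the average margin by which chosen completions out-score rejected ones during preference training; a hypothetical sketch of such a metric (the actual logging code is not published, so this is an assumed interpretation):

```python
# Hypothetical reward-margin metric: mean of (chosen - rejected) reward
# over a batch of preference pairs. Larger is better.
def mean_reward_diff(chosen_rewards, rejected_rewards):
    diffs = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(diffs) / len(diffs)
```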
## Limitations
- This model is primarily optimized for text-only tasks despite being a vision-language model
- Vision capabilities are inherited from the base model but not specifically fine-tuned
## Citation

If you use this model, please cite:

```bibtex
@misc{qwen3-vl-grpo,
  title={Qwen3-VL-8B-GRPO: Vision-Language Model with RLHF},
  year={2026},
  publisher={Hugging Face}
}
```