
Qwen3-VL-8B-GRPO

This model is a fine-tuned version of unsloth/Qwen3-VL-8B-Instruct using GRPO (Group Relative Policy Optimization).

Training Process

  1. SFT (Supervised Fine-Tuning): Initial fine-tuning on an instruction-following dataset
  2. GRPO (Group Relative Policy Optimization): Reinforcement learning from preference data
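The core idea of GRPO is to normalize each sampled response's reward against the other responses in its group, instead of training a separate value model. A minimal sketch of that group-relative advantage computation (illustrative only, not this repo's actual training code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four responses sampled for one prompt, scored by a reward function.
advs = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

Responses scoring above the group mean get positive advantages and are reinforced; those below the mean are penalized.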

Model Details

  • Base Model: Qwen3-VL-8B-Instruct
  • Training Method: GRPO (custom implementation without DPO Trainer)
  • Parameter Count: 8B
  • Quantization: Trained with 4-bit quantization, merged to full precision
  • Format: Safetensors

Usage

from transformers import AutoModelForImageTextToText, AutoTokenizer

# Qwen3-VL is a vision-language checkpoint, so use the image-text-to-text
# auto class rather than AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained("chivier/BioVLM_8B-V1")
tokenizer = AutoTokenizer.from_pretrained("chivier/BioVLM_8B-V1")

# For text-only inference
prompt = "Human: Explain photosynthesis in simple terms.\n\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)  # cap generated tokens, not total length
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
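The example above hard-codes the prompt string; a small helper can build multi-turn prompts in the same "Human:/Assistant:" style. Note the exact template the model expects is an assumption based on the example in this card:

```python
def build_prompt(turns):
    """Join (role, text) pairs into the Human/Assistant prompt format."""
    body = "\n\n".join(f"{role}: {text}" for role, text in turns)
    return body + "\n\nAssistant: "

prompt = build_prompt([("Human", "Explain photosynthesis in simple terms.")])
```

The trailing "Assistant: " leaves the prompt open for the model to complete.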

Training Details

  • Loss: Cross-entropy with reward modeling
  • Final Loss: 0.6478
  • Final Reward Diff: 0.9294
  • Training Framework: Unsloth
  • Hardware: NVIDIA RTX A6000
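"Reward Diff" above presumably tracks the average margin between the rewards assigned to preferred and rejected responses over a preference batch. A hypothetical sketch of such a metric (names and semantics are assumptions, not taken from the training code):

```python
def mean_reward_diff(chosen_scores, rejected_scores):
    """Average reward margin between chosen and rejected responses."""
    diffs = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    return sum(diffs) / len(diffs)

# e.g. reward-model scores over a small preference batch
margin = mean_reward_diff([1.2, 0.9, 1.1], [0.3, 0.1, 0.2])
```

A growing positive margin during training indicates the policy increasingly favors the preferred responses.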

Limitations

  • This is primarily optimized for text-only tasks despite being a vision-language model
  • Vision capabilities are inherited from the base model but not specifically fine-tuned

Citation

If you use this model, please cite:

@misc{qwen3-vl-grpo,
  title={Qwen3-VL-8B-GRPO: Vision-Language Model with RLHF},
  year={2026},
  publisher={Hugging Face}
}
Model tree for chivier/BioVLM_8B-V1

Finetuned
(103)
this model