# Qwen3-VL-8B-GRPO

This model is a fine-tuned version of unsloth/Qwen3-VL-8B-Instruct using GRPO (Group Relative Policy Optimization).
## Training Process
- SFT (Supervised Fine-Tuning): Initial fine-tuning on instruction-following dataset
- GRPO (Group Relative Policy Optimization): Reinforcement learning from preference data
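At its core, the GRPO step samples several completions for each prompt and scores each one relative to the group's mean reward, so no separate value network is needed. A minimal sketch of that group-relative advantage computation (illustrative only; the function name is hypothetical and not from this model's training code):

```python
# Sketch of GRPO's group-relative advantage: normalize each completion's
# reward against the statistics of its own sampling group.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """advantage_i = (r_i - mean(rewards)) / (std(rewards) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Completions above the group mean get positive advantages (reinforced);
# those below get negative advantages (discouraged).
advantages = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

Advantages always sum to zero within a group, which is what makes the signal "relative": only out-performing sibling completions is rewarded.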
## Model Details
- Base Model: Qwen3-VL-8B-Instruct
- Training Method: GRPO (Group Relative Policy Optimization; custom implementation, not TRL's `DPOTrainer`)
- Parameter Count: 8B
- Quantization: Trained with 4-bit quantization, merged to full precision
- Format: Safetensors
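Since the merged weights are full precision, memory-constrained users can re-quantize at load time. A configuration sketch assuming transformers' `BitsAndBytesConfig` (requires the `bitsandbytes` package and a CUDA GPU; the model ID is a placeholder):

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# Hypothetical 4-bit re-quantization of the merged full-precision
# checkpoint, mirroring the 4-bit setup used during training.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForImageTextToText.from_pretrained(
    "your-username/qwen3-vl-8b-grpo",
    quantization_config=bnb_config,
    device_map="auto",
)
```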
## Usage

```python
from transformers import AutoModelForImageTextToText, AutoTokenizer

# Qwen3-VL is a vision-language model, so it loads through the
# image-text-to-text auto class (recent transformers release required)
# rather than AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained("your-username/qwen3-vl-8b-grpo")
tokenizer = AutoTokenizer.from_pretrained("your-username/qwen3-vl-8b-grpo")

# For text-only inference
prompt = "Human: Explain photosynthesis in simple terms.\n\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## Training Details
- Loss: Cross-entropy with reward modeling
- Final Loss: 0.6478
- Final Reward Diff: 0.9294
- Training Framework: Unsloth
- Hardware: NVIDIA RTX A6000
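The reward-diff metric above presumably tracks the average margin by which chosen completions out-score rejected ones during preference training; a hypothetical sketch of such a metric (the actual logging code is not published, so this is an assumed interpretation):

```python
# Hypothetical reward-margin metric: mean of (chosen - rejected) reward
# over a batch of preference pairs. Larger is better.
def mean_reward_diff(chosen_rewards, rejected_rewards):
    diffs = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(diffs) / len(diffs)
```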
## Limitations
- This model is primarily optimized for text-only tasks despite being a vision-language model
- Vision capabilities are inherited from the base model but not specifically fine-tuned
## Citation

If you use this model, please cite:

```bibtex
@misc{qwen3-vl-grpo,
  title={Qwen3-VL-8B-GRPO: Vision-Language Model with RLHF},
  year={2026},
  publisher={Hugging Face}
}
```