---
language: en
license: apache-2.0
library_name: peft
tags:
- paligemma
- visual-question-answering
- vqa
- clevr
- qlora
- multimodal
- peft
base_model: google/paligemma-3b-pt-224
datasets:
- leonardPKU/clevr_cogen_a_train
pipeline_tag: visual-question-answering
---

# QLoRA Fine-tuned PaliGemma-3B for Visual Reasoning on CLEVR-CoGen

This repository contains the QLoRA adapters for the `google/paligemma-3b-pt-224` model, fine-tuned for a Visual Question Answering (VQA) task on the `leonardPKU/clevr_cogen_a_train` dataset.

Compared to the base PaliGemma model, this fine-tuned model demonstrates significantly improved performance on questions requiring spatial and logical reasoning about complex scenes with multiple objects. The use of QLoRA (4-bit quantization) makes it possible to run and train this model on consumer-grade hardware.

## Model Description

- **Base Model:** `google/paligemma-3b-pt-224`
- **Fine-tuning Technique:** QLoRA (Quantized Low-Rank Adaptation)
- **Task:** Visual Question Answering (VQA)
- **Dataset:** A subset of `leonardPKU/clevr_cogen_a_train`
- **Key Improvement:** Enhanced ability to perform complex reasoning, counting, and attribute identification in visual scenes.

## How to Use

To use this model, you must load the 4-bit quantized base model and then apply the PEFT adapters from this repository.

### Installation

First, ensure you have the necessary libraries installed:

```bash
pip install -q transformers peft bitsandbytes accelerate Pillow requests
```
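### Loading the Model

Below is a minimal sketch of loading the 4-bit base model and applying the adapters. The adapter id (`your-username/paligemma-3b-qlora-clevr`) and the image URL are placeholders, and the NF4 quantization settings assume the common QLoRA configuration; adjust them to match this repository. Note that the base model is gated on the Hub, so you may need to authenticate with a Hugging Face token first.

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration
from peft import PeftModel

base_model_id = "google/paligemma-3b-pt-224"
# Placeholder: replace with this repository's id on the Hub.
adapter_id = "your-username/paligemma-3b-qlora-clevr"

# 4-bit NF4 quantization config, assumed to match the QLoRA fine-tuning setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(base_model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Apply the QLoRA adapters from this repository on top of the quantized base.
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# Example inference on a CLEVR-style scene (the URL is illustrative only).
url = "https://example.com/clevr_scene.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "How many cubes are there?"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)

# Decode only the newly generated tokens, skipping the prompt.
answer = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
```

With `device_map="auto"`, the quantized weights are placed on the available GPU automatically; the adapters add only a small memory overhead on top of the 4-bit base model.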