metadata

language: en
license: other
datasets:
  - DeepMount00/ner_training
tags:
  - vision
  - multimodal
  - OCR
  - SmolVLM
pipeline_tag: image-text-to-text

SmolVLM Base - OCR Fine-tuned

This model is SmolVLM-Base fine-tuned for OCR. It was trained with QLoRA on the DeepMount00/ner_training dataset, and the adapter weights have been merged into the base model, so it loads as a single standalone checkpoint.

Model Details

Base Model: HuggingFaceTB/SmolVLM-Base
Task: Optical Character Recognition (OCR)
Training Method: QLoRA with 4-bit quantization
Target Modules: down_proj, o_proj, k_proj, q_proj, gate_proj, up_proj, v_proj
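
The training setup listed above can be sketched as two sets of keyword arguments. This is a minimal illustration, not the configuration used for this model: only 4-bit loading and the target module names come from this card. The rank, alpha, dropout, and the `nf4` quantization type are placeholder assumptions, and `qlora_settings` is a hypothetical helper name.

```python
# Module names copied from the "Target Modules" line of this card.
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]


def qlora_settings(rank: int = 8, alpha: int = 16, dropout: float = 0.05) -> dict:
    """Return keyword arguments for the two config objects a QLoRA run needs.

    rank, alpha, dropout and the "nf4" quantization type are placeholders,
    not the values used to train this model.
    """
    return {
        # Load the frozen base weights in 4-bit precision.
        "quantization": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
        # Attach low-rank adapters to the listed projection layers.
        "lora": {
            "r": rank,
            "lora_alpha": alpha,
            "lora_dropout": dropout,
            "target_modules": list(TARGET_MODULES),
        },
    }


settings = qlora_settings()
# With transformers, peft and bitsandbytes installed, these map onto:
#   BitsAndBytesConfig(**settings["quantization"])
#   LoraConfig(**settings["lora"])
```

Keeping the values in plain dictionaries makes them easy to inspect or log before any model weights are loaded.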

Usage

from transformers import AutoProcessor, Idefics3ForConditionalGeneration
import torch
from PIL import Image

model_id = "DeepMount00/SmolVLM-Base-ocr_base"
processor = AutoProcessor.from_pretrained(model_id)
model = Idefics3ForConditionalGeneration.from_pretrained(model_id)

# Load your image
image = Image.open("path_to_your_image.jpg").convert("RGB")

# Prepare the prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a model specialized in OCR"},
            {"type": "image"},
            {"type": "text", "text": "Extract the text from this image"}
        ]
    }
]

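# Render the chat messages into a prompt string (assumes the processor ships a chat template),
# then tokenize the prompt and preprocess the image together
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")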

# Generate
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
    
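# Decode and print the response; generate() returns the prompt tokens followed by
# the new tokens, so the decoded string starts with the prompt text
print(processor.decode(outputs[0], skip_special_tokens=True))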
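
Because `generate` returns the prompt tokens followed by the newly generated ones, the decoded string repeats the prompt before the extracted text. A minimal sketch of trimming it, assuming the `outputs` and `inputs` objects from the usage code; `strip_prompt_tokens` is a hypothetical helper name, not part of transformers, and the ids below are toy values.

```python
from typing import Sequence


def strip_prompt_tokens(output_ids: Sequence[int], prompt_length: int) -> Sequence[int]:
    """Drop the first prompt_length ids, keeping only the newly generated tokens."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]


# Toy ids standing in for a real generation: 3 prompt tokens, then 2 new tokens.
example_output = [101, 102, 103, 7, 8]
new_ids = strip_prompt_tokens(example_output, prompt_length=3)
print(new_ids)
```

With the real model, the same slice would be applied as `strip_prompt_tokens(outputs[0], inputs["input_ids"].shape[1])` before calling `processor.decode(..., skip_special_tokens=True)`, so that only the recognized text is printed.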