Cernis-Legal-OCR

A specialized vision-language model fine-tuned for legal document OCR, built on Qwen2.5-VL-7B-Instruct.

Model Description

Cernis-Legal-OCR is a LoRA fine-tune of Qwen2.5-VL-7B-Instruct, optimized for extracting text from legal documents such as court filings, case law, and contracts. The model was trained on 5,000 synthetic legal documents generated from the Caselaw Access Project dataset.

Key Features:

Handles dense legal text with complex formatting
Robust to scan artifacts, photocopying effects, and document degradation
Preserves legal document structure and formatting
Processes documents with stamps, margin notes, and annotations

Training Details

Base Model: Qwen2.5-VL-7B-Instruct
Fine-Tuning Method: LoRA
Training Data: 5,000 synthetic legal documents generated from the Caselaw Access Project dataset

Intended Use

This model is designed for:

Legal document digitization and OCR
Converting scanned case law into searchable text
Processing court filings and legal contracts
Legal research and document analysis workflows

How to Use

from unsloth import FastVisionModel
from PIL import Image

# Load the model; the second return value handles both image preprocessing and tokenization
model, tokenizer = FastVisionModel.from_pretrained(
    "coolAI/cernis-legal-ocr",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Prepare input: the message holds an image placeholder, the image itself is passed separately
image = Image.open("legal_document.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract all text from this legal document."}
    ]
}]

# Render the chat prompt as text, then encode the image and prompt together
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

# Generate, then decode only the newly generated tokens (skipping the echoed prompt)
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7)
generated = outputs[0][inputs["input_ids"].shape[1]:]
text = tokenizer.decode(generated, skip_special_tokens=True)
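
Legal filings usually span many pages, while the snippet above handles one image. The following is a minimal sketch of one way to loop over pages and join the results. The names `ocr_document`, `ocr_page`, and `fake_ocr_page`, as well as the page-separator format, are hypothetical and not part of this model's API; `ocr_page` stands for any function you write that wraps the single-image code above and returns the extracted text.

```python
from typing import Callable, Iterable, List


def ocr_document(pages: Iterable[str], ocr_page: Callable[[str], str]) -> str:
    """Run a per-page OCR callable over page image paths and join the results."""
    chunks: List[str] = []
    for number, path in enumerate(pages, start=1):
        text = ocr_page(path).strip()
        # Label each page so the combined text can be traced back to its scan.
        chunks.append(f"--- Page {number} ---\n{text}")
    return "\n\n".join(chunks)


# Stand-in for the real model call (assumption: replace with a wrapper
# around the single-image inference code).
def fake_ocr_page(path: str) -> str:
    return f"text from {path}"


result = ocr_document(["page_001.png", "page_002.png"], fake_ocr_page)
print(result)
```

Passing the OCR step in as a callable keeps the loop independent of how the model is loaded, so the same helper works with a stub during testing.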

Limitations

Trained only on synthetic data; accuracy on real-world scanned documents may be lower
Limited to English legal documents
Best performance on documents similar to US case law formatting
May require additional fine-tuning for specific legal document types
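
Given these limitations, it can help to measure accuracy on a small hand-transcribed sample of your own documents before relying on the output. One common metric is character error rate (CER): the edit distance between the model output and the reference transcription, divided by the reference length. The sketch below is a plain-Python illustration, not an evaluation used by the model authors; the function names and the sample strings are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length; 0.0 is a perfect transcription."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings: the hypothesis misreads "m" as "rn", a classic OCR confusion.
reference = "The judgment of the district court is AFFIRMED."
hypothesis = "The judgrnent of the district court is AFFIRMED."
cer = character_error_rate(reference, hypothesis)
print(f"CER: {cer:.4f}")
```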

Training Data

The model was trained on synthetic legal documents generated from the Caselaw Access Project dataset. Documents were rendered with realistic legal formatting and augmented with scan artifacts, photocopying effects, stamps, and margin notes to simulate real-world OCR conditions.
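
To make the idea of scan-artifact augmentation concrete, here is a toy sketch that degrades a grayscale page represented as nested lists of pixel values. It is not the pipeline used to build the training set, which is not published here; the function name `degrade_page`, its parameters, and all numeric values are illustrative assumptions. A real pipeline would operate on rendered page images with an imaging library.

```python
import random
from typing import List

Page = List[List[int]]  # grayscale pixels: 0 = black ink, 255 = white paper


def degrade_page(page: Page, speckle_prob: float = 0.02,
                 fade: float = 0.8, seed: int = 0) -> Page:
    """Return a degraded copy: faded contrast, random speckles, dark left edge."""
    rng = random.Random(seed)      # seeded so the augmentation is reproducible
    width = len(page[0])
    edge = max(1, width // 20)     # width of the dark photocopy band
    degraded: Page = []
    for row in page:
        new_row = []
        for x, value in enumerate(row):
            # Pull every pixel toward mid-gray to mimic a faded copy.
            pixel = 128 + (value - 128) * fade
            # Darken a band along the left edge, as on a bound-volume photocopy.
            if x < edge:
                pixel *= 0.5
            # Flip occasional pixels to black or white (salt-and-pepper speckle).
            if rng.random() < speckle_prob:
                pixel = rng.choice((0, 255))
            new_row.append(int(min(255, max(0, round(pixel)))))
        degraded.append(new_row)
    return degraded


clean = [[255] * 40 for _ in range(10)]  # a blank white "page"
noisy = degrade_page(clean)
```

The function builds a new grid instead of editing in place, so the clean page remains available as the ground-truth side of a training pair.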

Citation

If you use this model, please cite:

@misc{cernis-legal-ocr,
  title={Cernis-Legal-OCR: A Specialized Vision Model for Legal Document Processing},
  author={Cernis AI},
  year={2025},
  howpublished={\url{https://huggingface.co/coolAI/cernis-legal-ocr}}
}

Acknowledgments

Built using Unsloth for efficient fine-tuning and trained on data from the Caselaw Access Project.


Model Size: 8B params (Safetensors)
Tensor Type: BF16
