Cernis-Legal-OCR
A specialized vision-language model fine-tuned for legal document OCR, built on Qwen2.5-VL-7B-Instruct.
Model Description
Cernis-Legal-OCR is a LoRA fine-tuned version of Qwen2.5-VL-7B-Instruct, specifically optimized for extracting text from legal documents including court filings, case law, contracts, and other legal materials. The model was trained on 5,000 synthetic legal documents generated from the Caselaw Access Project dataset.
Key Features:
- Handles dense legal text with complex formatting
- Robust to scan artifacts, photocopying effects, and document degradation
- Preserves legal document structure and formatting
- Processes documents with stamps, margin notes, and annotations
Training Details
- Base Model: Qwen2.5-VL-7B-Instruct
- Method: LoRA fine-tuning with Unsloth
- Training Data: 5,000 synthetic legal documents generated from the Caselaw Access Project dataset
Intended Use
This model is designed for:
- Legal document digitization and OCR
- Converting scanned case law into searchable text
- Processing court filings and legal contracts
- Legal research and document analysis workflows
How to Use
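from unsloth import FastVisionModel
from PIL import Image

# Load the model and its processor (Unsloth returns the processor as "tokenizer")
model, tokenizer = FastVisionModel.from_pretrained(
    "coolAI/cernis-legal-ocr",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Prepare input: the message holds an image placeholder, and the image itself
# is passed to the processor together with the rendered prompt
image = Image.open("legal_document.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract all text from this legal document."},
    ],
}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, prompt, add_special_tokens=False, return_tensors="pt").to("cuda")

# Generate (temperature only takes effect when sampling is enabled)
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=True)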
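Legal filings usually span many pages, so the single-image steps above are often wrapped in a loop. The following is a minimal sketch of that idea, not part of the model's API: `ocr_directory` and `join_pages` are hypothetical helpers, and `ocr_page` is an assumed callable that you would write yourself to run the load, generate, and decode steps on one image file and return its text.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


# Hypothetical helper: `ocr_page` stands in for a function that wraps the
# generate/decode steps shown above and returns the text of one image file.
def ocr_directory(
    folder: Path,
    ocr_page: Callable[[Path], str],
    extensions: Iterable[str] = (".png", ".jpg", ".jpeg"),
) -> Dict[str, str]:
    """Run `ocr_page` on every image in `folder`, in sorted filename order."""
    allowed = {ext.lower() for ext in extensions}
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in allowed:
            results[path.name] = ocr_page(path)
    return results


def join_pages(results: Dict[str, str], separator: str = "\n\n") -> str:
    """Concatenate per-page text in the order the pages were processed."""
    return separator.join(text.strip() for text in results.values())
```

Because pages are processed in sorted filename order, zero-padded names such as `page_001.png` keep the output in reading order.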
Limitations
- Trained only on synthetic data, so accuracy on real-world scans may be lower than on documents that resemble the training set
- Limited to English legal documents
- Best performance on documents similar to US case law formatting
- May require additional fine-tuning for specific legal document types
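One way to judge whether these limitations matter for a particular collection is to compare the model's output against a few hand-transcribed pages. The sketch below computes character error rate (CER) from Levenshtein edit distance using only the standard library. It is a generic evaluation recipe, not a metric reported for this model, and the function names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length (0.0 means a perfect match)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A CER that is noticeably higher on your own scans than on clean case-law pages is a reasonable signal that additional fine-tuning may be worthwhile.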
Training Data
The model was trained on synthetic legal documents generated from the Caselaw Access Project dataset. Documents were rendered with realistic legal formatting and augmented with scan artifacts, photocopying effects, stamps, and margin notes to simulate real-world OCR conditions.
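To make the idea of scan-style augmentation concrete, here is a small NumPy sketch that degrades a clean grayscale page with uneven illumination, sensor noise, and dark specks. This is an illustration under assumed parameters, not the pipeline used to build the training set. The function name, the default values, and the choice of effects are all hypothetical, and stamps and margin notes are not covered.

```python
import numpy as np


def add_scan_artifacts(
    page: np.ndarray,
    rng: np.random.Generator,
    noise_std: float = 8.0,         # assumed noise level, in gray levels
    speck_fraction: float = 0.002,  # assumed share of pixels turned to specks
    shading: float = 0.15,          # assumed brightness loss at the right edge
) -> np.ndarray:
    """Degrade a grayscale page (2-D uint8 array, 0 = black, 255 = white)."""
    out = page.astype(np.float32)
    _, width = out.shape

    # Uneven illumination: darken the page linearly toward the right edge.
    gradient = np.linspace(1.0, 1.0 - shading, width, dtype=np.float32)
    out *= gradient[np.newaxis, :]

    # Sensor noise: zero-mean Gaussian noise on every pixel.
    out += rng.normal(0.0, noise_std, size=out.shape)

    # Dust and toner specks: set a small random fraction of pixels to black.
    specks = rng.random(out.shape) < speck_fraction
    out[specks] = 0.0

    return np.clip(out, 0, 255).astype(np.uint8)
```

Passing a seeded generator such as `np.random.default_rng(0)` keeps the degradation reproducible, which helps when comparing OCR output across runs.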
Citation
If you use this model, please cite:
@misc{cernis-legal-ocr,
title={Cernis-Legal-OCR: A Specialized Vision Model for Legal Document Processing},
author={Cernis AI},
year={2025},
howpublished={\url{https://huggingface.co/coolAI/cernis-legal-ocr}}
}
Acknowledgments
Built using Unsloth for efficient fine-tuning and trained on data from the Caselaw Access Project.