Cernis-Legal-OCR

A specialized vision-language model fine-tuned for legal document OCR, built on Qwen2.5-VL-7B-Instruct.

Model Description

Cernis-Legal-OCR is a LoRA fine-tune of Qwen2.5-VL-7B-Instruct, optimized for extracting text from legal documents such as court filings, case law, and contracts. The model was trained on 5,000 synthetic legal documents generated from the Caselaw Access Project dataset.

Key Features:

  • Handles dense legal text with complex formatting
  • Robust to scan artifacts, photocopying effects, and document degradation
  • Preserves legal document structure and formatting
  • Processes documents with stamps, margin notes, and annotations

Training Details

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Fine-Tuning Method: LoRA, using Unsloth
  • Training Data: 5,000 synthetic legal documents from Caselaw Access Project

Intended Use

This model is designed for:

  • Legal document digitization and OCR
  • Converting scanned case law into searchable text
  • Processing court filings and legal contracts
  • Legal research and document analysis workflows

How to Use

from unsloth import FastVisionModel
from PIL import Image

# Load the model and its processor (for vision models, the object returned
# as "tokenizer" handles both text and images)
model, tokenizer = FastVisionModel.from_pretrained(
    "coolAI/cernis-legal-ocr",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Prepare input
image = Image.open("legal_document.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract all text from this legal document."}
    ]
}]

# Render the chat prompt as text, then encode the image and prompt together
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(image, prompt, add_special_tokens=False, return_tensors="pt").to("cuda")

# Generate, then decode only the new tokens (the raw output also echoes the prompt)
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=True)
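
The snippet above handles a single page image. For a multi-page filing, one possible approach is to loop over the page images in order and join the results. The sketch below is only an illustration: `ocr_document` and its `ocr_page` callback are hypothetical names, not part of this model or of Unsloth, and the callback is assumed to wrap the generation code above and return the text for one image path.

```python
from pathlib import Path
from typing import Callable


def ocr_document(
    page_dir: str,
    ocr_page: Callable[[Path], str],
    pattern: str = "*.png",
    separator: str = "\n\n",
) -> str:
    """Run `ocr_page` on every matching image in `page_dir` and join the text.

    Hypothetical helper: `ocr_page` is any function that takes an image path
    and returns the extracted text for that page.
    """
    # Lexicographic sort: assumes zero-padded names such as page_001.png,
    # otherwise page_10 would sort before page_2.
    pages = sorted(Path(page_dir).glob(pattern))
    if not pages:
        raise FileNotFoundError(f"no files matching {pattern!r} in {page_dir}")
    return separator.join(ocr_page(path).strip() for path in pages)
```

Passing the per-page function in as an argument keeps the loop independent of how the model is loaded, so the same helper can be exercised with a stub function before running it on a GPU.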

Limitations

  • Trained on synthetic data; may not generalize to all real-world scanned documents
  • Limited to English legal documents
  • Performs best on documents formatted like US case law
  • May require additional fine-tuning for specific legal document types
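
Because of these limitations, it can be worth measuring accuracy on a small set of your own hand-transcribed pages before relying on the model. A common OCR metric is character error rate (CER): the edit distance between the model output and the reference transcription, divided by the reference length. The following is a minimal sketch of that metric using only the standard library; the function names are illustrative, and it is not an evaluation procedure published by the model authors.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit_distance(reference, hypothesis) / len(reference)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Example: one wrong character in a short reference line
cer = character_error_rate("The plaintiff appeals.", "The plaintifF appeals.")
```

A CER of 0.0 means the output matches the reference exactly. The value can exceed 1.0 when the output is much longer than the reference.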

Training Data

The model was trained on synthetic legal documents generated from the Caselaw Access Project dataset. Documents were rendered with realistic legal formatting and augmented with scan artifacts, photocopying effects, stamps, and margin notes to simulate real-world OCR conditions.
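
To give a sense of what this kind of augmentation can look like, here is a rough sketch using Pillow that applies a slight skew, blur, and noise to a clean page image. It is not the pipeline used to build the training set: the function name `simulate_scan`, the parameter values, and the choice of effects are all assumptions made for illustration, and stamps and margin notes are not covered.

```python
import random

from PIL import Image, ImageDraw, ImageFilter


def simulate_scan(
    page: Image.Image,
    seed: int = 0,
    max_skew_deg: float = 1.5,
    blur_radius: float = 0.8,
    noise_sigma: float = 25.0,
    noise_alpha: float = 0.15,
) -> Image.Image:
    """Return a degraded grayscale copy of `page` (illustrative values only)."""
    rng = random.Random(seed)  # seeds the skew angle; the noise is not seeded
    gray = page.convert("L")

    # A small rotation mimics a page fed crookedly into a scanner.
    angle = rng.uniform(-max_skew_deg, max_skew_deg)
    skewed = gray.rotate(angle, resample=Image.BICUBIC, fillcolor=255)

    # A mild blur approximates the softness of a photocopy.
    soft = skewed.filter(ImageFilter.GaussianBlur(blur_radius))

    # Blending in Gaussian noise adds grain and a light gray haze.
    noise = Image.effect_noise(soft.size, noise_sigma)
    return Image.blend(soft, noise, noise_alpha)


# Example: degrade a blank page carrying one line of text
page = Image.new("L", (400, 200), color=255)
ImageDraw.Draw(page).text((20, 20), "IN THE SUPREME COURT", fill=0)
degraded = simulate_scan(page, seed=42)
```

Running the model on pages degraded this way, alongside the clean originals, is one way to check how sensitive the output is to scan quality.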

Citation

If you use this model, please cite:

@misc{cernis-legal-ocr,
  title={Cernis-Legal-OCR: A Specialized Vision Model for Legal Document Processing},
  author={Cernis AI},
  year={2025},
  howpublished={\url{https://huggingface.co/coolAI/cernis-legal-ocr}}
}

Acknowledgments

Built using Unsloth for efficient fine-tuning and trained on data from the Caselaw Access Project.
