LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
Abstract
LightOnOCR-2-1B is a compact 1B-parameter vision-language model that performs end-to-end document image-to-text conversion, adding bounding-box localization via pretraining and RLVR and improving robustness through checkpoint averaging and task-arithmetic merging.
We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
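The abstract says localization is refined with RLVR using IoU-based rewards, but does not spell out the reward function. As a rough, non-authoritative sketch of what an IoU reward over the model's normalized bounding boxes could look like (the `[x1, y1, x2, y2]` coordinate convention and the function name are assumptions, not details from the paper):

```python
import torch


def iou_reward(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """IoU between matched predicted and reference boxes.

    Both tensors have shape (N, 4) holding normalized [x1, y1, x2, y2]
    coordinates in [0, 1]; row i of each tensor refers to the same image.
    Returns a reward in [0, 1] per box pair.
    """
    # Coordinates of the intersection rectangle.
    ix1 = torch.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    iy1 = torch.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    ix2 = torch.minimum(pred_boxes[:, 2], gt_boxes[:, 2])
    iy2 = torch.minimum(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    gt_area = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    union = pred_area + gt_area - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union.clamp(min=1e-6)
```

In an RLVR loop, predicted boxes would be parsed from a sampled transcription, matched to references, and scored with a reward of this kind, so that overlap with the ground-truth image regions is directly verifiable.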
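Checkpoint averaging and task-arithmetic merging are likewise standard operations (in the spirit of model soups and task vectors); the paper's exact coefficients and merging recipe are not given in the abstract. A minimal sketch over PyTorch state dicts, with hypothetical function names:

```python
import torch


def average_checkpoints(
    state_dicts: list[dict[str, torch.Tensor]],
) -> dict[str, torch.Tensor]:
    """Uniformly average several checkpoints of the same architecture."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }


def task_arithmetic_merge(
    base: dict[str, torch.Tensor],
    finetuned: dict[str, torch.Tensor],
    scale: float = 1.0,
) -> dict[str, torch.Tensor]:
    """Add a scaled task vector (finetuned - base) onto the base weights."""
    return {
        key: base[key].float() + scale * (finetuned[key].float() - base[key].float())
        for key in base
    }
```

The appeal of task arithmetic here is that the specialization delta can be rescaled when it is added back onto the base model, trading task gains against robustness without retraining.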
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- NVIDIA Nemotron Parse 1.1 (2025)
- Qwen3-VL Technical Report (2025)
- STEP3-VL-10B Technical Report (2026)
- dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model (2025)
- HunyuanOCR Technical Report (2025)
- DAVE: A VLM Vision Encoder for Document Understanding and Web Agents (2025)
- CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception (2025)