---
title: KOSMOS-2.5 Document AI Demo
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---

# KOSMOS-2.5 Document AI Demo

This Space demonstrates the capabilities of Microsoft's KOSMOS-2.5, a multimodal literate model for machine reading of text-intensive images.

## Features

🔥 Three powerful modes:

1. 📝 Markdown Generation: convert document images to clean markdown format
2. 🔍 OCR with Bounding Boxes: extract text with precise spatial coordinates and visualization
3. 💬 Document Q&A: ask questions about document content using KOSMOS-2.5 Chat
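To give a feel for mode 2, here is a minimal sketch of post-processing the model's raw OCR output into structured records. The tag format below (`<bbox><x_..><y_..><x_..><y_..></bbox>` followed by the line text) is an assumption for illustration; check the KOSMOS-2.5 model card for the exact output format.

```python
import re

# Assumed raw OCR output format: one text line per row, each prefixed with a
# bounding-box tag <bbox><x_X1><y_Y1><x_X2><y_Y2></bbox> (format hypothetical;
# consult the model card for the real token layout).
LINE_RE = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>(.*)")

def parse_ocr_output(raw: str) -> list[dict]:
    """Turn raw tagged output into a list of {'box': (x1, y1, x2, y2), 'text': ...} records."""
    records = []
    for line in raw.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            x1, y1, x2, y2 = (int(g) for g in m.groups()[:4])
            records.append({"box": (x1, y1, x2, y2), "text": m.group(5).strip()})
    return records

sample = "<bbox><x_10><y_20><x_110><y_40></bbox>Total: $42.00"
print(parse_ocr_output(sample))
```

Lines without a recognizable box tag are simply skipped, so the parser degrades gracefully if the model emits stray text.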

## What is KOSMOS-2.5?

KOSMOS-2.5 is Microsoft's document AI model for machine reading of text-rich images. It can:

- Generate spatially-aware text blocks with coordinates
- Produce structured markdown output that captures document styles
- Answer questions about document content through the chat variant
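Because vision-language models typically resize the input image before inference, box coordinates come back in the model's canvas space and must be rescaled to the original image before drawing. A minimal sketch (the 1024×1024 canvas size is an assumption for illustration):

```python
def rescale_box(box, model_size, image_size):
    """Map an (x1, y1, x2, y2) box from the model's resized canvas
    back to the original image's pixel coordinates."""
    mw, mh = model_size   # canvas the model saw (assumed, e.g. 1024x1024)
    iw, ih = image_size   # original image dimensions
    sx, sy = iw / mw, ih / mh
    x1, y1, x2, y2 = box
    return (round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy))

# A box on an assumed 1024x1024 model canvas, mapped onto an 800x600 photo
print(rescale_box((100, 200, 300, 400), (1024, 1024), (800, 600)))  # → (78, 117, 234, 234)
```

The rescaled tuples can then be passed straight to a drawing API (e.g. Pillow's `ImageDraw.rectangle`) for the visualization shown in the OCR mode.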

The model was pre-trained on 357.4 million text-rich document images and achieves performance comparable to much larger models (1.3B vs 7B parameters) on visual question answering benchmarks.

## Example Use Cases

- Receipts: extract itemized information or ask "What's the total amount?"
- Forms: convert to structured format or query specific fields
- Articles: get clean markdown or ask content-specific questions
- Screenshots: extract UI text or get information about on-screen elements

## Model Information

> **Note:** This is a generative model and may occasionally produce inaccurate results. Verify outputs before relying on them in critical applications.

## Citation

```bibtex
@article{lv2023kosmos,
  title={Kosmos-2.5: A multimodal literate model},
  author={Lv, Tengchao and Huang, Yupan and Chen, Jingye and Cui, Lei and Ma, Shuming and Chang, Yaoyao and Huang, Shaohan and Wang, Wenhui and Dong, Li and Luo, Weiyao and others},
  journal={arXiv preprint arXiv:2309.11419},
  year={2023}
}
```