---
title: KOSMOS-2.5 Document AI Demo
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---

# KOSMOS-2.5 Document AI Demo

This Space demonstrates the capabilities of Microsoft's **KOSMOS-2.5**, a multimodal literate model for machine reading of text-intensive images.

## Features

🔥 **Three powerful modes**:

1. **📝 Markdown Generation**: Convert document images to clean markdown format
2. **🔍 OCR with Bounding Boxes**: Extract text with precise spatial coordinates and visualization
3. **💬 Document Q&A**: Ask questions about document content using KOSMOS-2.5 Chat
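
As a rough illustration of how the three modes might map onto model prompts, here is a minimal sketch. The `<md>` and `<ocr>` task prefixes are taken from the public KOSMOS-2.5 model card and should be verified against it; `build_prompt`, `TASK_PROMPTS`, and `CHAT_TEMPLATE` are hypothetical names, and the chat template shown is only a placeholder, not the template the chat model actually expects.

```python
from typing import Optional

# Task prefixes assumed from the public KOSMOS-2.5 model card; verify before use.
TASK_PROMPTS = {
    "markdown": "<md>",
    "ocr": "<ocr>",
}

# Hypothetical placeholder. Replace it with the template documented
# on the microsoft/kosmos-2.5-chat model card.
CHAT_TEMPLATE = "USER: {question} ASSISTANT:"


def build_prompt(mode: str, question: Optional[str] = None) -> str:
    """Return the text prompt for one of the three demo modes."""
    if mode in TASK_PROMPTS:
        return TASK_PROMPTS[mode]
    if mode == "chat":
        if not question:
            raise ValueError("chat mode needs a non-empty question")
        return CHAT_TEMPLATE.format(question=question)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    print(build_prompt("markdown"))
    print(build_prompt("chat", "What's the total amount?"))
```

Keeping the mode-to-prompt mapping in one function means the UI code only has to pass a mode name and, for Q&A, the user's question.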

## What is KOSMOS-2.5?

KOSMOS-2.5 is a document AI model from Microsoft that excels at understanding text-rich images. It can:

- Generate spatially-aware text blocks with coordinates
- Produce structured markdown output that captures document styles
- Answer questions about document content through the chat variant
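
To make "text blocks with coordinates" concrete, the sketch below parses OCR-style output into structured records. It assumes each line of raw output has the form `<bbox><x_A><y_B><x_C><y_D></bbox>text`, which is how the model card's examples appear to be formatted; the real output may differ. `TextBox`, `parse_ocr_output`, and the sample string are hypothetical, and the scale factors stand in for mapping model coordinates back to the original image size.

```python
import re
from typing import List, NamedTuple


class TextBox(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


# Assumed line format: four coordinate tokens inside <bbox>, then the text.
BBOX_LINE = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>(.*)")


def parse_ocr_output(raw: str, scale_x: float = 1.0, scale_y: float = 1.0) -> List[TextBox]:
    """Turn tagged OCR text into boxes scaled to the original image."""
    boxes = []
    for line in raw.splitlines():
        match = BBOX_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not carry a bounding box
        x0, y0, x1, y1 = (int(value) for value in match.groups()[:4])
        boxes.append(
            TextBox(x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y,
                    match.group(5).strip())
        )
    return boxes


# Made-up sample data for illustration, not real model output.
sample = (
    "<bbox><x_10><y_20><x_110><y_40></bbox>TOTAL 12.50\n"
    "<bbox><x_10><y_50><x_90><y_70></bbox>CASH 20.00"
)
boxes = parse_ocr_output(sample, scale_x=2.0, scale_y=2.0)
for box in boxes:
    print(box)
```

The resulting tuples can be passed to a drawing routine such as Pillow's `ImageDraw.rectangle` to produce the box visualization.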

The model was pre-trained on 357.4 million text-rich document images. At 1.3B parameters, it achieves performance comparable to much larger 7B-parameter models on visual question answering benchmarks.

## Example Use Cases

- **Receipts**: Extract itemized information or ask "What's the total amount?"
- **Forms**: Convert to structured format or query specific fields
- **Articles**: Get clean markdown or ask content-specific questions
- **Screenshots**: Extract UI text or get information about elements

## Model Information

- **Base Model**: [microsoft/kosmos-2.5](https://huggingface.co/microsoft/kosmos-2.5)
- **Chat Model**: [microsoft/kosmos-2.5-chat](https://huggingface.co/microsoft/kosmos-2.5-chat)
- **Paper**: [Kosmos-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419)

## Note

This is a generative model and may occasionally produce inaccurate results. Please verify outputs for critical applications.

## Citation

```bibtex
@article{lv2023kosmos,
  title={Kosmos-2.5: A multimodal literate model},
  author={Lv, Tengchao and Huang, Yupan and Chen, Jingye and Cui, Lei and Ma, Shuming and Chang, Yaoyao and Huang, Shaohan and Wang, Wenhui and Dong, Li and Luo, Weiyao and others},
  journal={arXiv preprint arXiv:2309.11419},
  year={2023}
}
```