---
title: KOSMOS-2.5 Document AI Demo
emoji:
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---

# KOSMOS-2.5 Document AI Demo

This Space demonstrates the capabilities of Microsoft's **KOSMOS-2.5**, a multimodal literate model for machine reading of text-intensive images.

## Features

**Three modes**:

1. **Markdown Generation**: Convert document images to clean markdown format
2. **OCR with Bounding Boxes**: Extract text with precise spatial coordinates and visualization
3. **Document Q&A**: Ask questions about document content using KOSMOS-2.5 Chat

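
The three modes above differ mainly in the prompt sent to the model. The sketch below shows one way an app might map a mode name to a prompt. It assumes the `<md>` and `<ocr>` task prompts described on the base model card; the `build_prompt` helper, the mode names, and the `CHAT_TEMPLATE` string are hypothetical placeholders, not the code used in `app.py`, and the real chat template should be taken from the chat model card.

```python
from typing import Optional

# Task prompts for the base model (assumed from the microsoft/kosmos-2.5 model card).
TASK_PROMPTS = {"markdown": "<md>", "ocr": "<ocr>"}

# Hypothetical placeholder, NOT the official template: check the
# microsoft/kosmos-2.5-chat model card for the exact wording.
CHAT_TEMPLATE = "<md>USER: {question} ASSISTANT:"


def build_prompt(mode: str, question: Optional[str] = None) -> str:
    """Return the text prompt for one of the three demo modes."""
    if mode in TASK_PROMPTS:
        return TASK_PROMPTS[mode]
    if mode == "chat":
        # Q&A needs an actual question to fill the template.
        if not question or not question.strip():
            raise ValueError("chat mode needs a non-empty question")
        return CHAT_TEMPLATE.format(question=question.strip())
    raise ValueError(f"unknown mode: {mode!r}")
```

Keeping the prompts in one table means the UI only passes a mode name and, for Q&A, the user's question.
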
## What is KOSMOS-2.5?

KOSMOS-2.5 is a document AI model from Microsoft that excels at understanding text-rich images. It can:

- Generate spatially-aware text blocks with coordinates
- Produce structured markdown output that captures document styles
- Answer questions about document content through the chat variant

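
To make "text blocks with coordinates" concrete, here is a minimal sketch of parsing OCR-mode output into boxes. It assumes the generated text uses tags of the form `<bbox><x_N><y_N><x_N><y_N></bbox>` followed by the recognized text, as shown on the base model card. The `TextBox` type, the `parse_ocr_output` helper, and the scale parameters (used to map model coordinates back to the original image size) are illustrative names, not part of any library.

```python
import re
from typing import List, NamedTuple


class TextBox(NamedTuple):
    x0: int
    y0: int
    x1: int
    y1: int
    text: str


# Assumed tag layout: <bbox><x_..><y_..><x_..><y_..></bbox> followed by the text.
BBOX_RE = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>")


def parse_ocr_output(raw: str, scale_x: float = 1.0, scale_y: float = 1.0) -> List[TextBox]:
    """Split raw OCR output into boxes, rescaled to original image pixels."""
    matches = list(BBOX_RE.finditer(raw))
    boxes = []
    for i, m in enumerate(matches):
        # The text of a box runs until the next <bbox> tag (or the end of the string).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        text = raw[m.end():end].strip()
        x0, y0, x1, y1 = (int(g) for g in m.groups())
        if x0 >= x1 or y0 >= y1:
            continue  # skip degenerate boxes
        boxes.append(TextBox(int(x0 * scale_x), int(y0 * scale_y),
                             int(x1 * scale_x), int(y1 * scale_y), text))
    return boxes
```

The resulting boxes can be passed to a drawing routine (for example `PIL.ImageDraw.Draw(image).rectangle(...)`) to produce the visualization.
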
The model was pre-trained on 357.4 million text-rich document images. At 1.3B parameters, it achieves performance comparable to 7B-parameter models on visual question answering benchmarks.

## Example Use Cases

- **Receipts**: Extract itemized information or ask "What's the total amount?"
- **Forms**: Convert to structured format or query specific fields
- **Articles**: Get clean markdown or ask content-specific questions
- **Screenshots**: Extract UI text or get information about elements

## Model Information

- **Base Model**: [microsoft/kosmos-2.5](https://huggingface.co/microsoft/kosmos-2.5)
- **Chat Model**: [microsoft/kosmos-2.5-chat](https://huggingface.co/microsoft/kosmos-2.5-chat)
- **Paper**: [Kosmos-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419)

## Note

This is a generative model and may occasionally produce inaccurate results. Please verify outputs for critical applications.

## Citation

```bibtex
@article{lv2023kosmos,
  title={Kosmos-2.5: A multimodal literate model},
  author={Lv, Tengchao and Huang, Yupan and Chen, Jingye and Cui, Lei and Ma, Shuming and Chang, Yaoyao and Huang, Shaohan and Wang, Wenhui and Dong, Li and Luo, Weiyao and others},
  journal={arXiv preprint arXiv:2309.11419},
  year={2023}
}
```