kosmos-2.5-demo / README.md

nielsr HF Staff

Add KOSMOS-2.5 Document AI Demo

fe64308 10 days ago

	---
	title: KOSMOS-2.5 Document AI Demo
	emoji: 📄
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 4.44.0
	app_file: app.py
	pinned: false
	license: mit
	---

	# KOSMOS-2.5 Document AI Demo

	This Space demonstrates the capabilities of Microsoft's KOSMOS-2.5, a multimodal literate model for machine reading of text-intensive images.

	## Features

	🔥 Three powerful modes:

	1. 📝 Markdown Generation: Convert document images to clean markdown format
	2. 🔍 OCR with Bounding Boxes: Extract text with precise spatial coordinates and visualization
	3. 💬 Document Q&A: Ask questions about document content using KOSMOS-2.5 Chat
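
To make the OCR mode more concrete, here is a minimal sketch of turning raw OCR-mode output into structured boxes. It assumes the model emits each text line after a tag of the form `<bbox><x_N><y_N><x_N><y_N></bbox>`, as described on the model card. The names `TextBox` and `parse_ocr_output` are hypothetical helpers introduced for illustration and are not part of this Space's `app.py`.

```python
import re
from typing import List, NamedTuple


class TextBox(NamedTuple):
    """One recognized text line with its bounding box (hypothetical helper type)."""
    x0: int
    y0: int
    x1: int
    y1: int
    text: str


# Assumed tag format: <bbox><x_12><y_34><x_56><y_78></bbox> followed by the text line.
BBOX_PATTERN = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>")


def parse_ocr_output(generated: str, scale_x: float = 1.0, scale_y: float = 1.0) -> List[TextBox]:
    """Split OCR-mode output into boxes, optionally rescaling to the original image size."""
    matches = list(BBOX_PATTERN.finditer(generated))
    boxes = []
    for i, match in enumerate(matches):
        # The text for a box runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(generated)
        text = generated[match.end():end].strip()
        x0, y0, x1, y1 = (int(group) for group in match.groups())
        if x0 >= x1 or y0 >= y1:
            continue  # skip degenerate boxes with no area
        boxes.append(
            TextBox(int(x0 * scale_x), int(y0 * scale_y), int(x1 * scale_x), int(y1 * scale_y), text)
        )
    return boxes


if __name__ == "__main__":
    sample = "<bbox><x_10><y_20><x_110><y_40></bbox>Total: 12.50"
    for box in parse_ocr_output(sample):
        print(box)
```

The `scale_x` and `scale_y` arguments are there because the coordinates refer to the resized image the model sees, so they usually need to be mapped back to the original image before drawing.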

	## What is KOSMOS-2.5?

	KOSMOS-2.5 is a document AI model from Microsoft that excels at understanding text-rich images. It can:

	- Generate spatially-aware text blocks with coordinates
	- Produce structured markdown output that captures document styles
	- Answer questions about document content through the chat variant
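
The first two capabilities can be tried outside this Space with the `transformers` library. The following is a rough sketch, not the code this demo runs. It assumes a recent `transformers` release that ships `Kosmos2_5ForConditionalGeneration`, that the task prompts are `<md>` and `<ocr>`, and that the processor returns extra `height` and `width` entries, all as described on the model card. `run_kosmos` and `TASK_PROMPTS` are hypothetical names introduced for illustration.

```python
# Assumed task prompts, taken from the microsoft/kosmos-2.5 model card.
TASK_PROMPTS = {"markdown": "<md>", "ocr": "<ocr>"}


def run_kosmos(image_path: str, task: str = "markdown",
               repo: str = "microsoft/kosmos-2.5", max_new_tokens: int = 1024) -> str:
    """Run one generation pass and return the decoded text without the task prompt."""
    # Heavy dependencies are imported lazily so the module can be imported without them.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration

    prompt = TASK_PROMPTS[task]
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.bfloat16 if device == "cuda" else torch.float32

    processor = AutoProcessor.from_pretrained(repo)
    model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo).to(device=device, dtype=dtype)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    # Assumption: the processor reports the resized image size under these keys,
    # and they must be removed before calling generate().
    inputs.pop("height")
    inputs.pop("width")

    inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()}
    inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return text.replace(prompt, "", 1)
```

Calling `run_kosmos("receipt.png", task="ocr")` (with a hypothetical local file) would return text interleaved with bounding-box tags, while `task="markdown"` would return markdown. The chat variant uses a separate checkpoint and prompt template, which this sketch does not cover.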

	The model was pre-trained on 357.4 million text-rich document images. At 1.3B parameters, it achieves performance comparable to much larger 7B-parameter models on visual question answering benchmarks.

	## Example Use Cases

	- Receipts: Extract itemized information or ask "What's the total amount?"
	- Forms: Convert to structured format or query specific fields
	- Articles: Get clean markdown or ask content-specific questions
	- Screenshots: Extract UI text or get information about elements

	## Model Information

	- Base Model: [microsoft/kosmos-2.5](https://huggingface.co/microsoft/kosmos-2.5)
	- Chat Model: [microsoft/kosmos-2.5-chat](https://huggingface.co/microsoft/kosmos-2.5-chat)
	- Paper: [Kosmos-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419)

	## Note

	This is a generative model and may occasionally produce inaccurate results. Please verify outputs for critical applications.

	## Citation

	```bibtex
	@article{lv2023kosmos,
	title={Kosmos-2.5: A multimodal literate model},
	author={Lv, Tengchao and Huang, Yupan and Chen, Jingye and Cui, Lei and Ma, Shuming and Chang, Yaoyao and Huang, Shaohan and Wang, Wenhui and Dong, Li and Luo, Weiyao and others},
	journal={arXiv preprint arXiv:2309.11419},
	year={2023}
	}
	```