---
title: Markit_v2
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.14.0
app_file: app.py
build_script: build.sh
startup_script: setup.sh
pinned: false
hf_oauth: true
---

# Document to Markdown Converter with RAG Chat

**Author: Anse Min** | [🤗 Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2) | [GitHub](https://github.com/ansemin/Markit_v2) | [LinkedIn](https://www.linkedin.com/in/ansemin/)

A Hugging Face Space that converts PDFs, Office documents, images, and other formats to Markdown, then lets you chat with the converted documents through Retrieval-Augmented Generation (RAG).

## 🎥 Demo Video

<div align="center">
<a href="https://www.youtube.com/watch?v=PmXu3Si6hXo">
<img src="https://img.youtube.com/vi/PmXu3Si6hXo/maxresdefault.jpg" alt="Markit Demo Video" width="600">
</a>

**[▶️ Watch Full Demo (YouTube)](https://www.youtube.com/watch?v=PmXu3Si6hXo)**

*Complete walkthrough of Markit's flagship features including multi-document processing, RAG chat, and advanced retrieval strategies*
</div>

<details>
<summary><strong>Table of contents</strong></summary>

<!-- Begin ToC -->

- [Demo Video](#-demo-video)
- [Live Demos](#-live-demos)
- [System Overview](#-system-overview)
- [Environment Setup](#-environment-setup)
- [Local Development](#-local-development)
- [Technical Details](#-technical-details)

<!-- End ToC -->

</details>

## 🎬 Live Demos

### 1. Multi-Document Processing (Flagship Feature)
<div align="center">
<img src="GIF/Multi-Document Processing Showcase.gif" alt="Multi-Document Processing Demo" width="800">
</div>

**What it does:** Process up to 5 files simultaneously (20MB combined) with 4 intelligent processing types:
- **🔗 Combined**: Merge documents with smart duplicate removal
- **📑 Individual**: Separate sections per document with clear organization
- **📈 Summary**: Executive overview + detailed analysis of all documents
- **⚖️ Comparison**: Cross-document analysis with similarities/differences tables

**Why it matters:** Multi-document processing compares and contrasts information across files, handles mixed file types in a single batch, and recognizes relationships that span document boundaries.

<div align="center">
<img src="img/Multi-Document Processing Types (Flagship Feature).png" alt="Multi-Document Processing Types" width="700">

*Multi-document processing with 4 processing types*
</div>

### 2. Single Document Conversion Flow
<div align="center">
<img src="GIF/Single Document Conversion Flow.gif" alt="Single Document Conversion Demo" width="800">
</div>

**What it does:** Convert PDFs, Office documents, images, and more to Markdown using 5 powerful parsers:
- **Gemini Flash**: AI-powered understanding with high accuracy
- **Mistral OCR**: Fastest processing with document understanding
- **Docling**: Open source with advanced PDF table recognition  
- **GOT-OCR**: Mathematical/scientific documents to LaTeX
- **MarkItDown**: High accuracy for CSV/XML and broad format support

**Why it matters:** Tables are converted to Markdown tables instead of being flattened into plain text, as standard PDF text extraction does, which gives the RAG system better-structured context.

<div align="center">
<img src="img/Parser Selection Guide (User-Friendly).png" alt="Parser Selection Guide" width="700">

*Choose the right parser for your specific needs and document types*
</div>

### 3. RAG Chat System in Action
<div align="center">
<img src="GIF/RAG Chat System in Action.gif" alt="RAG Chat System Demo" width="800">
</div>

**What it does:** Chat with your converted documents using 4 advanced retrieval strategies:
- **🎯 Similarity**: Traditional semantic similarity using embeddings
- **🔀 MMR**: Diverse results with reduced redundancy
- **🔍 BM25**: Traditional keyword-based retrieval
- **🔗 Hybrid**: Combines semantic + keyword search (recommended)

**Why it matters:** Ask for markdown tables in chat responses (impossible with standard PDF RAG), get streaming responses with document context, and easily clear data directly from the interface.

<div align="center">
<img src="img/RAG Retrieval Strategies (Technical Highlight).png" alt="RAG Retrieval Strategies" width="700">

*Advanced RAG system with 4 retrieval strategies for optimal document search*
</div>
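
The MMR strategy listed above trades relevance to the query against redundancy with chunks already selected. The following is a minimal sketch of the standard greedy MMR selection in plain Python. The names `cosine` and `mmr`, the `lambda_mult` parameter, and the toy vectors are illustrative assumptions; Markit's own retriever may well delegate this to its vector store rather than implement it by hand.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values penalize
    documents that resemble ones already selected.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to any already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical 2-D embeddings: documents 0 and 1 are near-duplicates.
docs = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]
print(mmr([1.0, 1.0], docs, k=2, lambda_mult=0.5))
print(mmr([1.0, 1.0], docs, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5` the second pick skips the near-duplicate in favor of the more distinct document, while `lambda_mult=1.0` falls back to plain similarity ordering.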

### 4. Query Ranker Analysis
<div align="center">
<img src="GIF/Query Ranker Analysis.gif" alt="Query Ranker Demo" width="800">
</div>

**What it does:** Interactive document search with:
- **Real-time ranking** of document chunks with confidence scores
- **Method comparison** to test different retrieval strategies
- **Adjustable results** (1-10) with responsive slider control
- **Transparent scoring** with actual ChromaDB similarity scores

**Why it matters:** Provides complete transparency into how your RAG system finds and ranks information, helping you optimize retrieval strategies.

### 5. GOT-OCR LaTeX Processing
<div align="center">
<img src="GIF/GOT-OCR LaTeX Processing.gif" alt="GOT-OCR LaTeX Demo" width="800">
</div>

**What it does:** Advanced LaTeX processing for mathematical and scientific documents:
- **Native LaTeX output** with no LLM conversion for maximum accuracy
- **Mathpix rendering** using the same library as the official GOT-OCR demo
- **RAG-compatible chunking** that preserves LaTeX structures and mathematical tables
- **Professional display** with proper mathematical formatting

**Why it matters:** Perfect for research papers, scientific documents, and academic content with complex equations and structured data.

## 🎯 System Overview

<div align="center">
<img src="img/Overall%20System%20Workflow%20(Essential).png" alt="Overall System Workflow" width="600">

*Complete workflow from document upload to intelligent RAG chat interaction*
</div>

## 🔧 Environment Setup

### Required API Keys
```bash
GOOGLE_API_KEY=your_gemini_api_key_here    # For Gemini Flash parser and RAG chat
OPENAI_API_KEY=your_openai_api_key_here    # For embeddings and AI descriptions  
MISTRAL_API_KEY=your_mistral_api_key_here  # For Mistral OCR parser (optional)
```

### Key Configuration Options
```bash
DEBUG=true                        # Enable debug logging
MAX_FILE_SIZE=10485760           # 10MB per file limit
MAX_BATCH_FILES=5                # Maximum files for multi-document processing
MAX_BATCH_SIZE=20971520          # 20MB combined limit for batch processing
CHUNK_SIZE=1000                  # Document chunk size for Markdown content
RETRIEVAL_K=4                    # Number of documents to retrieve for RAG
```
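
To make the limits above concrete, here is a hedged sketch of how such settings could be read and applied to a batch upload. The helper names `load_settings` and `check_batch` are hypothetical, and treating the sample values as fallback defaults is an assumption; the application's real configuration code may differ.

```python
import os

# Defaults mirror the sample values shown above; whether the app itself
# falls back to these values is an assumption.
DEFAULTS = {
    "MAX_FILE_SIZE": 10 * 1024 * 1024,   # 10485760 bytes
    "MAX_BATCH_FILES": 5,
    "MAX_BATCH_SIZE": 20 * 1024 * 1024,  # 20971520 bytes
    "CHUNK_SIZE": 1000,
    "RETRIEVAL_K": 4,
}

def load_settings(environ=None):
    """Return integer settings, using DEFAULTS for unset or empty variables."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name, default in DEFAULTS.items():
        raw = str(environ.get(name, "")).strip()
        settings[name] = int(raw) if raw else default
    return settings

def check_batch(file_sizes, settings):
    """Return a list of problems; an empty list means the batch is acceptable."""
    problems = []
    if len(file_sizes) > settings["MAX_BATCH_FILES"]:
        problems.append(
            f"too many files: {len(file_sizes)} > {settings['MAX_BATCH_FILES']}"
        )
    for index, size in enumerate(file_sizes):
        if size > settings["MAX_FILE_SIZE"]:
            problems.append(f"file {index} is {size} bytes, over the per-file limit")
    if sum(file_sizes) > settings["MAX_BATCH_SIZE"]:
        problems.append(f"batch is {sum(file_sizes)} bytes, over the combined limit")
    return problems

settings = load_settings({})
# Three 9 MB files pass the per-file check but exceed the 20 MB combined limit.
print(check_batch([9 * 1024 * 1024] * 3, settings))
```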

## 🚀 Local Development

### Quick Start
```bash
# Clone repository
git clone https://github.com/ansemin/Markit_v2
cd Markit_v2

# Create environment file
cp .env.example .env
# Edit .env with your API keys

# Install dependencies
pip install -r requirements.txt

# Run application (choose one of the following entry points)
python app.py                    # Full environment setup (HF Spaces compatible)
python run_app.py               # Local development (faster startup)
python run_app.py --clear-data-and-run  # Testing with clean data
```

### Data Management
**Two ways to clear data:**
1. **UI Method**: Chat tab → "🗑️ Clear All Data" button (works both locally and in the HF Space)
2. **CLI Method**: `python run_app.py --clear-data-and-run`

**What gets cleared:** Vector store embeddings, chat history, and session data

## 🔍 Technical Details

### Retrieval Strategy Performance
| Method | Best For | Accuracy |
|--------|----------|----------|
| **🎯 Similarity** | General semantic questions | Good |
| **🔀 MMR** | Diverse perspectives | Good |
| **🔍 BM25** | Exact keyword searches | Medium |
| **🔗 Hybrid** | Most queries (recommended) | **Excellent** |
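
The table does not say how the Hybrid method merges semantic and keyword results. One common option is reciprocal rank fusion (RRF), sketched below as an illustration only: the function name, the constant `k=60`, and the chunk IDs are assumptions, and Markit may instead use weighted scores or a library-provided ensemble retriever.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=4):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    so chunks ranked highly by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]

# Hypothetical results from a semantic retriever and a BM25 retriever.
semantic_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
print(reciprocal_rank_fusion([semantic_hits, keyword_hits], top_n=3))
```

Rank-based fusion avoids having to put embedding similarities and BM25 scores on a common scale, which is why it is a frequent default for hybrid search.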

### Core Technologies
- **Parsers**: Gemini Flash, Mistral OCR, Docling, GOT-OCR, MarkItDown
- **RAG System**: OpenAI embeddings + ChromaDB vector store + Gemini 2.5 Flash
- **UI Framework**: Gradio with modular component architecture  
- **GPU Support**: ZeroGPU integration for HF Spaces

### Smart Content-Aware Chunking
- **Markdown chunking**: Preserves tables and code blocks
- **LaTeX chunking**: Preserves mathematical tables, environments, and structures
- **Automatic format detection**: Optimal chunking strategy per document type
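
As a rough illustration of Markdown chunking that keeps tables and code blocks intact, the sketch below splits text into blocks at blank lines, treats a fenced code block as a single unit, and then packs whole blocks into chunks. The function names and the character-based size limit are assumptions; this is not Markit's actual chunker.

```python
FENCE = "`" * 3  # Markdown code-fence marker

def split_blocks(markdown):
    """Split Markdown into blocks, keeping fenced code and tables in one piece."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            current.append(line)
        elif in_fence:
            current.append(line)  # blank lines inside a fence are preserved
        elif not line.strip():
            if current:  # a blank line outside a fence ends the block
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)  # consecutive table rows stay together
    if current:
        blocks.append("\n".join(current))
    return blocks

def chunk_markdown(markdown, chunk_size=1000):
    """Pack whole blocks into chunks of roughly chunk_size characters."""
    chunks, current, length = [], [], 0
    for block in split_blocks(markdown):
        if current and length + len(block) > chunk_size:
            chunks.append("\n\n".join(current))
            current, length = [], 0
        current.append(block)
        length += len(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = (
    "intro text\n\n"
    + FENCE + "python\nx = 1\n\ny = 2\n" + FENCE
    + "\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\noutro"
)
for chunk in chunk_markdown(doc, chunk_size=20):
    print(repr(chunk))
```

Because blocks are never split, a table or code block larger than `chunk_size` becomes one oversized chunk rather than being cut in the middle.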

## Credits

- [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
- [Docling](https://github.com/DS4SD/docling) by IBM Research
- [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) by StepFun
- [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) for LaTeX rendering
- [Gradio](https://gradio.app/) for the UI framework

---

**🚀 [Try it live on Hugging Face Spaces](https://huggingface.co/spaces/Ansemin101/Markit_v2)**