---
title: Markit_v2
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.14.0
app_file: app.py
build_script: build.sh
startup_script: setup.sh
pinned: false
hf_oauth: true
---
# Document to Markdown Converter with RAG Chat
**Author: Anse Min** | [🤗 Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2) | [GitHub](https://github.com/ansemin/Markit_v2) | [LinkedIn](https://www.linkedin.com/in/ansemin/)
A Hugging Face Space that converts documents in a range of formats to Markdown and lets you chat with them using RAG (Retrieval-Augmented Generation).
## 🎥 Demo Video
<div align="center">
<a href="https://www.youtube.com/watch?v=PmXu3Si6hXo">
<img src="https://img.youtube.com/vi/PmXu3Si6hXo/maxresdefault.jpg" alt="Markit Demo Video" width="600">
</a>
**[▶️ Watch Full Demo (YouTube)](https://www.youtube.com/watch?v=PmXu3Si6hXo)**
*Complete walkthrough of Markit's flagship features including multi-document processing, RAG chat, and advanced retrieval strategies*
</div>
<details>
<summary><strong>Table of contents</strong></summary>
<!-- Begin ToC -->
- [Demo Video](#-demo-video)
- [Live Demos](#-live-demos)
- [System Overview](#-system-overview)
- [Environment Setup](#-environment-setup)
- [Local Development](#-local-development)
- [Technical Details](#-technical-details)
<!-- End ToC -->
</details>
## 🎬 Live Demos
### 1. Multi-Document Processing (Flagship Feature)
<div align="center">
<img src="GIF/Multi-Document Processing Showcase.gif" alt="Multi-Document Processing Demo" width="800">
</div>
**What it does:** Processes up to 5 files at once (20MB combined) using one of 4 processing types:
- **Combined**: Merges documents and removes duplicate content
- **Individual**: Keeps a separate, clearly labeled section per document
- **Summary**: Executive overview plus detailed analysis of all documents
- **Comparison**: Cross-document analysis with tables of similarities and differences
**Why it matters:** Compares and contrasts information across files, handles mixed file types, and recognizes relationships that span document boundaries.
<div align="center">
<img src="img/Multi-Document Processing Types (Flagship Feature).png" alt="Multi-Document Processing Types" width="700">
*Multi-document processing with 4 processing types*
</div>
### 2. Single Document Conversion Flow
<div align="center">
<img src="GIF/Single Document Conversion Flow.gif" alt="Single Document Conversion Demo" width="800">
</div>
**What it does:** Converts PDFs, Office documents, images, and more to Markdown using one of 5 parsers:
- **Gemini Flash**: AI-powered understanding with high accuracy
- **Mistral OCR**: Fastest processing with document understanding
- **Docling**: Open source with advanced PDF table recognition
- **GOT-OCR**: Mathematical/scientific documents to LaTeX
- **MarkItDown**: High accuracy for CSV/XML and broad format support
**Why it matters:** Tables are preserved as Markdown tables, which gives the RAG system more structured context than plain PDF text extraction.
<div align="center">
<img src="img/Parser Selection Guide (User-Friendly).png" alt="Parser Selection Guide" width="700">
*Choose the right parser for your specific needs and document types*
</div>
### 3. RAG Chat System in Action
<div align="center">
<img src="GIF/RAG Chat System in Action.gif" alt="RAG Chat System Demo" width="800">
</div>
**What it does:** Chat with your converted documents using one of 4 retrieval strategies:
- **Similarity**: Semantic similarity search over embeddings
- **MMR** (Maximal Marginal Relevance): Diverse results with reduced redundancy
- **BM25**: Traditional keyword-based retrieval
- **Hybrid**: Combines semantic and keyword search (recommended)
**Why it matters:** Because converted tables survive as Markdown, you can ask for Markdown tables in chat responses (which plain PDF text extraction generally cannot support), get streaming responses grounded in document context, and clear stored data directly from the interface.
<div align="center">
<img src="img/RAG Retrieval Strategies (Technical Highlight).png" alt="RAG Retrieval Strategies" width="700">
*Advanced RAG system with 4 retrieval strategies for optimal document search*
</div>
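
How the Hybrid strategy merges its semantic and keyword result lists is not spelled out here. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as Markit's actual implementation. The function name, the chunk IDs, and the constant `k=60` are all illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    so chunks ranked highly by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from a semantic and a keyword retriever.
semantic = ["c2", "c1", "c4"]
keyword = ["c1", "c3", "c2"]
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # ['c1', 'c2', 'c3', 'c4']
```

Rank-based fusion avoids having to put embedding distances and keyword scores on a common scale, which is one reason it is a popular default for hybrid search.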
### 4. Query Ranker Analysis
<div align="center">
<img src="GIF/Query Ranker Analysis.gif" alt="Query Ranker Demo" width="800">
</div>
**What it does:** Interactive document search with:
- **Real-time ranking** of document chunks with confidence scores
- **Method comparison** to test different retrieval strategies
- **Adjustable results** (1-10) with responsive slider control
- **Transparent scoring** with actual ChromaDB similarity scores
**Why it matters:** Provides complete transparency into how your RAG system finds and ranks information, helping you optimize retrieval strategies.
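
For intuition about the keyword side of that comparison, here is a minimal, dependency-free sketch of Okapi BM25 scoring. It illustrates the general formula only. The whitespace tokenizer, the `k1` and `b` defaults, and the function name are assumptions, and Markit's own BM25 retriever may be configured differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: long documents are penalized via b.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "invoice total amount due",
    "meeting notes and agenda",
    "total revenue by quarter",
]
scores = bm25_scores("invoice total", docs)
# Indices of the top 2 documents, best first.
top_k = sorted(range(len(docs)), key=lambda i: -scores[i])[:2]
```

Unlike embedding similarity, a document that shares no terms with the query scores exactly zero, which is why BM25 suits exact keyword searches.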
### 5. GOT-OCR LaTeX Processing
<div align="center">
<img src="GIF/GOT-OCR LaTeX Processing.gif" alt="GOT-OCR LaTeX Demo" width="800">
</div>
**What it does:** Advanced LaTeX processing for mathematical and scientific documents:
- **Native LaTeX output** with no LLM conversion for maximum accuracy
- **Mathpix rendering** using the same library as the official GOT-OCR demo
- **RAG-compatible chunking** that preserves LaTeX structures and mathematical tables
- **Professional display** with proper mathematical formatting
**Why it matters:** Well suited to research papers, scientific documents, and academic content with complex equations and structured data.
## 🎯 System Overview
<div align="center">
<img src="img/Overall%20System%20Workflow%20(Essential).png" alt="Overall System Workflow" width="600">
*Complete workflow from document upload to intelligent RAG chat interaction*
</div>
## 🔧 Environment Setup
### Required API Keys
```bash
GOOGLE_API_KEY=your_gemini_api_key_here # For Gemini Flash parser and RAG chat
OPENAI_API_KEY=your_openai_api_key_here # For embeddings and AI descriptions
MISTRAL_API_KEY=your_mistral_api_key_here # For Mistral OCR parser (optional)
```
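
A minimal startup check for these keys might look like the sketch below. The helper name `missing_keys` is hypothetical and not part of Markit; the split into required and optional keys follows the comments above.

```python
import os

# Assumption: only the Mistral key is optional, as the comments above suggest.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "OPENAI_API_KEY")
OPTIONAL_KEYS = ("MISTRAL_API_KEY",)


def missing_keys(env=os.environ, names=REQUIRED_KEYS):
    """Return the names of keys that are unset or blank in `env`."""
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing required API keys: " + ", ".join(missing))
    for name in missing_keys(names=OPTIONAL_KEYS):
        print(f"Optional key {name} is not set; the related parser will be unavailable.")
```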
### Key Configuration Options
```bash
DEBUG=true # Enable debug logging
MAX_FILE_SIZE=10485760 # 10MB per file limit
MAX_BATCH_FILES=5 # Maximum files for multi-document processing
MAX_BATCH_SIZE=20971520 # 20MB combined limit for batch processing
CHUNK_SIZE=1000 # Document chunk size for Markdown content
RETRIEVAL_K=4 # Number of documents to retrieve for RAG
```
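
To show how the size limits above interact, the following sketch reads them from the environment and checks a batch of uploads against them. The function names `load_limits` and `validate_batch` are hypothetical, and the fallback defaults simply mirror the values listed above; Markit's own validation code may differ.

```python
import os


def load_limits(env=os.environ):
    """Read upload limits from the environment, falling back to the documented values."""
    return {
        "max_file_size": int(env.get("MAX_FILE_SIZE", 10 * 1024 * 1024)),    # 10MB
        "max_batch_files": int(env.get("MAX_BATCH_FILES", 5)),
        "max_batch_size": int(env.get("MAX_BATCH_SIZE", 20 * 1024 * 1024)),  # 20MB
    }


def validate_batch(file_sizes, limits):
    """Return a list of problems for a batch of file sizes in bytes (empty if valid)."""
    problems = []
    if len(file_sizes) > limits["max_batch_files"]:
        problems.append(f"too many files: {len(file_sizes)} > {limits['max_batch_files']}")
    for index, size in enumerate(file_sizes):
        if size > limits["max_file_size"]:
            problems.append(f"file {index} exceeds the per-file limit ({size} bytes)")
    total = sum(file_sizes)
    if total > limits["max_batch_size"]:
        problems.append(f"batch exceeds the combined limit ({total} bytes)")
    return problems


limits = load_limits({})  # empty mapping, so the documented defaults apply
mb = 1024 * 1024
print(validate_batch([3 * mb, 4 * mb], limits))  # []
print(validate_batch([5 * mb] * 5, limits))      # one problem: 25MB combined exceeds 20MB
```

Note that five files can each pass the 10MB per-file check and still fail the 20MB combined check, so both limits need to be enforced.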
## 🚀 Local Development
### Quick Start
```bash
# Clone repository
git clone https://github.com/ansemin/Markit_v2
cd Markit_v2
# Create environment file
cp .env.example .env
# Edit .env with your API keys
# Install dependencies
pip install -r requirements.txt
# Run application (pick one of the following)
python app.py # Full environment setup (HF Spaces compatible)
python run_app.py # Local development (faster startup)
python run_app.py --clear-data-and-run # Clear stored data first, then start (for testing)
```
### Data Management
**Two ways to clear data:**
1. **UI Method**: Chat tab → "🗑️ Clear All Data" button (works both locally and in the HF Space)
2. **CLI Method**: `python run_app.py --clear-data-and-run`
**What gets cleared:** Vector store embeddings, chat history, and session data
## 🔍 Technical Details
### Retrieval Strategy Performance
| Method | Best For | Accuracy |
|--------|----------|----------|
| **Similarity** | General semantic questions | Good |
| **MMR** | Diverse perspectives | Good |
| **BM25** | Exact keyword searches | Medium |
| **Hybrid** | Most queries (recommended) | **Excellent** |
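
To illustrate why MMR returns more diverse results than plain similarity search, here is a small pure-Python sketch of Maximal Marginal Relevance. It is a generic illustration, not Markit's code: the function names, the toy 2-D vectors, and the `lambda_mult` default are assumptions, while `k=4` mirrors the `RETRIEVAL_K` default.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Select up to k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.9, 0.12], [0.7, -0.7]]  # docs 0 and 1 are near-duplicates
print(mmr(query, docs, k=2, lambda_mult=1.0))  # [0, 1]  pure relevance
print(mmr(query, docs, k=2, lambda_mult=0.5))  # [0, 2]  the near-duplicate is skipped
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking; lowering it penalizes chunks that repeat what has already been selected.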
### Core Technologies
- **Parsers**: Gemini Flash, Mistral OCR, Docling, GOT-OCR, MarkItDown
- **RAG System**: OpenAI embeddings + ChromaDB vector store + Gemini 2.5 Flash
- **UI Framework**: Gradio with modular component architecture
- **GPU Support**: ZeroGPU integration for HF Spaces
### Smart Content-Aware Chunking
- **Markdown chunking**: Preserves tables and code blocks
- **LaTeX chunking**: Preserves mathematical tables, environments, and structures
- **Automatic format detection**: Optimal chunking strategy per document type
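
As a rough illustration of the Markdown case, the sketch below packs paragraphs into chunks of about `chunk_size` characters while never splitting a fenced code block or a table. It is a simplified stand-in under stated assumptions, not Markit's chunker: the function names are hypothetical, tables are detected only by a leading `|`, and a single oversized block is kept whole rather than split.

````python
def split_blocks(markdown):
    """Split Markdown into atomic blocks: code fences, tables, and paragraphs."""
    blocks, current, in_fence = [], [], False

    def flush():
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            if in_fence:
                current.append(line)  # closing fence ends the code block
                flush()
            else:
                flush()               # opening fence starts a new block
                current.append(line)
            in_fence = not in_fence
        elif in_fence:
            current.append(line)      # keep blank lines inside code intact
        elif not stripped:
            flush()                   # blank line ends a paragraph or table
        else:
            is_row = stripped.startswith("|")
            prev_is_row = bool(current) and current[-1].strip().startswith("|")
            if current and is_row != prev_is_row:
                flush()               # boundary between table and prose
            current.append(line)
    flush()
    return blocks


def chunk_markdown(markdown, chunk_size=1000):
    """Greedily pack blocks into chunks without ever splitting a block."""
    chunks, buffer = [], ""
    for block in split_blocks(markdown):
        candidate = f"{buffer}\n\n{block}" if buffer else block
        if buffer and len(candidate) > chunk_size:
            chunks.append(buffer)
            buffer = block
        else:
            buffer = candidate
    if buffer:
        chunks.append(buffer)
    return chunks


table = "| a | b |\n|---|---|\n| 1 | 2 |"
doc = f"Intro paragraph.\n\n{table}\n\nOutro."
chunks = chunk_markdown(doc, chunk_size=20)
````

With the deliberately tiny `chunk_size=20` in the usage lines, the table is longer than the limit but still comes back as a single chunk, which is the property that matters for RAG context.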
## Credits
- [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
- [Docling](https://github.com/DS4SD/docling) by IBM Research
- [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) by StepFun
- [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) for LaTeX rendering
- [Gradio](https://gradio.app/) for the UI framework
---
**[Try it live on Hugging Face Spaces](https://huggingface.co/spaces/Ansemin101/Markit_v2)**