Spaces:
Sleeping
Sleeping
title: Markit_v2 | |
emoji: π | |
colorFrom: blue | |
colorTo: indigo | |
sdk: gradio | |
sdk_version: 5.14.0 | |
app_file: app.py | |
build_script: build.sh | |
startup_script: setup.sh | |
pinned: false | |
hf_oauth: true | |
# Document to Markdown Converter with RAG Chat | |
**Author: Anse Min** | [π€ Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2) | [GitHub](https://github.com/ansemin/Markit_v2) | [LinkedIn](https://www.linkedin.com/in/ansemin/) | |
A powerful Hugging Face Space that converts various document formats to Markdown and enables intelligent chat with your documents using advanced RAG (Retrieval-Augmented Generation). | |
## π₯ Demo Video | |
<div align="center"> | |
<a href="https://www.youtube.com/watch?v=PmXu3Si6hXo"> | |
<img src="https://img.youtube.com/vi/PmXu3Si6hXo/maxresdefault.jpg" alt="Markit Demo Video" width="600"> | |
</a> | |
**[βΆοΈ Watch Full Demo (YouTube)](https://www.youtube.com/watch?v=PmXu3Si6hXo)** | |
*Complete walkthrough of Markit's flagship features including multi-document processing, RAG chat, and advanced retrieval strategies* | |
</div> | |
<details> | |
<summary><strong>Table of contents</strong></summary> | |
<!-- Begin ToC --> | |
- [Demo Video](#-demo-video) | |
- [Live Demos](#-live-demos) | |
- [System Overview](#-system-overview) | |
- [Environment Setup](#-environment-setup) | |
- [Local Development](#-local-development) | |
- [Technical Details](#-technical-details) | |
<!-- End ToC --> | |
</details> | |
## π¬ Live Demos | |
### 1. Multi-Document Processing (Flagship Feature) | |
<div align="center"> | |
<img src="GIF/Multi-Document Processing Showcase.gif" alt="Multi-Document Processing Demo" width="800"> | |
</div> | |
**What it does:** Process up to 5 files simultaneously (20MB combined) with 4 intelligent processing types: | |
- **π Combined**: Merge documents with smart duplicate removal | |
- **π Individual**: Separate sections per document with clear organization | |
- **π Summary**: Executive overview + detailed analysis of all documents | |
- **βοΈ Comparison**: Cross-document analysis with similarities/differences tables | |
**Why it matters:** Industry-leading multi-document processing that compares and contrasts information across different files, handles mixed file types seamlessly, and recognizes relationships across document boundaries. | |
<div align="center"> | |
<img src="img/Multi-Document Processing Types (Flagship Feature).png" alt="Multi-Document Processing Types" width="700"> | |
*Industry-leading multi-document processing with 4 intelligent processing types* | |
</div> | |
### 2. Single Document Conversion Flow | |
<div align="center"> | |
<img src="GIF/Single Document Conversion Flow.gif" alt="Single Document Conversion Demo" width="800"> | |
</div> | |
**What it does:** Convert PDFs, Office documents, images, and more to Markdown using 5 powerful parsers: | |
- **Gemini Flash**: AI-powered understanding with high accuracy | |
- **Mistral OCR**: Fastest processing with document understanding | |
- **Docling**: Open source with advanced PDF table recognition | |
- **GOT-OCR**: Mathematical/scientific documents to LaTeX | |
- **MarkItDown**: High accuracy for CSV/XML and broad format support | |
**Why it matters:** Perfect table preservation creates enhanced markdown tables for superior RAG context, unlike standard PDF text extraction. | |
<div align="center"> | |
<img src="img/Parser Selection Guide (User-Friendly).png" alt="Parser Selection Guide" width="700"> | |
*Choose the right parser for your specific needs and document types* | |
</div> | |
### 3. RAG Chat System in Action | |
<div align="center"> | |
<img src="GIF/RAG Chat System in Action.gif" alt="RAG Chat System Demo" width="800"> | |
</div> | |
**What it does:** Chat with your converted documents using 4 advanced retrieval strategies: | |
- **π― Similarity**: Traditional semantic similarity using embeddings | |
- **π MMR**: Diverse results with reduced redundancy | |
- **π BM25**: Traditional keyword-based retrieval | |
- **π Hybrid**: Combines semantic + keyword search (recommended) | |
**Why it matters:** Ask for markdown tables in chat responses (impossible with standard PDF RAG), get streaming responses with document context, and easily clear data directly from the interface. | |
<div align="center"> | |
<img src="img/RAG Retrieval Strategies (Technical Highlight).png" alt="RAG Retrieval Strategies" width="700"> | |
*Advanced RAG system with 4 retrieval strategies for optimal document search* | |
</div> | |
### 4. Query Ranker Analysis | |
<div align="center"> | |
<img src="GIF/Query Ranker Analysis.gif" alt="Query Ranker Demo" width="800"> | |
</div> | |
**What it does:** Interactive document search with: | |
- **Real-time ranking** of document chunks with confidence scores | |
- **Method comparison** to test different retrieval strategies | |
- **Adjustable results** (1-10) with responsive slider control | |
- **Transparent scoring** with actual ChromaDB similarity scores | |
**Why it matters:** Provides complete transparency into how your RAG system finds and ranks information, helping you optimize retrieval strategies. | |
### 5. GOT-OCR LaTeX Processing | |
<div align="center"> | |
<img src="GIF/GOT-OCR LaTeX Processing.gif" alt="GOT-OCR LaTeX Demo" width="800"> | |
</div> | |
**What it does:** Advanced LaTeX processing for mathematical and scientific documents: | |
- **Native LaTeX output** with no LLM conversion for maximum accuracy | |
- **Mathpix rendering** using the same library as official GOT-OCR demo | |
- **RAG-compatible chunking** that preserves LaTeX structures and mathematical tables | |
- **Professional display** with proper mathematical formatting | |
**Why it matters:** Perfect for research papers, scientific documents, and academic content with complex equations and structured data. | |
## π― System Overview | |
<div align="center"> | |
<img src="img/Overall%20System%20Workflow%20(Essential).png" alt="Overall System Workflow" width="600"> | |
*Complete workflow from document upload to intelligent RAG chat interaction* | |
</div> | |
## π§ Environment Setup | |
### Required API Keys | |
```bash | |
GOOGLE_API_KEY=your_gemini_api_key_here # For Gemini Flash parser and RAG chat | |
OPENAI_API_KEY=your_openai_api_key_here # For embeddings and AI descriptions | |
MISTRAL_API_KEY=your_mistral_api_key_here # For Mistral OCR parser (optional) | |
``` | |
### Key Configuration Options | |
```bash | |
DEBUG=true # Enable debug logging | |
MAX_FILE_SIZE=10485760 # 10MB per file limit | |
MAX_BATCH_FILES=5 # Maximum files for multi-document processing | |
MAX_BATCH_SIZE=20971520 # 20MB combined limit for batch processing | |
CHUNK_SIZE=1000 # Document chunk size for Markdown content | |
RETRIEVAL_K=4 # Number of documents to retrieve for RAG | |
``` | |
## π Local Development | |
### Quick Start | |
```bash | |
# Clone repository | |
git clone https://github.com/ansemin/Markit_v2 | |
cd Markit_v2 | |
# Create environment file | |
cp .env.example .env | |
# Edit .env with your API keys | |
# Install dependencies | |
pip install -r requirements.txt | |
# Run application | |
python app.py # Full environment setup (HF Spaces compatible) | |
python run_app.py # Local development (faster startup) | |
python run_app.py --clear-data-and-run # Testing with clean data | |
``` | |
### Data Management | |
**Two ways to clear data:** | |
1. **UI Method**: Chat tab β "ποΈ Clear All Data" button (works in both local and HF Space) | |
2. **CLI Method**: `python run_app.py --clear-data-and-run` | |
**What gets cleared:** Vector store embeddings, chat history, and session data | |
## π Technical Details | |
### Retrieval Strategy Performance | |
| Method | Best For | Accuracy | | |
|--------|----------|----------| | |
| **π― Similarity** | General semantic questions | Good | | |
| **π MMR** | Diverse perspectives | Good | | |
| **π BM25** | Exact keyword searches | Medium | | |
| **π Hybrid** | Most queries (recommended) | **Excellent** | | |
### Core Technologies | |
- **Parsers**: Gemini Flash, Mistral OCR, Docling, GOT-OCR, MarkItDown | |
- **RAG System**: OpenAI embeddings + ChromaDB vector store + Gemini 2.5 Flash | |
- **UI Framework**: Gradio with modular component architecture | |
- **GPU Support**: ZeroGPU integration for HF Spaces | |
### Smart Content-Aware Chunking | |
- **Markdown chunking**: Preserves tables and code blocks | |
- **LaTeX chunking**: Preserves mathematical tables, environments, and structures | |
- **Automatic format detection**: Optimal chunking strategy per document type | |
## Credits | |
- [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft | |
- [Docling](https://github.com/DS4SD/docling) by IBM Research | |
- [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) by StepFun | |
- [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) for LaTeX rendering | |
- [Gradio](https://gradio.app/) for the UI framework | |
--- | |
**π [Try it live on Hugging Face Spaces](https://huggingface.co/spaces/Ansemin101/Markit_v2)** |