---
title: Document to Markdown Converter
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: true
license: mit
python_version: 3.11
hardware: cpu-basic
tags:
  - document-processing
  - markdown
  - pdf-converter
  - text-extraction
short_description: Convert PDF and DOCX documents to Markdown format
---

# 📄 Document to Markdown Converter

Convert PDF and DOCX documents to Markdown format with intelligent structure analysis.

## Features

### 📄 Supported Formats

- **PDF** - Extract text with formatting preservation
- **Word Documents** (.docx) - Full formatting and structure conversion

### 🧠 Smart Processing

- **Heading Detection** - Automatically detect headings based on styles and formatting
- **Table Extraction** - Convert tables to Markdown format
- **List Processing** - Preserve ordered and unordered lists
- **Inline Formatting** - Maintain bold, italic, and other text formatting
- **Structure Analysis** - Detailed document structure statistics

### ⚡ Key Capabilities

- **Font-based Heading Detection** - Uses font size and styling to identify headings
- **Style Recognition** - Recognizes Word document styles (Title, Heading 1-6)
- **Table Conversion** - Converts complex tables to Markdown table format
- **List Recognition** - Identifies and converts various list formats
- **Text Formatting** - Preserves bold, italic formatting in Markdown syntax

## Usage

### Basic Processing

1. Upload a PDF or DOCX file
2. Click "Convert to Markdown"
3. View the converted Markdown in the output tab

### Options

- **Structure Analysis**: Enable to see detailed document statistics
- **Preview Mode**: Show only the first 500 characters for quick preview

### Output Tabs

- **Markdown Output**: The complete converted Markdown text
- **Structure Analysis**: Statistics about headings, lists, tables, etc.
- **File Information**: Basic file details (name, type, size)

## Technical Details

### PDF Processing

- Uses PyMuPDF (fitz) for text extraction
- Analyzes font sizes to determine heading hierarchy
- Preserves text formatting flags (bold, italic)
- Processes text blocks while maintaining structure

### DOCX Processing

- Uses python-docx for document parsing
- Recognizes built-in Word styles
- Extracts tables with proper formatting
- Maintains paragraph-level formatting

### Structure Analysis

The application analyzes:

- **Headings**: Count by level (H1-H6)
- **Lists**: Ordered vs unordered list items
- **Tables**: Number of tables detected
- **Paragraphs**: Regular text paragraphs
- **Formatting**: Bold and italic text occurrences
- **Statistics**: Word count, character count, total lines

## Installation

### Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter
cd document-to-markdown-converter

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

### Dependencies

- `gradio>=4.0.0` - Web interface framework
- `python-docx>=1.1.0` - Word document processing
- `PyMuPDF>=1.23.0` - PDF processing library

## API

### Core Function

```python
from typing import Any, Dict


def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
    """
    Extract document content and convert to Markdown format

    Args:
        file_path: Path to PDF or DOCX file

    Returns:
        Dictionary containing:
        - success: Boolean indicating success
        - markdown: Converted Markdown content
        - structure: Document structure analysis
        - file_info: File metadata (name, type, size)
        - preview: Short preview of content
        - error: Error message if processing failed
    """
```

### Structure Analysis Output

```json
{
  "headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0},
  "lists": {"ordered": 3, "unordered": 7},
  "tables": 2,
  "paragraphs": 45,
  "bold_text": 12,
  "italic_text": 8,
  "total_lines": 120,
  "word_count": 2500,
  "character_count": 15000
}
```

## Examples

### Converting a PDF

1. Upload a PDF file
2. The application will:
   - Extract text from each page
   - Detect headings based on font size
   - Preserve bold/italic formatting
   - Convert to clean Markdown

### Converting a DOCX

1. Upload a Word document
2. The application will:
   - Parse document styles
   - Convert headings based on style names
   - Extract and format tables
   - Maintain list structures

## Limitations

- **OCR**: Does not perform OCR on image-based PDFs
- **Complex Layouts**: May not perfectly preserve complex document layouts
- **Images**: Does not extract or convert embedded images
- **Fonts**: Limited font analysis for PDFs

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## License

MIT License - see LICENSE file for details.

## Support

For issues and feature requests, please use the Community tab or create an issue on GitHub.

---

*Built with ❤️ using Gradio, python-docx, and PyMuPDF*
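
## Appendix: Structure Analysis Sketch

The statistics listed under Structure Analysis can be approximated directly from the converted Markdown. The following is a minimal sketch, not the code in `app.py`: the function name `analyze_markdown_structure` and the regex heuristics are assumptions made for illustration, and it only uses the Python standard library. It returns a dictionary with the same keys as the Structure Analysis Output example.

```python
import re
from typing import Any, Dict


def analyze_markdown_structure(markdown: str) -> Dict[str, Any]:
    """Count structural elements in Markdown text (hypothetical helper)."""
    headings = {f"h{level}": 0 for level in range(1, 7)}
    lists = {"ordered": 0, "unordered": 0}
    tables = 0
    paragraphs = 0
    in_table = False

    lines = markdown.splitlines()
    for line in lines:
        stripped = line.strip()

        # A run of consecutive "| ... |" rows is counted as one table.
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            if not in_table:
                tables += 1
                in_table = True
            continue
        in_table = False

        if not stripped:
            continue

        heading = re.match(r"^(#{1,6})\s+\S", stripped)
        if heading:
            headings[f"h{len(heading.group(1))}"] += 1
        elif re.match(r"^\d+[.)]\s+", stripped):
            lists["ordered"] += 1
        elif re.match(r"^[-*+]\s+", stripped):
            lists["unordered"] += 1
        else:
            # Assumption: every other non-blank line counts as a paragraph.
            paragraphs += 1

    # Count **bold** spans first, then remove them so they are not
    # counted a second time as *italic* spans.
    bold_pattern = r"\*\*[^*\n]+\*\*"
    bold_count = len(re.findall(bold_pattern, markdown))
    without_bold = re.sub(bold_pattern, "", markdown)
    italic_count = len(re.findall(r"(?<!\*)\*[^*\s][^*\n]*\*(?!\*)", without_bold))

    return {
        "headings": headings,
        "lists": lists,
        "tables": tables,
        "paragraphs": paragraphs,
        "bold_text": bold_count,
        "italic_text": italic_count,
        "total_lines": len(lines),
        "word_count": len(markdown.split()),
        "character_count": len(markdown),
    }
```

Working on the Markdown output rather than on the source file keeps one analysis path for both PDF and DOCX input, at the cost of missing anything the converter dropped.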