metadata
title: Document to Markdown Converter
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: true
license: mit
python_version: 3.11
hardware: cpu-basic
tags:
- document-processing
- markdown
- pdf-converter
- text-extraction
short_description: Convert PDF and DOCX documents to Markdown format
Document to Markdown Converter
Convert PDF and DOCX documents to Markdown format with intelligent structure analysis.
Features
Supported Formats
- PDF - Extract text with formatting preservation
- Word Documents (.docx) - Full formatting and structure conversion
🧠 Smart Processing
- Heading Detection - Automatically detect headings based on styles and formatting
- Table Extraction - Convert tables to Markdown format
- List Processing - Preserve ordered and unordered lists
- Inline Formatting - Maintain bold, italic, and other text formatting
- Structure Analysis - Detailed document structure statistics
⚡ Key Capabilities
- Font-based Heading Detection - Uses font size and styling to identify headings
- Style Recognition - Recognizes Word document styles (Title, Heading 1-6)
- Table Conversion - Converts complex tables to Markdown table format
- List Recognition - Identifies and converts various list formats
- Text Formatting - Preserves bold, italic formatting in Markdown syntax
Usage
Basic Processing
- Upload a PDF or DOCX file
- Click "Convert to Markdown"
- View the converted Markdown in the output tab
Options
- Structure Analysis: Enable to see detailed document statistics
- Preview Mode: Show only the first 500 characters of the output for a quick preview
Output Tabs
- Markdown Output: The complete converted Markdown text
- Structure Analysis: Statistics about headings, lists, tables, etc.
- File Information: Basic file details (name, type, size)
Technical Details
PDF Processing
- Uses PyMuPDF (fitz) for text extraction
- Analyzes font sizes to determine heading hierarchy
- Preserves text formatting flags (bold, italic)
- Processes text blocks while maintaining structure
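The font-size heuristic described above can be sketched roughly as follows. This is not the application's actual code: it assumes the spans have already been collected from PyMuPDF's `page.get_text("dict")` output (each span is a dictionary with `text`, `size`, and `flags` keys), and the function names `heading_levels` and `span_to_markdown` are hypothetical. The most common font size, weighted by text length, is treated as body text, and each larger size becomes a heading level.

```python
from collections import Counter
from typing import Dict, List

# PyMuPDF span flag bits: bit 1 marks italic, bit 4 marks bold.
ITALIC_FLAG = 1 << 1
BOLD_FLAG = 1 << 4


def heading_levels(spans: List[dict], max_levels: int = 6) -> Dict[float, int]:
    """Map each font size larger than the body size to a heading level (1 = largest)."""
    counts: Counter = Counter()
    for span in spans:
        # Weight by text length so the body font dominates the count.
        counts[round(span["size"], 1)] += len(span["text"].strip())
    if not counts:
        return {}
    body_size = counts.most_common(1)[0][0]
    larger = sorted((size for size in counts if size > body_size), reverse=True)
    return {size: level for level, size in enumerate(larger[:max_levels], start=1)}


def span_to_markdown(span: dict, levels: Dict[float, int]) -> str:
    """Render one span as a heading line or as inline-formatted text."""
    text = span["text"].strip()
    if not text:
        return ""
    level = levels.get(round(span["size"], 1))
    if level is not None:
        return "#" * level + " " + text
    if span["flags"] & BOLD_FLAG:
        text = f"**{text}**"
    if span["flags"] & ITALIC_FLAG:
        text = f"*{text}*"
    return text


# Hand-written spans shaped like PyMuPDF's output, for illustration only.
spans = [
    {"text": "Report", "size": 24.0, "flags": 16},
    {"text": "Overview", "size": 18.0, "flags": 16},
    {"text": "This is the body text of the report.", "size": 11.0, "flags": 0},
    {"text": "important", "size": 11.0, "flags": 16},
]
levels = heading_levels(spans)
for span in spans:
    print(span_to_markdown(span, levels))
```

Ranking sizes relative to the body font keeps the sketch independent of absolute point sizes, but a document that uses a large font for non-heading text (pull quotes, cover pages) would need extra rules.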
DOCX Processing
- Uses python-docx for document parsing
- Recognizes built-in Word styles
- Extracts tables with proper formatting
- Maintains paragraph-level formatting
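As an illustration of style-based conversion, the sketch below maps Word style names to Markdown prefixes. It works on plain strings, so it assumes the caller has already read each paragraph's `style.name` and `text` with python-docx. The helper names are hypothetical, and mapping both Title and Heading 1 to a level-1 heading is an assumption, not necessarily what the application does.

```python
import re
from typing import Optional

# Matches the built-in Word heading styles "Heading 1" through "Heading 6".
_HEADING_STYLE = re.compile(r"^Heading ([1-6])$")


def heading_level_for_style(style_name: str) -> Optional[int]:
    """Return a Markdown heading level for a Word style name, or None for other styles."""
    if style_name == "Title":
        return 1  # Assumption: the document title becomes a level-1 heading.
    match = _HEADING_STYLE.match(style_name)
    if match:
        return int(match.group(1))
    return None


def paragraph_to_markdown(style_name: str, text: str) -> str:
    """Convert one paragraph to a Markdown line based on its style name."""
    text = text.strip()
    if not text:
        return ""
    level = heading_level_for_style(style_name)
    if level is not None:
        return "#" * level + " " + text
    if style_name.startswith("List Bullet"):
        return "- " + text
    if style_name.startswith("List Number"):
        return "1. " + text
    return text


print(paragraph_to_markdown("Heading 2", "Introduction"))
print(paragraph_to_markdown("List Bullet", "First point"))
```

Matching on style names only works for documents that use Word's built-in styles; text that is merely enlarged or bolded by hand would pass through as a regular paragraph in this sketch.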
Structure Analysis
The application analyzes:
- Headings: Count by level (H1-H6)
- Lists: Ordered vs unordered list items
- Tables: Number of tables detected
- Paragraphs: Regular text paragraphs
- Formatting: Bold and italic text occurrences
- Statistics: Word count, character count, total lines
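One way to compute statistics like these is to scan the converted Markdown line by line. The sketch below is an assumption about how such an analysis could work, not the application's implementation; `analyze_markdown_structure` is a hypothetical name, and the regular expressions cover only the common Markdown forms.

```python
import re
from typing import Any, Dict


def analyze_markdown_structure(markdown: str) -> Dict[str, Any]:
    """Count headings, list items, tables, paragraphs, and inline formatting."""
    lines = markdown.splitlines()
    headings = {f"h{n}": 0 for n in range(1, 7)}
    lists = {"ordered": 0, "unordered": 0}
    tables = 0
    paragraphs = 0
    in_table = False
    for line in lines:
        stripped = line.strip()
        is_table_row = len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|")
        if is_table_row and not in_table:
            tables += 1  # A run of consecutive pipe rows counts as one table.
        in_table = is_table_row
        if not stripped or is_table_row:
            continue
        heading = re.match(r"^(#{1,6})\s+\S", stripped)
        if heading:
            headings[f"h{len(heading.group(1))}"] += 1
        elif re.match(r"^\d+[.)]\s+", stripped):
            lists["ordered"] += 1
        elif re.match(r"^[-*+]\s+", stripped):
            lists["unordered"] += 1
        else:
            paragraphs += 1
    bold = len(re.findall(r"\*\*[^*\n]+\*\*", markdown))
    # Italic: a single asterisk pair that is not part of a bold marker or a list bullet.
    italic = len(re.findall(r"(?<!\*)\*(?![\s*])[^*\n]+\*(?!\*)", markdown))
    return {
        "headings": headings,
        "lists": lists,
        "tables": tables,
        "paragraphs": paragraphs,
        "bold_text": bold,
        "italic_text": italic,
        "total_lines": len(lines),
        "word_count": len(markdown.split()),  # Whitespace-separated tokens, markup included.
        "character_count": len(markdown),
    }


sample = "\n".join([
    "# Title",
    "",
    "Some **bold** and *italic* text.",
    "",
    "## Section",
    "- one",
    "- two",
    "1. first",
    "",
    "| a | b |",
    "| --- | --- |",
    "| 1 | 2 |",
])
stats = analyze_markdown_structure(sample)
print(stats["headings"], stats["lists"], stats["tables"])
```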
Installation
Local Development
# Clone the repository
git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter
cd document-to-markdown-converter
# Install dependencies
pip install -r requirements.txt
# Run the application
python app.py
Dependencies
- gradio>=4.0.0 - Web interface framework
- python-docx>=1.1.0 - Word document processing
- PyMuPDF>=1.23.0 - PDF processing library
API
Core Function
from typing import Any, Dict

def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
    """
    Extract document content and convert it to Markdown format.

    Args:
        file_path: Path to a PDF or DOCX file

    Returns:
        Dictionary containing:
        - success: Boolean indicating success
        - markdown: Converted Markdown content
        - structure: Document structure analysis
        - file_info: File metadata (name, type, size)
        - preview: Short preview of content
        - error: Error message if processing failed
    """
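A caller would typically check `success` before reading `markdown`. The sketch below shows one way to do that with a hypothetical helper, `markdown_or_raise`. The two dictionaries are hand-written to match the documented keys, and the error message is made up for illustration; the real function may report errors differently.

```python
from typing import Any, Dict


def markdown_or_raise(result: Dict[str, Any]) -> str:
    """Return the Markdown from a result dictionary, or raise if conversion failed."""
    if not result.get("success"):
        raise ValueError(result.get("error") or "Unknown conversion error")
    return result.get("markdown", "")


# Hand-written dictionaries shaped like the documented return value.
ok = {"success": True, "markdown": "# Title\n\nBody text.", "error": None}
failed = {"success": False, "markdown": "", "error": "Unsupported file type"}

print(markdown_or_raise(ok))
try:
    markdown_or_raise(failed)
except ValueError as exc:
    print(f"Conversion failed: {exc}")
```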
Structure Analysis Output
{
  "headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0},
  "lists": {"ordered": 3, "unordered": 7},
  "tables": 2,
  "paragraphs": 45,
  "bold_text": 12,
  "italic_text": 8,
  "total_lines": 120,
  "word_count": 2500,
  "character_count": 15000
}
Examples
Converting a PDF
- Upload a PDF file
- The application will:
  - Extract text from each page
  - Detect headings based on font size
  - Preserve bold/italic formatting
  - Convert to clean Markdown
Converting a DOCX
- Upload a Word document
- The application will:
  - Parse document styles
  - Convert headings based on style names
  - Extract and format tables
  - Maintain list structures
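The table step can be pictured as turning a grid of cell strings into a Markdown pipe table. The sketch below assumes the cell text has already been read from the document (for example, from the rows and cells of a python-docx table) and treats the first row as the header. `rows_to_markdown_table` is a hypothetical name, not a function from this project.

```python
from typing import List, Optional


def rows_to_markdown_table(rows: List[List[Optional[str]]]) -> str:
    """Render rows of cell text as a Markdown pipe table; the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        # Cells must stay on one line, and a literal pipe would split the cell.
        text = (cell or "").replace("\n", " ").strip()
        return text.replace("|", "\\|")

    def render(row: List[Optional[str]]) -> str:
        # Pad short rows so every line has the same number of columns.
        padded = list(row) + [None] * (width - len(row))
        return "| " + " | ".join(clean(cell) for cell in padded) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in rows[1:])
    return "\n".join(lines)


print(rows_to_markdown_table([["File", "Size"], ["report.pdf", "120 KB"]]))
```

Merged cells and nested tables have no direct Markdown equivalent, so a real converter has to decide how to flatten them; this sketch simply pads missing cells with empty strings.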
Limitations
- OCR: Does not perform OCR on image-based PDFs
- Complex Layouts: May not perfectly preserve complex document layouts
- Images: Does not extract or convert embedded images
- Fonts: Limited font analysis for PDFs
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
License
MIT License - see LICENSE file for details.
Support
For issues and feature requests, please use the Community tab or create an issue on GitHub.
Built with ❤️ using Gradio, python-docx, and PyMuPDF