---
title: Document to Markdown Converter
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: true
license: mit
python_version: 3.11
hardware: cpu-basic
tags:
- document-processing
- markdown
- pdf-converter
- text-extraction
short_description: Convert PDF and DOCX documents to Markdown format
---
# 📄 Document to Markdown Converter
Convert PDF and DOCX documents to Markdown format with intelligent structure analysis.
## Features
### 📄 Supported Formats
- **PDF** - Extract text with formatting preservation
- **Word Documents** (.docx) - Full formatting and structure conversion
### 🧠 Smart Processing
- **Heading Detection** - Automatically detect headings based on styles and formatting
- **Table Extraction** - Convert tables to Markdown format
- **List Processing** - Preserve ordered and unordered lists
- **Inline Formatting** - Maintain bold, italic, and other text formatting
- **Structure Analysis** - Detailed document structure statistics
### ⚡ Key Capabilities
- **Font-based Heading Detection** - Uses font size and styling to identify headings
- **Style Recognition** - Recognizes Word document styles (Title, Heading 1-6)
- **Table Conversion** - Converts complex tables to Markdown table format
- **List Recognition** - Identifies and converts various list formats
- **Text Formatting** - Preserves bold, italic formatting in Markdown syntax
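As a rough illustration of the table conversion listed above, the sketch below renders rows of cell text as a Markdown table. The function name `rows_to_markdown_table` is hypothetical, and treating the first row as the header is an assumption; the application's own handling of merged or nested cells may differ.

```python
from typing import List


def rows_to_markdown_table(rows: List[List[str]]) -> str:
    """Render rows of cell text as a Markdown table (first row = header)."""
    if not rows:
        return ""
    # Pad short rows so every line has the same number of columns.
    width = max(len(row) for row in rows)

    def clean(cell: str) -> str:
        # Collapse whitespace (newlines break a row) and escape pipes.
        return " ".join(str(cell).split()).replace("|", "\\|")

    def render(row: List[str]) -> str:
        padded = list(row) + [""] * (width - len(row))
        return "| " + " | ".join(clean(c) for c in padded) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in rows[1:])
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_markdown_table([["Name", "Qty"], ["Apple", "3"], ["Pear"]]))
```

With python-docx, the input rows could be built from `table.rows` and each cell's `text` attribute.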
## Usage
### Basic Processing
1. Upload a PDF or DOCX file
2. Click "Convert to Markdown"
3. View the converted Markdown in the output tab
### Options
- **Structure Analysis**: Enable to see detailed document statistics
- **Preview Mode**: Show only the first 500 characters for quick preview
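Preview Mode can be pictured as a simple truncation. This is a minimal sketch: `make_preview` is a hypothetical name, and the trailing ellipsis is an assumption rather than documented behavior.

```python
def make_preview(markdown: str, limit: int = 500) -> str:
    """Return at most `limit` characters of the converted Markdown."""
    if len(markdown) <= limit:
        return markdown
    # Assumption: mark truncated output with an ellipsis.
    return markdown[:limit].rstrip() + "..."
```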
### Output Tabs
- **Markdown Output**: The complete converted Markdown text
- **Structure Analysis**: Statistics about headings, lists, tables, etc.
- **File Information**: Basic file details (name, type, size)
## Technical Details
### PDF Processing
- Uses PyMuPDF (fitz) for text extraction
- Analyzes font sizes to determine heading hierarchy
- Preserves text formatting flags (bold, italic)
- Processes text blocks while maintaining structure
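One way to turn font sizes into a heading hierarchy is sketched below. It assumes the most common font size (weighted by text length) is body text and that larger sizes map to H1, H2, and so on in descending order; the application's actual thresholds are not documented here. The span dictionaries mirror the `size`, `flags`, and `text` keys that PyMuPDF returns from `page.get_text("dict")`, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

BOLD_FLAG = 1 << 4    # PyMuPDF span flag bit for bold text
ITALIC_FLAG = 1 << 1  # PyMuPDF span flag bit for italic text


def heading_levels(spans: List[dict]) -> Dict[float, int]:
    """Map each font size larger than the body size to a heading level (1-6)."""
    # Weight sizes by character count so the dominant size is the body text.
    weights: Counter = Counter()
    for span in spans:
        weights[round(span["size"], 1)] += len(span["text"])
    if not weights:
        return {}
    body_size = weights.most_common(1)[0][0]
    larger = sorted((s for s in weights if s > body_size), reverse=True)
    return {size: min(rank, 6) for rank, size in enumerate(larger, start=1)}


def span_to_markdown(span: dict, levels: Dict[float, int]) -> str:
    """Render one span as a heading, or as inline bold/italic text."""
    text = span["text"].strip()
    if not text:
        return ""
    level = levels.get(round(span["size"], 1))
    if level:
        return "#" * level + " " + text
    bold = bool(span["flags"] & BOLD_FLAG)
    italic = bool(span["flags"] & ITALIC_FLAG)
    if bold and italic:
        return f"***{text}***"
    if bold:
        return f"**{text}**"
    if italic:
        return f"*{text}*"
    return text


if __name__ == "__main__":
    sample = [
        {"text": "Report", "size": 24.0, "flags": 16},
        {"text": "Body text that is long enough to dominate.", "size": 11.0, "flags": 0},
        {"text": "important", "size": 11.0, "flags": 16},
    ]
    levels = heading_levels(sample)
    for span in sample:
        print(span_to_markdown(span, levels))
```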
### DOCX Processing
- Uses python-docx for document parsing
- Recognizes built-in Word styles
- Extracts tables with proper formatting
- Maintains paragraph-level formatting
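The style-based mapping might look like the following sketch. It works on a style name and paragraph text, which python-docx exposes as `paragraph.style.name` and `paragraph.text`. The function name `style_to_markdown`, the clamping of deeper heading styles to H6, and the treatment of "List Bullet" and "List Number" styles are assumptions for illustration.

```python
import re

# Built-in Word heading styles are named "Heading 1" through "Heading 9".
_HEADING_STYLE = re.compile(r"^Heading ([1-9])$")


def style_to_markdown(style_name: str, text: str) -> str:
    """Convert one paragraph to Markdown based on its Word style name."""
    text = text.strip()
    if not text:
        return ""
    if style_name == "Title":
        return "# " + text
    match = _HEADING_STYLE.match(style_name)
    if match:
        # Markdown only has six heading levels, so clamp deeper styles.
        level = min(int(match.group(1)), 6)
        return "#" * level + " " + text
    if style_name.startswith("List Bullet"):
        return "- " + text
    if style_name.startswith("List Number"):
        return "1. " + text
    return text
```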
### Structure Analysis
The application analyzes:
- **Headings**: Count by level (H1-H6)
- **Lists**: Ordered vs unordered list items
- **Tables**: Number of tables detected
- **Paragraphs**: Regular text paragraphs
- **Formatting**: Bold and italic text occurrences
- **Statistics**: Word count, character count, total lines
## Installation
### Local Development
```bash
# Clone the repository
git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter
cd document-to-markdown-converter
# Install dependencies
pip install -r requirements.txt
# Run the application
python app.py
```
### Dependencies
- `gradio>=4.0.0` - Web interface framework
- `python-docx>=1.1.0` - Word document processing
- `PyMuPDF>=1.23.0` - PDF processing library
## API
### Core Function
```python
from typing import Any, Dict


def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
    """
    Extract document content and convert it to Markdown.

    Args:
        file_path: Path to a PDF or DOCX file

    Returns:
        Dictionary containing:
        - success: Boolean indicating success
        - markdown: Converted Markdown content
        - structure: Document structure analysis
        - file_info: File metadata (name, type, size)
        - preview: Short preview of content
        - error: Error message if processing failed
    """
    ...  # Signature only; body omitted
```
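A caller would typically check `success` before reading the other keys. The sketch below shows one way to do that on a result dictionary of the documented shape. `summarize_result` is a hypothetical helper, and the `name` key inside `file_info` is an assumption based on the docstring.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> str:
    """Build a one-line status message from a conversion result."""
    if not result.get("success"):
        return "Conversion failed: " + str(result.get("error") or "unknown error")
    # Assumption: file_info stores the file name under the key "name".
    info = result.get("file_info") or {}
    name = info.get("name", "unknown file")
    markdown = result.get("markdown", "")
    return f"Converted {name}: {len(markdown)} characters of Markdown"
```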
### Structure Analysis Output
```json
{
"headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0},
"lists": {"ordered": 3, "unordered": 7},
"tables": 2,
"paragraphs": 45,
"bold_text": 12,
"italic_text": 8,
"total_lines": 120,
"word_count": 2500,
"character_count": 15000
}
```
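Statistics of this shape can be approximated by scanning the generated Markdown line by line. The sketch below is a heuristic illustration, not the application's own analyzer: `analyze_structure` is a hypothetical name, and the regular expressions are simplifications (for example, bold-italic runs count as bold only).

```python
import re
from typing import Any, Dict

_BOLD = re.compile(r"\*\*[^*\n]+\*\*")
_ITALIC = re.compile(r"\*[^*\n]+\*")


def analyze_structure(markdown: str) -> Dict[str, Any]:
    """Count headings, list items, tables, and paragraphs in Markdown text."""
    lines = markdown.splitlines()
    headings = {f"h{n}": 0 for n in range(1, 7)}
    lists = {"ordered": 0, "unordered": 0}
    tables = 0
    paragraphs = 0
    in_table = False
    for line in lines:
        stripped = line.strip()
        is_row = len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|")
        if is_row and not in_table:
            tables += 1  # first row of a new run of table lines
        in_table = is_row
        heading = re.match(r"^(#{1,6})\s+\S", stripped)
        if heading:
            headings[f"h{len(heading.group(1))}"] += 1
        elif re.match(r"^\d+[.)]\s+", stripped):
            lists["ordered"] += 1
        elif re.match(r"^[-*+]\s+", stripped):
            lists["unordered"] += 1
        elif stripped and not is_row:
            paragraphs += 1
    # Remove bold runs first so their asterisks are not counted as italics.
    without_bold = _BOLD.sub("", markdown)
    return {
        "headings": headings,
        "lists": lists,
        "tables": tables,
        "paragraphs": paragraphs,
        "bold_text": len(_BOLD.findall(markdown)),
        "italic_text": len(_ITALIC.findall(without_bold)),
        "total_lines": len(lines),
        "word_count": len(markdown.split()),
        "character_count": len(markdown),
    }
```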
## Examples
### Converting a PDF
1. Upload a PDF file
2. The application will:
   - Extract text from each page
   - Detect headings based on font size
   - Preserve bold/italic formatting
   - Convert to clean Markdown
### Converting a DOCX
1. Upload a Word document
2. The application will:
   - Parse document styles
   - Convert headings based on style names
   - Extract and format tables
   - Maintain list structures
## Limitations
- **OCR**: Does not perform OCR, so scanned or image-only PDFs yield little or no text
- **Complex Layouts**: May not perfectly preserve complex document layouts
- **Images**: Does not extract or convert embedded images
- **Fonts**: Limited font analysis for PDFs
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## License
MIT License - see LICENSE file for details.
## Support
For issues and feature requests, please use the Community tab or create an issue on GitHub.
---
*Built with ❤️ using Gradio, python-docx, and PyMuPDF*