---
title: Document to Markdown Converter
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: true
license: mit
python_version: 3.11
hardware: cpu-basic
tags:
  - document-processing
  - markdown
  - pdf-converter
  - text-extraction
short_description: Convert PDF and DOCX documents to Markdown format
---

# Document to Markdown Converter
Convert PDF and DOCX documents to Markdown format with intelligent structure analysis.

## Features

### Supported Formats
- **PDF** - Extract text with formatting preservation
- **Word Documents** (.docx) - Full formatting and structure conversion

### 🧠 Smart Processing
- **Heading Detection** - Automatically detect headings based on styles and formatting
- **Table Extraction** - Convert tables to Markdown format
- **List Processing** - Preserve ordered and unordered lists
- **Inline Formatting** - Maintain bold, italic, and other text formatting
- **Structure Analysis** - Detailed document structure statistics

### ⚡ Key Capabilities
- **Font-based Heading Detection** - Uses font size and styling to identify headings
- **Style Recognition** - Recognizes Word document styles (Title, Heading 1-6)
- **Table Conversion** - Converts complex tables to Markdown table format
- **List Recognition** - Identifies and converts various list formats
- **Text Formatting** - Preserves bold, italic formatting in Markdown syntax
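
To make the table conversion concrete, here is a minimal sketch of how rows of cell text could be rendered as a Markdown pipe table. The function name `rows_to_markdown_table` and the choice to treat the first row as the header are assumptions made for illustration, not the application's actual code.

```python
from typing import List


def rows_to_markdown_table(rows: List[List[str]]) -> str:
    """Render rows of cell text (first row = header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row: List[str]) -> str:
        # Flatten line breaks and escape pipes so cell text cannot break the table.
        cells = [str(c).replace("\n", " ").strip().replace("|", "\\|") for c in row]
        cells += [""] * (width - len(cells))  # pad short rows to the full width
        return "| " + " | ".join(cells) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in rows[1:])
    return "\n".join(lines)


print(rows_to_markdown_table([["Name", "Size"], ["report.pdf", "2 MB"]]))
```

Merged cells and nested tables have no direct Markdown equivalent, so a real converter would need its own policy for those cases.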

## Usage

### Basic Processing
1. Upload a PDF or DOCX file
2. Click "Convert to Markdown"
3. View the converted Markdown in the output tab

### Options
- **Structure Analysis**: Enable to see detailed document statistics
- **Preview Mode**: Show only the first 500 characters for quick preview
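
Preview Mode amounts to truncating the converted text. A minimal sketch, assuming a hypothetical helper `make_preview` and a trailing `...` marker (the marker is an illustrative choice, not documented behavior):

```python
def make_preview(markdown: str, limit: int = 500) -> str:
    """Return at most `limit` characters, appending '...' when text was cut."""
    if len(markdown) <= limit:
        return markdown
    return markdown[:limit].rstrip() + "..."
```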

### Output Tabs
- **Markdown Output**: The complete converted Markdown text
- **Structure Analysis**: Statistics about headings, lists, tables, etc.
- **File Information**: Basic file details (name, type, size)
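
The file details shown in the File Information tab can be gathered with the standard library alone. The sketch below is one possible approach; the helper name `get_file_info` and the key names in the returned dictionary are assumptions.

```python
import os
from typing import Any, Dict


def get_file_info(file_path: str) -> Dict[str, Any]:
    """Collect the name, type, and size of a file on disk."""
    name = os.path.basename(file_path)
    extension = os.path.splitext(name)[1].lower()
    return {
        "name": name,
        # Report the type from the extension, e.g. ".pdf" -> "PDF".
        "type": extension.lstrip(".").upper() or "UNKNOWN",
        "size_bytes": os.path.getsize(file_path),
    }
```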

## Technical Details

### PDF Processing
- Uses PyMuPDF (fitz) for text extraction
- Analyzes font sizes to determine heading hierarchy
- Preserves text formatting flags (bold, italic)
- Processes text blocks while maintaining structure
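
The font-size approach can be sketched without opening a PDF. In PyMuPDF, `page.get_text("dict")` returns blocks containing lines containing spans, and each span carries `text`, `size`, and `flags` fields. The example below works on plain dictionaries shaped like those spans. The heuristic itself (treat the most common size as body text and rank larger sizes as H1, H2, and so on) and the function names are assumptions for illustration, not necessarily what this application does.

```python
from collections import Counter
from typing import Dict, List

BOLD = 1 << 4    # PyMuPDF span flag bit for bold
ITALIC = 1 << 1  # PyMuPDF span flag bit for italic


def heading_levels(spans: List[dict]) -> Dict[float, int]:
    """Map each font size larger than the body size to a heading level (1-6)."""
    weights: Counter = Counter()
    for span in spans:
        # Weight sizes by character count so body text dominates.
        weights[round(span["size"], 1)] += len(span["text"])
    if not weights:
        return {}
    body_size = weights.most_common(1)[0][0]
    larger = sorted((s for s in weights if s > body_size), reverse=True)
    return {size: min(level, 6) for level, size in enumerate(larger, start=1)}


def span_to_markdown(span: dict, levels: Dict[float, int]) -> str:
    """Render one span as a heading, or as body text with bold/italic markers."""
    text = span["text"].strip()
    if not text:
        return ""
    level = levels.get(round(span["size"], 1))
    if level:
        return "#" * level + " " + text
    if span["flags"] & BOLD:
        text = f"**{text}**"
    if span["flags"] & ITALIC:
        text = f"*{text}*"
    return text


spans = [
    {"text": "Annual Report", "size": 24.0, "flags": BOLD},
    {"text": "Overview", "size": 16.0, "flags": BOLD},
    {"text": "Revenue grew steadily over the year.", "size": 11.0, "flags": 0},
    {"text": "Important:", "size": 11.0, "flags": BOLD},
]
levels = heading_levels(spans)
for span in spans:
    print(span_to_markdown(span, levels))
```

A size-only heuristic misfires on documents that use large type for captions or pull quotes, which is one reason the output for complex layouts may need manual cleanup.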

### DOCX Processing
- Uses python-docx for document parsing
- Recognizes built-in Word styles
- Extracts tables with proper formatting
- Maintains paragraph-level formatting
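
Style-based conversion can be illustrated with the style name and text that python-docx exposes as `paragraph.style.name` and `paragraph.text`. The sketch below takes those two strings directly, so it runs without a document. Mapping `Title` to a level-1 heading and the handling of the `List Bullet` and `List Number` styles are assumptions, and both function names are hypothetical.

```python
import re
from typing import Optional


def heading_level_for_style(style_name: str) -> Optional[int]:
    """Return the Markdown heading level for a Word style name, or None."""
    if style_name == "Title":
        return 1  # assumption: the document title becomes a top-level heading
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    return int(match.group(1)) if match else None


def paragraph_to_markdown(style_name: str, text: str) -> str:
    """Convert one paragraph to Markdown based on its built-in style name."""
    text = text.strip()
    if not text:
        return ""
    level = heading_level_for_style(style_name)
    if level is not None:
        return "#" * level + " " + text
    if style_name.startswith("List Bullet"):
        return "- " + text
    if style_name.startswith("List Number"):
        return "1. " + text  # Markdown renderers renumber ordered items
    return text


print(paragraph_to_markdown("Heading 2", "Scope"))
print(paragraph_to_markdown("List Bullet", "First item"))
```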

### Structure Analysis
The application analyzes:
- **Headings**: Count by level (H1-H6)
- **Lists**: Ordered vs unordered list items
- **Tables**: Number of tables detected
- **Paragraphs**: Regular text paragraphs
- **Formatting**: Bold and italic text occurrences
- **Statistics**: Word count, character count, total lines
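
One way to compute these statistics is to scan the generated Markdown line by line. The following is a rough sketch under stated assumptions: the function name `analyze_structure`, the regular expressions, and the rule that every remaining non-blank line counts as one paragraph are illustrative choices, not the application's actual analysis.

```python
import re
from typing import Any, Dict

BOLD_RE = re.compile(r"\*\*(?!\s)[^*\n]+\*\*")
ITALIC_RE = re.compile(r"\*(?!\s)[^*\n]+\*")


def analyze_structure(markdown: str) -> Dict[str, Any]:
    """Count headings, list items, tables, paragraphs, and emphasis in Markdown."""
    headings = {f"h{n}": 0 for n in range(1, 7)}
    lists = {"ordered": 0, "unordered": 0}
    tables = paragraphs = 0
    in_table = False
    lines = markdown.splitlines()
    for line in lines:
        stripped = line.strip()
        is_row = len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|")
        if is_row and not in_table:
            tables += 1  # first row of a new run of pipe-delimited lines
        in_table = is_row
        if not stripped or is_row:
            continue
        heading = re.match(r"(#{1,6})\s", stripped)
        if heading:
            headings[f"h{len(heading.group(1))}"] += 1
        elif re.match(r"\d+\.\s", stripped):
            lists["ordered"] += 1
        elif re.match(r"[-*+]\s", stripped):
            lists["unordered"] += 1
        else:
            paragraphs += 1  # assumption: one non-blank line per paragraph
    bold = len(BOLD_RE.findall(markdown))
    # Remove bold spans first so their asterisks are not counted as italics.
    italic = len(ITALIC_RE.findall(BOLD_RE.sub("", markdown)))
    return {
        "headings": headings,
        "lists": lists,
        "tables": tables,
        "paragraphs": paragraphs,
        "bold_text": bold,
        "italic_text": italic,
        "total_lines": len(lines),
        "word_count": len(markdown.split()),
        "character_count": len(markdown),
    }


sample = (
    "# Title\n\nSome **bold** and *italic* text.\n\n"
    "- one\n- two\n1. first\n\n| A | B |\n| --- | --- |\n| 1 | 2 |\n"
)
stats = analyze_structure(sample)
print(stats["headings"], stats["lists"], stats["tables"])
```

Regex-based counting is a heuristic: emphasis written with underscores and combined bold-italic spans are not handled by this sketch.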

## Installation

### Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter
cd document-to-markdown-converter

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

### Dependencies
- `gradio>=4.0.0` - Web interface framework
- `python-docx>=1.1.0` - Word document processing
- `PyMuPDF>=1.23.0` - PDF processing library

## API

### Core Function

```python
from typing import Any, Dict


def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
    """
    Extract document content and convert it to Markdown format.

    Args:
        file_path: Path to a PDF or DOCX file

    Returns:
        Dictionary containing:
        - success: Boolean indicating success
        - markdown: Converted Markdown content
        - structure: Document structure analysis
        - file_info: File metadata (name, type, size)
        - preview: Short preview of content
        - error: Error message if processing failed
    """
```
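
Callers should check `success` before reading the other keys. The sketch below shows one way to turn the returned dictionary into a status message. The helper `describe_result` is hypothetical, and it assumes `file_info` stores the file name under a `name` key.

```python
from typing import Any, Dict


def describe_result(result: Dict[str, Any]) -> str:
    """Turn a conversion result dictionary into a short status message."""
    if not result.get("success"):
        return f"Conversion failed: {result.get('error', 'unknown error')}"
    info = result.get("file_info", {})
    markdown = result.get("markdown", "")
    words = len(markdown.split())
    return f"Converted {info.get('name', 'document')}: {words} words of Markdown"


# Stand-in dictionaries shaped like the documented return value.
print(describe_result({"success": False, "error": "Unsupported file type"}))
print(describe_result({
    "success": True,
    "markdown": "# Title\n\nHello world",
    "file_info": {"name": "report.pdf"},
}))
```

In practice the argument would come from a call such as `extract_document_to_markdown("report.pdf")`, assuming the function can be imported from `app.py`.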

### Structure Analysis Output

```json
{
  "headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0},
  "lists": {"ordered": 3, "unordered": 7},
  "tables": 2,
  "paragraphs": 45,
  "bold_text": 12,
  "italic_text": 8,
  "total_lines": 120,
  "word_count": 2500,
  "character_count": 15000
}
```

## Examples

### Converting a PDF
1. Upload a PDF file
2. The application will:
   - Extract text from each page
   - Detect headings based on font size
   - Preserve bold/italic formatting
   - Convert to clean Markdown

### Converting a DOCX
1. Upload a Word document
2. The application will:
   - Parse document styles
   - Convert headings based on style names
   - Extract and format tables
   - Maintain list structures

## Limitations
- **OCR**: Does not perform OCR, so text in image-based (scanned) PDFs is not extracted
- **Complex Layouts**: May not perfectly preserve complex document layouts
- **Images**: Does not extract or convert embedded images
- **Fonts**: PDF font analysis is limited to font size and bold/italic styling

## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## License
MIT License - see LICENSE file for details.

## Support
For issues and feature requests, please use the Community tab or create an issue on GitHub.

---
*Built with ❤️ using Gradio, python-docx, and PyMuPDF*