---
title: Document to Markdown Converter
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: true
license: mit
python_version: 3.11
hardware: cpu-basic
tags:
  - document-processing
  - markdown
  - pdf-converter
  - text-extraction
short_description: Convert PDF and DOCX documents to Markdown format
---

# 📄 Document to Markdown Converter

Convert PDF and DOCX documents to Markdown format with intelligent structure analysis.

## Features

### 📄 Supported Formats
- **PDF** - Extract text while preserving bold and italic formatting
- **Word Documents** (.docx) - Full formatting and structure conversion

### 🧠 Smart Processing
- **Heading Detection** - Automatically detect headings based on styles and formatting
- **Table Extraction** - Convert tables to Markdown format
- **List Processing** - Preserve ordered and unordered lists
- **Inline Formatting** - Maintain bold, italic, and other text formatting
- **Structure Analysis** - Detailed document structure statistics

### ⚡ Key Capabilities
- **Font-based Heading Detection** - Uses font size and styling to identify headings
- **Style Recognition** - Recognizes Word document styles (Title, Heading 1-6)
- **Table Conversion** - Converts complex tables to Markdown table format
- **List Recognition** - Identifies and converts various list formats
- **Text Formatting** - Preserves bold and italic formatting using Markdown syntax

## Usage

### Basic Processing
1. Upload a PDF or DOCX file
2. Click "Convert to Markdown"
3. View the converted Markdown in the output tab

### Options
- **Structure Analysis**: Enable to see detailed document statistics
- **Preview Mode**: Show only the first 500 characters for a quick preview
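
As a rough illustration of what Preview Mode does, the sketch below truncates Markdown to a fixed number of characters. The function name `make_preview` and the trailing ellipsis are assumptions for this example; only the 500-character limit comes from the option described above.

```python
def make_preview(markdown: str, limit: int = 500) -> str:
    """Return at most `limit` characters of the converted Markdown.

    Hypothetical helper: the trailing "..." marks truncated output and is
    an assumption, not documented behavior of the app.
    """
    if len(markdown) <= limit:
        return markdown
    # Drop trailing whitespace at the cut so the ellipsis attaches cleanly.
    return markdown[:limit].rstrip() + "..."
```

Short documents pass through unchanged, so the preview and the full output are identical for anything under the limit.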

### Output Tabs
- **Markdown Output**: The complete converted Markdown text
- **Structure Analysis**: Statistics about headings, lists, tables, etc.
- **File Information**: Basic file details (name, type, size)

## Technical Details

### PDF Processing
- Uses PyMuPDF (fitz) for text extraction
- Analyzes font sizes to determine heading hierarchy
- Preserves text formatting flags (bold, italic)
- Processes text blocks while maintaining structure
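
The font-based approach above can be sketched without opening a real PDF. The example works on span dictionaries shaped like the ones PyMuPDF returns from `page.get_text("dict")` (each span carries `text`, `size`, and `flags`). The helper names and the size-ratio thresholds are assumptions chosen for illustration, not the values used in `app.py`.

```python
from collections import Counter

BOLD_FLAG = 1 << 4    # PyMuPDF span flag bit for bold text
ITALIC_FLAG = 1 << 1  # PyMuPDF span flag bit for italic text


def body_font_size(spans):
    """Assume the most common rounded font size is the body text size."""
    sizes = Counter(round(s["size"]) for s in spans if s["text"].strip())
    return sizes.most_common(1)[0][0]


def heading_level(size, body_size):
    """Map a font size to heading level 1-3, or 0 for body text.

    The ratio thresholds are illustrative guesses.
    """
    ratio = size / body_size
    if ratio >= 1.8:
        return 1
    if ratio >= 1.4:
        return 2
    if ratio >= 1.15:
        return 3
    return 0


def span_to_markdown(span, body_size):
    """Render one span as a heading line or as inline-formatted text."""
    text = span["text"].strip()
    if not text:
        return ""
    level = heading_level(span["size"], body_size)
    if level:
        return "#" * level + " " + text
    bold = span["flags"] & BOLD_FLAG
    italic = span["flags"] & ITALIC_FLAG
    if bold and italic:
        return f"***{text}***"
    if bold:
        return f"**{text}**"
    if italic:
        return f"*{text}*"
    return text


# Hand-written sample spans standing in for real PDF output.
spans = [
    {"text": "Annual Report", "size": 24.0, "flags": 16},
    {"text": "Overview", "size": 16.0, "flags": 16},
    {"text": "Revenue grew this year.", "size": 11.0, "flags": 0},
    {"text": "See the appendix.", "size": 11.0, "flags": 2},
    {"text": "Costs were stable.", "size": 11.0, "flags": 0},
]
body = body_font_size(spans)
lines = [span_to_markdown(s, body) for s in spans]
```

Comparing each span against the most common size, rather than against fixed point values, keeps the heuristic usable across documents with different base font sizes.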

### DOCX Processing
- Uses python-docx for document parsing
- Recognizes built-in Word styles
- Extracts tables with proper formatting
- Maintains paragraph-level formatting
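
A minimal sketch of the style-based conversion follows. It takes plain strings and lists so it runs without a document; with python-docx these values would typically come from `paragraph.style.name`, `paragraph.text`, and `cell.text` for each cell in `table.rows`. The function names are hypothetical, and mapping the Title style to a level-1 heading is an assumption.

```python
import re


def style_to_heading_level(style_name):
    """Return 1-6 for 'Title' or 'Heading N' styles, 0 for anything else."""
    if style_name == "Title":
        return 1  # assumption: treat Title like a top-level heading
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    return int(match.group(1)) if match else 0


def paragraph_to_markdown(style_name, text):
    """Prefix heading paragraphs with '#' marks; pass other text through."""
    level = style_to_heading_level(style_name)
    return "#" * level + " " + text if level else text


def rows_to_markdown_table(rows):
    """Render rows of cell strings (first row is the header) as a pipe table."""
    def fmt(cells):
        # Escape pipes and flatten newlines so each row stays on one line.
        cleaned = [c.replace("|", "\\|").replace("\n", " ").strip() for c in cells]
        return "| " + " | ".join(cleaned) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])
```

For example, `paragraph_to_markdown("Heading 2", "Scope")` yields `## Scope`, while a paragraph in the Normal style is returned unchanged.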

### Structure Analysis
The application analyzes:
- **Headings**: Count by level (H1-H6)
- **Lists**: Ordered vs unordered list items
- **Tables**: Number of tables detected
- **Paragraphs**: Regular text paragraphs
- **Formatting**: Bold and italic text occurrences
- **Statistics**: Word count, character count, total lines
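
One way to compute these statistics is to scan the generated Markdown line by line. The sketch below is an assumption about how such an analysis could work, not the app's actual code: `analyze_structure` is a hypothetical name, each non-empty line of plain text is counted as one paragraph, and a run of consecutive lines starting with `|` is counted as one table.

```python
import re


def analyze_structure(markdown: str) -> dict:
    """Count headings, list items, tables, and inline formatting heuristically."""
    lines = markdown.splitlines()
    headings = {f"h{n}": 0 for n in range(1, 7)}
    lists = {"ordered": 0, "unordered": 0}
    tables = 0
    paragraphs = 0
    in_table = False
    for line in lines:
        stripped = line.strip()
        is_table_row = stripped.startswith("|")
        if is_table_row and not in_table:
            tables += 1  # first row of a new run of table lines
        in_table = is_table_row
        if not stripped or is_table_row:
            continue
        heading = re.match(r"(#{1,6})\s", stripped)
        if heading:
            headings[f"h{len(heading.group(1))}"] += 1
        elif re.match(r"\d+\.\s", stripped):
            lists["ordered"] += 1
        elif re.match(r"[-*+]\s", stripped):
            lists["unordered"] += 1
        else:
            paragraphs += 1
    return {
        "headings": headings,
        "lists": lists,
        "tables": tables,
        "paragraphs": paragraphs,
        # Regex counts are approximate; nested emphasis is not handled.
        "bold_text": len(re.findall(r"\*\*[^*\n]+\*\*", markdown)),
        "italic_text": len(re.findall(r"(?<!\*)\*(?![\s*])[^*\n]+\*(?!\*)", markdown)),
        "total_lines": len(lines),
        "word_count": len(markdown.split()),
        "character_count": len(markdown),
    }
```

Because the counts are derived from the Markdown output rather than the source file, they reflect what the converter detected, not necessarily what the original document contained.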

## Installation

### Local Development
```bash
# Clone the repository
git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter
cd document-to-markdown-converter

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

### Dependencies
- `gradio>=4.0.0` - Web interface framework
- `python-docx>=1.1.0` - Word document processing
- `PyMuPDF>=1.23.0` - PDF processing library
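
To confirm that a local environment has these packages, a small check with the standard library's `importlib.metadata` can help. This is a convenience sketch, not part of the project; `installed_versions` is a hypothetical helper, and it reports versions without comparing them against the minimums listed above.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = ("gradio", "python-docx", "PyMuPDF")


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```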

## API

### Core Function
```python
from typing import Any, Dict


def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
    """
    Extract document content and convert to Markdown format
    
    Args:
        file_path: Path to PDF or DOCX file
    
    Returns:
        Dictionary containing:
        - success: Boolean indicating success
        - markdown: Converted Markdown content
        - structure: Document structure analysis
        - file_info: File metadata (name, type, size)
        - preview: Short preview of content
        - error: Error message if processing failed
    """
```
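
A caller might handle the returned dictionary as sketched below. The top-level keys follow the docstring above; the `name`, `type`, and `size` keys inside `file_info` are assumed from its description, and `render_result` is a hypothetical helper rather than part of the app.

```python
from typing import Any, Dict


def render_result(result: Dict[str, Any]) -> str:
    """Turn the documented result dictionary into text suitable for display."""
    if not result.get("success"):
        return f"Conversion failed: {result.get('error', 'unknown error')}"
    info = result.get("file_info", {})
    # Key names inside file_info are assumptions based on the docstring.
    header = f"{info.get('name', '?')} ({info.get('type', '?')}, {info.get('size', '?')})"
    return header + "\n\n" + result.get("markdown", "")
```

Checking `success` before reading `markdown` avoids treating a failed conversion as an empty document.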

### Structure Analysis Output
```json
{
  "headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0},
  "lists": {"ordered": 3, "unordered": 7},
  "tables": 2,
  "paragraphs": 45,
  "bold_text": 12,
  "italic_text": 8,
  "total_lines": 120,
  "word_count": 2500,
  "character_count": 15000
}
```

## Examples

### Converting a PDF
1. Upload a PDF file
2. The application will:
   - Extract text from each page
   - Detect headings based on font size
   - Preserve bold/italic formatting
   - Convert to clean Markdown

### Converting a DOCX
1. Upload a Word document
2. The application will:
   - Parse document styles
   - Convert headings based on style names
   - Extract and format tables
   - Maintain list structures

## Limitations

- **OCR**: Does not perform OCR, so image-based (scanned) PDFs yield little or no text
- **Complex Layouts**: May not perfectly preserve complex document layouts
- **Images**: Does not extract or convert embedded images
- **Fonts**: PDF heading detection relies on font size and bold/italic flags, so documents with uniform fonts may lose their heading structure

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## License

MIT License - see LICENSE file for details.

## Support

For issues and feature requests, please use the Community tab or create an issue on GitHub.

---

*Built with ❤️ using Gradio, python-docx, and PyMuPDF*