metadata

title: Document to Markdown Converter
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: true
license: mit
python_version: 3.11
hardware: cpu-basic
tags:
  - document-processing
  - markdown
  - pdf-converter
  - text-extraction
short_description: Convert PDF and DOCX documents to Markdown format

📄 Document to Markdown Converter

Convert PDF and DOCX documents to Markdown format with intelligent structure analysis.

Features

📄 Supported Formats

PDF - Extract text with formatting preservation
Word Documents (.docx) - Full formatting and structure conversion

🧠 Smart Processing

Heading Detection - Automatically detect headings based on styles and formatting
Table Extraction - Convert tables to Markdown format
List Processing - Preserve ordered and unordered lists
Inline Formatting - Maintain bold, italic, and other text formatting
Structure Analysis - Detailed document structure statistics

⚡ Key Capabilities

Font-based Heading Detection - Uses font size and styling to identify headings
Style Recognition - Recognizes Word document styles (Title, Heading 1-6)
Table Conversion - Converts complex tables to Markdown table format
List Recognition - Identifies and converts various list formats
Text Formatting - Preserves bold and italic formatting in Markdown syntax

Usage

Basic Processing

Upload a PDF or DOCX file
Click "Convert to Markdown"
View the converted Markdown in the output tab

Options

Structure Analysis: Enable to see detailed document statistics
Preview Mode: Show only the first 500 characters of the output for a quick preview

Output Tabs

Markdown Output: The complete converted Markdown text
Structure Analysis: Statistics about headings, lists, tables, etc.
File Information: Basic file details (name, type, size)

Technical Details

PDF Processing

Uses PyMuPDF (fitz) for text extraction
Analyzes font sizes to determine heading hierarchy
Preserves text formatting flags (bold, italic)
Processes text blocks while maintaining structure
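
The font-size approach above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in app.py: it assumes the most frequent font size is body text and that each larger size maps to one heading level. The span dictionaries mimic the "text", "size", and "flags" fields that PyMuPDF returns from page.get_text("dict"), where flag bit 1 marks italic and bit 4 marks bold. The names build_heading_map and span_to_markdown are hypothetical.

```python
from collections import Counter

# PyMuPDF span flag bits: bit 1 = italic, bit 4 = bold.
ITALIC_FLAG = 1 << 1  # 2
BOLD_FLAG = 1 << 4    # 16


def build_heading_map(sizes, max_levels=6):
    """Map font sizes larger than the body size to heading levels 1..max_levels.

    Assumption: the most frequent size is body text, and the largest
    remaining size becomes level 1.
    """
    if not sizes:
        return {}
    body_size = Counter(sizes).most_common(1)[0][0]
    larger = sorted({s for s in sizes if s > body_size}, reverse=True)
    return {size: level for level, size in enumerate(larger[:max_levels], start=1)}


def span_to_markdown(span, heading_map):
    """Render one span dict ({"text", "size", "flags"}) as a Markdown fragment."""
    text = span["text"].strip()
    if not text:
        return ""
    level = heading_map.get(span["size"])
    if level is not None:
        return "#" * level + " " + text
    bold = bool(span["flags"] & BOLD_FLAG)
    italic = bool(span["flags"] & ITALIC_FLAG)
    if bold and italic:
        return f"***{text}***"
    if bold:
        return f"**{text}**"
    if italic:
        return f"*{text}*"
    return text


# Hand-written sample spans standing in for real PDF output.
spans = [
    {"text": "Annual Report", "size": 24.0, "flags": 16},
    {"text": "Overview", "size": 16.0, "flags": 16},
    {"text": "Plain body text.", "size": 11.0, "flags": 0},
    {"text": "More body text.", "size": 11.0, "flags": 0},
    {"text": "Important", "size": 11.0, "flags": 16},
    {"text": "emphasis", "size": 11.0, "flags": 2},
]
heading_map = build_heading_map([s["size"] for s in spans])
for span in spans:
    print(span_to_markdown(span, heading_map))
```

With real documents the spans would be collected by walking the blocks, lines, and spans of each page before building the size map.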

DOCX Processing

Uses python-docx for document parsing
Recognizes built-in Word styles
Extracts tables with proper formatting
Maintains paragraph-level formatting
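
As a rough illustration of style-based conversion, the sketch below maps Word style names to Markdown prefixes. With python-docx the style name would typically be read from paragraph.style.name; here it is passed in as a plain string so the example runs without the library. The function names and the choice to render "Title" as a level-1 heading are assumptions, not confirmed behavior of this application.

```python
import re

# Built-in Word heading styles are named "Heading 1" through "Heading 9".
HEADING_STYLE = re.compile(r"^Heading ([1-9])$")


def style_to_markdown_prefix(style_name):
    """Return the Markdown prefix for a Word style name, or "" for body text."""
    if style_name == "Title":
        return "# "  # assumption: Title is rendered like a top-level heading
    match = HEADING_STYLE.match(style_name)
    if match:
        level = min(int(match.group(1)), 6)  # Markdown has only six levels
        return "#" * level + " "
    if style_name == "List Bullet":
        return "- "
    if style_name == "List Number":
        return "1. "
    return ""


def paragraph_to_markdown(style_name, text):
    """Combine the style prefix with the trimmed paragraph text."""
    return style_to_markdown_prefix(style_name) + text.strip()


print(paragraph_to_markdown("Heading 2", " Scope "))
print(paragraph_to_markdown("List Bullet", "First item"))
print(paragraph_to_markdown("Normal", "Body text."))
```

Custom or localized style names would fall through to plain text in this sketch, so a real converter may need an additional fallback such as font-based detection.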

Structure Analysis

The application analyzes:

Headings: Count by level (H1-H6)
Lists: Ordered vs unordered list items
Tables: Number of tables detected
Paragraphs: Regular text paragraphs
Formatting: Bold and italic text occurrences
Statistics: Word count, character count, total lines
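
One way such statistics could be gathered is by scanning the generated Markdown line by line. The sketch below is a simplified approximation rather than the application's actual analyzer: the function name analyze_markdown_structure is hypothetical, the key names mirror the Structure Analysis Output in the API section, and the counting rules (for example, one separator row per table) are assumptions.

```python
import re

BOLD_PATTERN = re.compile(r"\*\*[^*\n]+\*\*")
ITALIC_PATTERN = re.compile(r"\*[^*\n]+\*")


def is_table_separator(line):
    """True for rows such as '| --- | --- |' that follow a table header."""
    return line.startswith("|") and "-" in line and set(line) <= set("|-: ")


def analyze_markdown_structure(markdown):
    """Count headings, list items, tables, paragraphs, and inline formatting."""
    lines = markdown.splitlines()
    stats = {
        "headings": {f"h{n}": 0 for n in range(1, 7)},
        "lists": {"ordered": 0, "unordered": 0},
        "tables": 0,
        "paragraphs": 0,
        "bold_text": 0,
        "italic_text": 0,
        "total_lines": len(lines),
        "word_count": len(markdown.split()),
        "character_count": len(markdown),
    }
    for line in lines:
        stripped = line.strip()
        heading = re.match(r"^(#{1,6})\s+\S", stripped)
        if heading:
            stats["headings"][f"h{len(heading.group(1))}"] += 1
        elif re.match(r"^\d+\.\s+", stripped):
            stats["lists"]["ordered"] += 1
        elif re.match(r"^[-*+]\s+", stripped):
            stats["lists"]["unordered"] += 1
        elif is_table_separator(stripped):
            stats["tables"] += 1  # one separator row per table
        elif stripped and not stripped.startswith("|"):
            stats["paragraphs"] += 1
    stats["bold_text"] = len(BOLD_PATTERN.findall(markdown))
    # Remove bold runs first so their asterisks are not counted as italics.
    without_bold = BOLD_PATTERN.sub("", markdown)
    stats["italic_text"] = len(ITALIC_PATTERN.findall(without_bold))
    return stats


sample = """# Title

Intro with **bold** and *italic* text.

## Section

- first
- second
1. step

| A | B |
| --- | --- |
| 1 | 2 |
"""
stats = analyze_markdown_structure(sample)
print(stats["headings"], stats["lists"], stats["tables"])
```

Each non-empty line that is not a heading, list item, or table row counts as one paragraph here, which overcounts paragraphs that are hard-wrapped across several lines.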

Installation

Local Development

# Clone the repository
git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter
cd document-to-markdown-converter

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py

Dependencies

gradio>=4.0.0 - Web interface framework
python-docx>=1.1.0 - Word document processing
PyMuPDF>=1.23.0 - PDF processing library

API

Core Function

from typing import Any, Dict


def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
    """
    Extract document content and convert to Markdown format
    
    Args:
        file_path: Path to PDF or DOCX file
    
    Returns:
        Dictionary containing:
        - success: Boolean indicating success
        - markdown: Converted Markdown content
        - structure: Document structure analysis
        - file_info: File metadata (name, type, size)
        - preview: Short preview of content
        - error: Error message if processing failed
    """
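
A caller would usually check the success flag before reading the other keys. The sketch below shows one way to do that with hand-built dictionaries in place of a real call. The helper summarize_result is hypothetical, and the "name" key inside file_info is an assumption based on the docstring's "(name, type, size)" description.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> str:
    """Turn a result dictionary into a one-line status message."""
    if not result.get("success"):
        return f"Conversion failed: {result.get('error', 'unknown error')}"
    info = result.get("file_info", {})
    markdown = result.get("markdown", "")
    name = info.get("name", "document")  # assumed key name
    return f"Converted {name}: {len(markdown)} characters of Markdown"


# Hand-built stand-ins for what extract_document_to_markdown might return.
ok = {"success": True, "markdown": "# Title\n\nBody", "file_info": {"name": "report.pdf"}}
failed = {"success": False, "error": "Unsupported file type"}

print(summarize_result(ok))
print(summarize_result(failed))
```

Reading keys with .get keeps the caller from raising KeyError when a failed conversion omits the markdown or file_info entries.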

Structure Analysis Output

{
  "headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0},
  "lists": {"ordered": 3, "unordered": 7},
  "tables": 2,
  "paragraphs": 45,
  "bold_text": 12,
  "italic_text": 8,
  "total_lines": 120,
  "word_count": 2500,
  "character_count": 15000
}

Examples

Converting a PDF

Upload a PDF file
The application will:
- Extract text from each page
- Detect headings based on font size
- Preserve bold/italic formatting
- Convert to clean Markdown

Converting a DOCX

Upload a Word document
The application will:
- Parse document styles
- Convert headings based on style names
- Extract and format tables
- Maintain list structures
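
The table step can be illustrated with a small helper that turns rows of cell text into a Markdown table. This is a sketch under assumptions, not the application's own code: rows_to_markdown_table is a hypothetical name, the first row is treated as the header, and short rows are padded with empty cells. With python-docx the rows could be built from the cell.text values of each row in table.rows.

```python
def rows_to_markdown_table(rows):
    """Convert a list of rows (lists of cell strings) to a Markdown table.

    The first row is treated as the header; short rows are padded
    with empty cells so every row has the same number of columns.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def format_row(row):
        # Escape pipes and flatten newlines so a cell cannot break the table.
        cells = [str(cell).strip().replace("|", "\\|").replace("\n", " ") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([format_row(header), separator, *(format_row(r) for r in body)])


rows = [["Name", "Qty"], ["Apples", "3"], ["Pears"]]
table = rows_to_markdown_table(rows)
print(table)
```

Merged cells and nested tables have no direct Markdown equivalent, so a converter has to flatten them in some way before a helper like this can be applied.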

Limitations

OCR: Does not perform OCR on image-based PDFs
Complex Layouts: May not perfectly preserve complex document layouts
Images: Does not extract or convert embedded images
Fonts: PDF font analysis is limited; heading detection relies mainly on font size and bold/italic flags

Contributing

Fork the repository
Create a feature branch
Make your changes
Test thoroughly
Submit a pull request

License

MIT License - see LICENSE file for details.

Support

For issues and feature requests, please use the Community tab or create an issue on GitHub.

Built with ❤️ using Gradio, python-docx, and PyMuPDF