metadata
title: Document to Markdown Converter
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: true
license: mit
python_version: 3.11
hardware: cpu-basic
tags:
  - document-processing
  - markdown
  - pdf-converter
  - text-extraction
short_description: Convert PDF and DOCX documents to Markdown format

📄 Document to Markdown Converter

Convert PDF and DOCX documents to Markdown format with intelligent structure analysis.

Features

📄 Supported Formats

  • PDF - Extract text with formatting preservation
  • Word Documents (.docx) - Full formatting and structure conversion

🧠 Smart Processing

  • Heading Detection - Automatically detect headings based on styles and formatting
  • Table Extraction - Convert tables to Markdown format
  • List Processing - Preserve ordered and unordered lists
  • Inline Formatting - Maintain bold, italic, and other text formatting
  • Structure Analysis - Detailed document structure statistics

⚡ Key Capabilities

  • Font-based Heading Detection - Uses font size and styling to identify headings
  • Style Recognition - Recognizes Word document styles (Title, Heading 1-6)
  • Table Conversion - Converts complex tables to Markdown table format
  • List Recognition - Identifies and converts various list formats
  • Text Formatting - Preserves bold and italic formatting in Markdown syntax
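
As an illustration of the table conversion described above, here is a minimal sketch that turns a list of rows into a Markdown pipe table. The helper name `rows_to_markdown_table` is hypothetical and not taken from app.py; it assumes the first row is the header and that cell text has already been extracted as strings.

```python
from typing import List


def rows_to_markdown_table(rows: List[List[str]]) -> str:
    """Render rows as a Markdown pipe table; the first row is treated as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: str) -> str:
        # Pipes and newlines would break the table syntax, so neutralise them.
        return cell.replace("|", "\\|").replace("\n", " ").strip()

    # Pad short rows so every line has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = padded[0], padded[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_markdown_table([["Name", "Qty"], ["Apple", "3"], ["Pear"]]))
```

Merged cells and nested tables are not handled here; a real converter would need extra rules for those.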

Usage

Basic Processing

  1. Upload a PDF or DOCX file
  2. Click "Convert to Markdown"
  3. View the converted Markdown in the output tab

Options

  • Structure Analysis: Enable to see detailed document statistics
  • Preview Mode: Show only the first 500 characters for a quick preview

Output Tabs

  • Markdown Output: The complete converted Markdown text
  • Structure Analysis: Statistics about headings, lists, tables, etc.
  • File Information: Basic file details (name, type, size)

Technical Details

PDF Processing

  • Uses PyMuPDF (fitz) for text extraction
  • Analyzes font sizes to determine heading hierarchy
  • Preserves text formatting flags (bold, italic)
  • Processes text blocks while maintaining structure
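
The font-size approach above could look roughly like the sketch below. It is not the code from app.py: the helper names (`iter_spans`, `heading_levels`, `span_to_markdown`) are hypothetical, and treating the most common font size as body text is an assumption. The span dictionaries follow the shape PyMuPDF returns from `page.get_text("dict")`, where each span carries `text`, `size`, and `flags`.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

# PyMuPDF span flag bits: value 2 marks italic, value 16 marks bold.
ITALIC_FLAG = 2
BOLD_FLAG = 16


def iter_spans(pdf_path: str) -> Iterator[dict]:
    """Yield the text spans of every page in order (needs PyMuPDF)."""
    import fitz  # imported lazily so the helpers below work without PyMuPDF

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    yield from line["spans"]


def heading_levels(spans: Iterable[dict], max_levels: int = 6) -> Dict[float, int]:
    """Map font sizes above the body size to heading levels 1..max_levels."""
    sizes: Counter = Counter()
    for span in spans:
        # Weight each size by the amount of text set in it.
        sizes[round(span["size"], 1)] += len(span["text"])
    if not sizes:
        return {}
    body_size = sizes.most_common(1)[0][0]  # assumption: body text dominates
    larger = sorted((s for s in sizes if s > body_size), reverse=True)
    return {size: level for level, size in enumerate(larger[:max_levels], start=1)}


def span_to_markdown(span: dict, levels: Dict[float, int]) -> str:
    """Convert one span to a heading line or to inline-formatted text."""
    text = span["text"].strip()
    if not text:
        return ""
    level = levels.get(round(span["size"], 1))
    if level:
        return "#" * level + " " + text
    if span["flags"] & BOLD_FLAG:
        text = f"**{text}**"
    if span["flags"] & ITALIC_FLAG:
        text = f"*{text}*"
    return text
```

Because `heading_levels` consumes its input, collect the spans first (`spans = list(iter_spans(path))`) and then pass the same list to both helpers.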

DOCX Processing

  • Uses python-docx for document parsing
  • Recognizes built-in Word styles
  • Extracts tables with proper formatting
  • Maintains paragraph-level formatting
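
For the style-based DOCX path, a minimal sketch might map Word style names to Markdown prefixes as shown below. The function names are hypothetical, and the mapping itself is an assumption: both "Title" and "Heading 1" become H1, and "List Bullet" / "List Number" styles become list items. The actual rules in app.py may differ.

```python
import re
from typing import Iterator, Tuple

_HEADING_RE = re.compile(r"^Heading ([1-9])$")


def style_to_prefix(style_name: str) -> str:
    """Return the Markdown prefix for a Word style name ('' for body text)."""
    if style_name == "Title":
        return "# "
    match = _HEADING_RE.match(style_name)
    if match:
        level = min(int(match.group(1)), 6)  # Markdown headings stop at H6
        return "#" * level + " "
    if style_name.startswith("List Bullet"):
        return "- "
    if style_name.startswith("List Number"):
        return "1. "
    return ""


def iter_docx_paragraphs(docx_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (style name, text) for each body paragraph (needs python-docx)."""
    from docx import Document  # imported lazily; only needed for real files

    for paragraph in Document(docx_path).paragraphs:
        yield paragraph.style.name, paragraph.text


def docx_to_markdown(docx_path: str) -> str:
    """Join non-empty paragraphs, each prefixed according to its style."""
    lines = [
        style_to_prefix(style) + text
        for style, text in iter_docx_paragraphs(docx_path)
        if text.strip()
    ]
    return "\n\n".join(lines)
```

Note that `Document.paragraphs` only covers body paragraphs; tables are exposed separately through `Document.tables` and would need their own pass.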

Structure Analysis

The application analyzes:

  • Headings: Count by level (H1-H6)
  • Lists: Ordered vs unordered list items
  • Tables: Number of tables detected
  • Paragraphs: Regular text paragraphs
  • Formatting: Bold and italic text occurrences
  • Statistics: Word count, character count, total lines

Installation

Local Development

# Clone the repository
git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter
cd document-to-markdown-converter

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py

Dependencies

  • gradio>=4.0.0 - Web interface framework
  • python-docx>=1.1.0 - Word document processing
  • PyMuPDF>=1.23.0 - PDF processing library

API

Core Function

from typing import Any, Dict


def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
    """
    Extract document content and convert to Markdown format
    
    Args:
        file_path: Path to PDF or DOCX file
    
    Returns:
        Dictionary containing:
        - success: Boolean indicating success
        - markdown: Converted Markdown content
        - structure: Document structure analysis
        - file_info: File metadata (name, type, size)
        - preview: Short preview of content
        - error: Error message if processing failed
    """
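
A caller would typically check `success` before reading the other keys. The sketch below shows one way to do that; `render_result` is a hypothetical helper, and it relies only on the keys listed in the docstring above.

```python
from typing import Any, Dict


def render_result(result: Dict[str, Any], preview_only: bool = False) -> str:
    """Pick the text to display from a result dictionary."""
    if not result.get("success"):
        # Fall back to a generic message if no error text was supplied.
        return "Conversion failed: " + str(result.get("error") or "unknown error")
    if preview_only:
        return result.get("preview", "")
    return result.get("markdown", "")
```

With the real function available, usage would be along the lines of `print(render_result(extract_document_to_markdown("report.pdf")))`, where `report.pdf` is a placeholder path.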

Structure Analysis Output

{
  "headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0},
  "lists": {"ordered": 3, "unordered": 7},
  "tables": 2,
  "paragraphs": 45,
  "bold_text": 12,
  "italic_text": 8,
  "total_lines": 120,
  "word_count": 2500,
  "character_count": 15000
}
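
To make the fields above concrete, here is a rough sketch that derives the same keys from Markdown text using regular expressions. `analyze_markdown` is a hypothetical name, and the counting rules (for example, what counts as a paragraph or a table) are assumptions rather than the application's exact logic.

```python
import re
from typing import Any, Dict


def analyze_markdown(markdown: str) -> Dict[str, Any]:
    """Count headings, list items, tables, and paragraphs line by line."""
    lines = markdown.splitlines()
    headings = {f"h{n}": 0 for n in range(1, 7)}
    lists = {"ordered": 0, "unordered": 0}
    tables = paragraphs = 0
    in_table = False
    for line in lines:
        stripped = line.strip()
        is_row = len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|")
        if is_row and not in_table:
            tables += 1  # a run of consecutive pipe rows counts as one table
        in_table = is_row
        if not stripped or is_row:
            continue
        heading = re.match(r"^(#{1,6})\s+\S", stripped)
        if heading:
            headings[f"h{len(heading.group(1))}"] += 1
        elif re.match(r"^\d+[.)]\s+", stripped):
            lists["ordered"] += 1
        elif re.match(r"^[-*+]\s+", stripped):
            lists["unordered"] += 1
        else:
            paragraphs += 1
    bold = len(re.findall(r"\*\*[^*\n]+\*\*", markdown))
    # Italic: a single asterisk pair whose asterisks are not part of '**'.
    italic = len(re.findall(r"(?<!\*)\*(?!\*)[^*\n]+(?<!\*)\*(?!\*)", markdown))
    return {
        "headings": headings,
        "lists": lists,
        "tables": tables,
        "paragraphs": paragraphs,
        "bold_text": bold,
        "italic_text": italic,
        "total_lines": len(lines),
        "word_count": len(markdown.split()),
        "character_count": len(markdown),
    }
```

The regex counts are approximations: underscore-style emphasis and asterisk bullets that share a line with emphasis are not handled.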

Examples

Converting a PDF

  1. Upload a PDF file
  2. The application will:
    • Extract text from each page
    • Detect headings based on font size
    • Preserve bold/italic formatting
    • Convert to clean Markdown

Converting a DOCX

  1. Upload a Word document
  2. The application will:
    • Parse document styles
    • Convert headings based on style names
    • Extract and format tables
    • Maintain list structures

Limitations

  • OCR: Does not perform OCR on image-based PDFs
  • Complex Layouts: May not perfectly preserve complex document layouts
  • Images: Does not extract or convert embedded images
  • Fonts: PDF font analysis is limited to font size and bold/italic flags
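
Since image-based PDFs yield little or no text, a caller may want to detect them before converting. The heuristic below is only a sketch: `looks_image_based` and its 20-character threshold are assumptions, and the per-page strings are expected to come from a text extractor such as PyMuPDF's `page.get_text()`.

```python
from typing import List


def looks_image_based(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: True when more than half of the pages have almost no text."""
    if not page_texts:
        return False
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > 0.5
```

A document flagged this way would need an OCR step, which this application does not provide.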

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Support

For issues and feature requests, please use the Community tab or create an issue on GitHub.


Built with ❤️ using Gradio, python-docx, and PyMuPDF