---
title: Docling
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Medical document parser & redactor
---

# Medical Document Parser & Redactor

A medical document processing application that uses Docling, a structure-aware document parser, to parse PDF medical documents and automatically redact medication information using AI-powered analysis.

## 🎯 Overview

This application provides a Streamlit-based interface for uploading medical PDF documents, parsing them with Docling to extract structured content, and using Azure OpenAI to intelligently identify and redact formal medication lists while preserving clinical context.

πŸ—οΈ Project Structure

docling/
β”œβ”€β”€ src/                          # Main source code
β”‚   β”œβ”€β”€ processing/               # Core processing logic
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ document_processor.py # Main document processing pipeline
β”‚   β”‚   β”œβ”€β”€ llm_extractor.py      # Azure OpenAI integration for medication detection
β”‚   β”‚   └── sections.py           # Section extraction and redaction logic
β”‚   β”œβ”€β”€ utils/                    # Utility functions
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── logging_utils.py      # Logging configuration and handlers
β”‚   └── streamlit_app.py          # Main Streamlit application interface
β”œβ”€β”€ temp_files/                   # Temporary file storage (auto-created)
β”œβ”€β”€ .env                          # Environment variables (Azure OpenAI credentials)
β”œβ”€β”€ requirements.txt              # Python dependencies
β”œβ”€β”€ pyproject.toml               # Project configuration
β”œβ”€β”€ Dockerfile                   # Container configuration
└── README.md                    # This file

πŸ“ File Responsibilities

Core Processing Files

src/processing/document_processor.py

Purpose: Main document processing pipeline that orchestrates the entire workflow.

Key Classes:

  • DocumentResult: Data class holding processed results
  • DocumentProcessor: Main processing class

Key Functions:

  • process(file_path): Main processing method
  • _export_redacted_markdown(): Generates redacted markdown
  • _reconstruct_markdown_from_filtered_texts(): Reconstructs markdown from filtered content

Responsibilities:

  • Document conversion using Docling
  • Section redaction coordination
  • Markdown generation and reconstruction
  • File persistence and logging

#### `src/processing/llm_extractor.py`

**Purpose:** Azure OpenAI integration for intelligent medication detection.

**Key Classes:**

- `AzureO1MedicationExtractor`: LLM-based medication extractor

**Key Functions:**

- `extract_medication_sections(doc_json)`: Main extraction method
- `__init__()`: Azure OpenAI client initialization

**Responsibilities:**

- Azure OpenAI API communication
- Medication section identification
- Structured JSON response generation
- Error handling and logging

#### `src/processing/sections.py`

**Purpose:** Section extraction and redaction logic.

**Key Classes:**

- `ReasoningSectionExtractor`: AI-powered section extractor
- `SectionDefinition`: Section definition data class
- `SectionExtractor`: Traditional regex-based extractor

**Key Functions:**

- `remove_sections_from_json()`: JSON-based section removal
- `remove_sections()`: Text-based section removal (fallback)

**Responsibilities:**

- Section identification and removal
- JSON structure manipulation
- Text processing and redaction
- Reasoning logging and transparency

### Interface Files

#### `src/streamlit_app.py`

**Purpose:** Main Streamlit web application interface.

**Key Functions:**

- `save_uploaded_file()`: File upload handling
- `cleanup_temp_files()`: Temporary file management
- `create_diff_content()`: Diff view generation

**Responsibilities:**

- User interface and interaction
- File upload and management (sketched below)
- Visualization and diff display
- Session state management
- Download functionality
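
A minimal sketch of what the two file-handling helpers might look like (the function bodies and the use of `temp_files/` are assumptions based on this README, not the app's actual code):

```python
import os
import shutil
import uuid

TEMP_DIR = "temp_files"  # auto-created temp storage, per the project structure above

def save_uploaded_file(uploaded_file) -> str:
    """Persist a Streamlit UploadedFile to disk and return its path."""
    os.makedirs(TEMP_DIR, exist_ok=True)
    path = os.path.join(TEMP_DIR, f"{uuid.uuid4().hex}_{uploaded_file.name}")
    with open(path, "wb") as f:
        f.write(uploaded_file.getbuffer())
    return path

def cleanup_temp_files() -> None:
    """Remove everything under the temp directory."""
    shutil.rmtree(TEMP_DIR, ignore_errors=True)
```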

### Utility Files

#### `src/utils/logging_utils.py`

**Purpose:** Logging configuration and management.

**Key Functions:**

- `get_log_handler()`: Creates in-memory log handlers and manages the log buffer for UI display (see the sketch below)

**Responsibilities:**

- Logging setup and configuration
- In-memory log capture
- Log display in the UI
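
A minimal sketch of how `get_log_handler()` could capture logs in memory for display in the UI (the actual implementation may differ):

```python
import io
import logging

def get_log_handler() -> tuple[logging.Handler, io.StringIO]:
    """Create a logging handler that writes records into an in-memory buffer."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
    return handler, buffer

# Attach the handler, then read buffer.getvalue() to render the logs in Streamlit.
handler, buffer = get_log_handler()
logging.getLogger().addHandler(handler)
```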

## 🔧 Detailed Function Documentation

### Document Processing Pipeline

#### `DocumentProcessor.process(file_path: str) -> DocumentResult`

**Purpose:** Main entry point for document processing.

**Parameters:**

- `file_path`: Path to the PDF file to process

**Returns:**

- `DocumentResult`: Object containing all processing results

**Process Flow:**

1. Converts the PDF using Docling
2. Exports structured markdown and JSON
3. Applies section redaction if an extractor is provided
4. Persists results to temporary files
5. Returns a comprehensive result object

**Example Usage:**

```python
processor = DocumentProcessor(section_extractor=extractor)
result = processor.process("document.pdf")
print(f"Original: {len(result.structured_markdown)} chars")
print(f"Redacted: {len(result.redacted_markdown)} chars")
```

#### `AzureO1MedicationExtractor.extract_medication_sections(doc_json: Dict) -> Dict`

**Purpose:** Uses Azure OpenAI to identify medication sections for redaction.

**Parameters:**

- `doc_json`: Docling-generated JSON structure

**Returns:**

- Dictionary with the indices to remove and the reasoning behind them

**Process Flow:**

1. Analyzes the document structure
2. Sends a structured prompt to Azure OpenAI
3. Parses the JSON response
4. Validates and limits the results
5. Returns the structured analysis

**Example Usage:**

```python
extractor = AzureO1MedicationExtractor(endpoint, api_key, version, deployment)
result = extractor.extract_medication_sections(doc_json)
print(f"Removing {len(result['indices_to_remove'])} elements")
```

#### `ReasoningSectionExtractor.remove_sections_from_json(doc_json: Dict) -> Dict`

**Purpose:** Removes the identified sections from the JSON structure.

**Parameters:**

- `doc_json`: Original document JSON structure

**Returns:**

- Redacted JSON structure

**Process Flow:**

1. Calls the LLM extractor for analysis
2. Logs detailed reasoning
3. Removes the identified text elements
4. Updates the document structure
5. Returns the redacted JSON
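
**Example Usage** (a sketch for symmetry with the examples above; the constructor arguments are an assumption, since the actual signature is not shown in this README):

```python
# Constructor argument assumed for illustration; see src/processing/sections.py.
extractor = ReasoningSectionExtractor(llm_extractor)
redacted_json = extractor.remove_sections_from_json(doc_json)
print(f"Text elements before: {len(doc_json.get('texts', []))}, "
      f"after: {len(redacted_json.get('texts', []))}")
```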

## 🚨 Troubleshooting

### Permission Error: `[Errno 13] Permission denied: '/.cache'`

**Problem:** When deploying to Hugging Face Spaces, you may encounter a permission error where the application tries to create cache directories in the root filesystem (`/.cache`).

**Root Cause:** Hugging Face Hub and other ML libraries default their cache directories to paths that resolve to the root filesystem inside a Spaces container, and the container does not have permission to write there.

**Solution:** This application redirects all cache directories to writable locations through environment variable configuration:

1. **Environment Variables:** All cache directories are redirected to `/tmp/docling_temp/` (see the sketch after the file list below)
2. **Lazy Initialization:** `DocumentConverter` is initialized lazily so the environment variables are set before it is first used
3. **Startup Script:** The Docker container uses a startup script that sets all necessary environment variables
4. **Test Script:** `test_permissions.py` verifies the environment setup

**Files Modified:**

- `src/streamlit_app.py`: Environment variables set at the very beginning
- `src/processing/document_processor.py`: Lazy initialization of `DocumentConverter`
- `Dockerfile`: Environment variables and startup script
- `test_permissions.py`: Environment verification script
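
A minimal sketch of the env-vars-before-import pattern described in items 1 and 2 (abridged to a few variables; the full list appears under "Environment Variables" below):

```python
# Must run before importing docling or any Hugging Face library,
# because cache locations are read at import time.
import os

TEMP_DIR = "/tmp/docling_temp"
for var, sub in [
    ("HF_HOME", "huggingface"),
    ("HF_HUB_CACHE", "huggingface_cache"),
    ("TRANSFORMERS_CACHE", "transformers_cache"),
    ("XDG_CACHE_HOME", "cache"),
]:
    os.environ.setdefault(var, os.path.join(TEMP_DIR, sub))
    os.makedirs(os.environ[var], exist_ok=True)  # ensure the directory exists

from docling.document_converter import DocumentConverter  # now safe to import
```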

**Testing:** Run the test script to verify the environment:

```bash
python test_permissions.py
```

**Expected Output:**

```
✅ ALL TESTS PASSED
🎉 All tests passed! The environment is ready for Docling.
```

### Other Common Issues

#### Memory Issues

- **Problem:** Large PDF files may cause memory issues
- **Solution:** The application includes automatic cleanup of temporary files and memory management

#### Azure OpenAI Configuration

- **Problem:** Missing or incorrect Azure OpenAI credentials
- **Solution:** Ensure the `.env` file contains:

```
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_VERSION=your_version
AZURE_OPENAI_DEPLOYMENT=your_deployment
```
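These values might be loaded with `python-dotenv` before constructing the extractor; a sketch (the extractor's import path is an assumption based on the project structure above):

```python
import os

from dotenv import load_dotenv  # python-dotenv
from processing.llm_extractor import AzureO1MedicationExtractor  # path assumed

load_dotenv()  # reads the .env file from the working directory

extractor = AzureO1MedicationExtractor(
    os.environ["AZURE_OPENAI_ENDPOINT"],
    os.environ["AZURE_OPENAI_KEY"],
    os.environ["AZURE_OPENAI_VERSION"],
    os.environ["AZURE_OPENAI_DEPLOYMENT"],
)
```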

#### File Upload Issues

- **Problem:** Files not uploading or processing
- **Solution:** Check file size limits and ensure the file is a supported PDF

## 🔧 Development and Deployment

### Local Development

1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Set up environment variables in `.env`
4. Run the test script: `python test_permissions.py`
5. Start the app: `streamlit run src/streamlit_app.py`

### Hugging Face Spaces Deployment

1. Push the code to the Space repository
2. Ensure the `Dockerfile` is present
3. Set the environment variables in the Space settings
4. Deploy and monitor the logs for any issues

### Environment Variables

The application uses these environment variables to control cache directories:

```bash
# Core temp directory
TEMP_DIR=/tmp/docling_temp

# Hugging Face Hub
HF_HOME=/tmp/docling_temp/huggingface
HF_CACHE_HOME=/tmp/docling_temp/huggingface_cache
HF_HUB_CACHE=/tmp/docling_temp/huggingface_cache

# ML libraries
TRANSFORMERS_CACHE=/tmp/docling_temp/transformers_cache
HF_DATASETS_CACHE=/tmp/docling_temp/datasets_cache
TORCH_HOME=/tmp/docling_temp/torch
TENSORFLOW_HOME=/tmp/docling_temp/tensorflow
KERAS_HOME=/tmp/docling_temp/keras

# XDG directories
XDG_CACHE_HOME=/tmp/docling_temp/cache
XDG_CONFIG_HOME=/tmp/docling_temp/config
XDG_DATA_HOME=/tmp/docling_temp/data
```
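
To confirm each location is actually writable inside the container, here is a standalone check in the spirit of `test_permissions.py` (a sketch, not the script's actual contents):

```python
import os
import tempfile

def check_writable(path: str) -> bool:
    """Try to create the directory and write a temporary file inside it."""
    try:
        os.makedirs(path, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False

for var in ("HF_HOME", "HF_HUB_CACHE", "TRANSFORMERS_CACHE", "XDG_CACHE_HOME"):
    path = os.environ.get(var)
    status = "OK" if path and check_writable(path) else "NOT WRITABLE"
    print(f"{var}: {path} -> {status}")
```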

## 📊 Performance and Monitoring

### Memory Management

- Automatic cleanup of temporary files
- Session state management
- File size monitoring

### Logging

- Comprehensive logging throughout the application
- In-memory log capture for UI display
- Error tracking and debugging information

### Caching

- Hugging Face model caching in temp directories
- Document processing result caching (illustrated below)
- Session state persistence
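
One way the result caching could work: key results by the upload's content hash in `st.session_state`, so a Streamlit rerun does not reprocess the same file (a sketch, not the app's exact code):

```python
import hashlib
import streamlit as st

def get_or_process(file_path: str, file_bytes: bytes, processor):
    """Reuse a previous result for the same upload instead of reprocessing it."""
    key = hashlib.sha256(file_bytes).hexdigest()
    if "results" not in st.session_state:
        st.session_state["results"] = {}
    cache = st.session_state["results"]
    if key not in cache:
        cache[key] = processor.process(file_path)  # expensive Docling + LLM pipeline
    return cache[key]
```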