
---
title: Docling
emoji: 📄
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Streamlit template space
---
# Medical Document Parser & Redactor

A medical document processing application that uses Docling, a structure-aware parser, to parse PDF medical documents and automatically redact medication information using AI-powered analysis.
## 🎯 Overview

This application provides a Streamlit-based interface for uploading medical PDF documents, parsing them with Docling to extract structured content, and using Azure OpenAI to intelligently identify and redact formal medication lists while preserving clinical context.
## 🏗️ Project Structure

```
docling/
├── src/                          # Main source code
│   ├── processing/               # Core processing logic
│   │   ├── __init__.py
│   │   ├── document_processor.py # Main document processing pipeline
│   │   ├── llm_extractor.py      # Azure OpenAI integration for medication detection
│   │   └── sections.py           # Section extraction and redaction logic
│   ├── utils/                    # Utility functions
│   │   ├── __init__.py
│   │   └── logging_utils.py      # Logging configuration and handlers
│   └── streamlit_app.py          # Main Streamlit application interface
├── temp_files/                   # Temporary file storage (auto-created)
├── .env                          # Environment variables (Azure OpenAI credentials)
├── requirements.txt              # Python dependencies
├── pyproject.toml                # Project configuration
├── Dockerfile                    # Container configuration
└── README.md                     # This file
```
## 📁 File Responsibilities

### Core Processing Files
#### `src/processing/document_processor.py`

**Purpose:** Main document processing pipeline that orchestrates the entire workflow.

**Key Classes:**
- `DocumentResult`: Data class holding processed results
- `DocumentProcessor`: Main processing class

**Key Functions:**
- `process(file_path)`: Main processing method
- `_export_redacted_markdown()`: Generates redacted markdown
- `_reconstruct_markdown_from_filtered_texts()`: Reconstructs markdown from filtered content

**Responsibilities:**
- Document conversion using Docling
- Section redaction coordination
- Markdown generation and reconstruction
- File persistence and logging
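
As a reference point, here is a minimal sketch of `DocumentResult`. Only `structured_markdown` and `redacted_markdown` are confirmed by the usage examples later in this README; the remaining field is an assumption:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class DocumentResult:
    """Container for processed document outputs (illustrative sketch)."""
    structured_markdown: str         # Markdown exported by Docling (referenced below)
    redacted_markdown: str           # Markdown after medication sections are removed
    doc_json: Optional[Dict] = None  # Assumed: Docling's structured JSON, if retained
```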
#### `src/processing/llm_extractor.py`

**Purpose:** Azure OpenAI integration for intelligent medication detection.

**Key Classes:**
- `AzureO1MedicationExtractor`: LLM-based medication extractor

**Key Functions:**
- `extract_medication_sections(doc_json)`: Main extraction method
- `__init__()`: Azure OpenAI client initialization

**Responsibilities:**
- Azure OpenAI API communication
- Medication section identification
- Structured JSON response generation
- Error handling and logging
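
Judging from the fields referenced elsewhere in this README, the structured response looks roughly like the following. Only `indices_to_remove` is confirmed; the `reasoning` key and its format are assumptions:

```python
# Illustrative response shape. `indices_to_remove` appears in the usage
# example later in this README; `reasoning` and its format are assumptions.
example_result = {
    "indices_to_remove": [12, 13, 14],  # text elements flagged as a formal medication list
    "reasoning": "Elements 12-14 form a medication list under the 'Medications' header.",
}
```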
#### `src/processing/sections.py`

**Purpose:** Section extraction and redaction logic.

**Key Classes:**
- `ReasoningSectionExtractor`: AI-powered section extractor
- `SectionDefinition`: Section definition data class
- `SectionExtractor`: Traditional regex-based extractor

**Key Functions:**
- `remove_sections_from_json()`: JSON-based section removal
- `remove_sections()`: Text-based section removal (fallback)

**Responsibilities:**
- Section identification and removal
- JSON structure manipulation
- Text processing and redaction
- Reasoning logging and transparency
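
To illustrate the regex-based fallback, here is a minimal sketch of text-based section removal. The real `SectionExtractor` implementation, its patterns, and its heading conventions are not shown in this README, so treat this as an assumption-laden example of the general technique:

```python
import re

def remove_sections(text: str, headings: list[str]) -> str:
    """Drop each matched heading and its body, up to the next heading or
    the end of the text. Illustrative only; the real SectionExtractor's
    patterns and behavior may differ."""
    for heading in headings:
        # Match the heading line plus its body, stopping at the next
        # markdown-style heading or the end of the document.
        pattern = rf"(?ms)^#+\s*{re.escape(heading)}\b.*?(?=^#+\s|\Z)"
        text = re.sub(pattern, "", text)
    return text

sample = "# History\nStable.\n# Medications\nAspirin 81 mg daily.\n# Plan\nFollow up."
print(remove_sections(sample, ["Medications"]))
# -> "# History\nStable.\n# Plan\nFollow up."
```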
### Interface Files

#### `src/streamlit_app.py`

**Purpose:** Main Streamlit web application interface.

**Key Functions:**
- `save_uploaded_file()`: File upload handling
- `cleanup_temp_files()`: Temporary file management
- `create_diff_content()`: Diff view generation

**Responsibilities:**
- User interface and interaction
- File upload and management
- Visualization and diff display
- Session state management
- Download functionality
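
A hedged sketch of the upload-handling pair named above; the actual signatures and the exact temp directory handling in the project may differ:

```python
import os
import shutil

import streamlit as st

TEMP_DIR = "temp_files"  # Assumed to match the project's temp_files/ directory

def save_uploaded_file(uploaded_file) -> str:
    """Write a Streamlit UploadedFile to disk and return its path (sketch)."""
    os.makedirs(TEMP_DIR, exist_ok=True)
    path = os.path.join(TEMP_DIR, uploaded_file.name)
    with open(path, "wb") as f:
        f.write(uploaded_file.getbuffer())
    return path

def cleanup_temp_files() -> None:
    """Remove the temporary directory and everything in it (sketch)."""
    shutil.rmtree(TEMP_DIR, ignore_errors=True)
```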
### Utility Files

#### `src/utils/logging_utils.py`

**Purpose:** Logging configuration and management.

**Key Functions:**
- `get_log_handler()`: Creates in-memory log handlers
- Log buffer management for UI display

**Responsibilities:**
- Logging setup and configuration
- In-memory log capture
- Log display in the UI
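
A minimal sketch of the in-memory capture pattern; the real `get_log_handler()` signature is not shown in this README, so the return type here is an assumption:

```python
import io
import logging

def get_log_handler() -> tuple[logging.Handler, io.StringIO]:
    """Create a handler that captures log records in memory (sketch)."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    return handler, buffer

# Usage: attach the handler, run processing, then read buffer.getvalue()
# and render it in the Streamlit UI.
handler, buffer = get_log_handler()
logging.getLogger().addHandler(handler)
```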
## 🔧 Detailed Function Documentation

### Document Processing Pipeline

#### `DocumentProcessor.process(file_path: str) -> DocumentResult`

**Purpose:** Main entry point for document processing.

**Parameters:**
- `file_path`: Path to the PDF file to process

**Returns:**
- `DocumentResult`: Object containing all processing results

**Process Flow:**
1. Converts the PDF using Docling
2. Exports structured markdown and JSON
3. Applies section redaction if an extractor is provided
4. Persists results to temporary files
5. Returns a comprehensive result object
**Example Usage:**

```python
processor = DocumentProcessor(section_extractor=extractor)
result = processor.process("document.pdf")
print(f"Original: {len(result.structured_markdown)} chars")
print(f"Redacted: {len(result.redacted_markdown)} chars")
```
#### `AzureO1MedicationExtractor.extract_medication_sections(doc_json: Dict) -> Dict`

**Purpose:** Uses Azure OpenAI to identify medication sections for redaction.

**Parameters:**
- `doc_json`: Docling-generated JSON structure

**Returns:**
- Dictionary with indices to remove and reasoning

**Process Flow:**
1. Analyzes the document structure
2. Sends a structured prompt to Azure OpenAI
3. Parses the JSON response
4. Validates and limits the results
5. Returns the structured analysis
**Example Usage:**

```python
extractor = AzureO1MedicationExtractor(endpoint, api_key, version, deployment)
result = extractor.extract_medication_sections(doc_json)
print(f"Removing {len(result['indices_to_remove'])} elements")
```
#### `ReasoningSectionExtractor.remove_sections_from_json(doc_json: Dict) -> Dict`

**Purpose:** Removes identified sections from the JSON structure.

**Parameters:**
- `doc_json`: Original document JSON structure

**Returns:**
- Redacted JSON structure

**Process Flow:**
1. Calls the LLM extractor for analysis
2. Logs detailed reasoning
3. Removes identified text elements
4. Updates the document structure
5. Returns the redacted JSON
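For parity with the examples above, a usage sketch. This README does not show how `ReasoningSectionExtractor` is constructed, so the constructor argument is an assumption:

```python
# Hypothetical wiring: the constructor signature is not documented here,
# so passing the LLM extractor this way is an assumption.
section_extractor = ReasoningSectionExtractor(llm_extractor=extractor)
redacted_json = section_extractor.remove_sections_from_json(doc_json)
```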
## 🚨 Troubleshooting

### Permission Error: `[Errno 13] Permission denied: '/.cache'`

**Problem:** When deploying to Hugging Face Spaces, you may encounter a permission error where the application tries to create cache directories in the root filesystem (`/.cache`).

**Root Cause:** Hugging Face Hub and other ML libraries try to create cache directories in the root filesystem by default, but containers in Hugging Face Spaces don't have permission to write to the root directory.

**Solution:** This application includes comprehensive environment variable configuration to redirect all cache directories to writable locations (see the sketch after this list):
- **Environment Variables:** All cache directories are redirected to `/tmp/docling_temp/`
- **Lazy Initialization:** The `DocumentConverter` is initialized lazily to ensure the environment variables are set first
- **Startup Script:** The Docker container uses a startup script that sets all necessary environment variables
- **Test Script:** `test_permissions.py` verifies the environment setup
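
A minimal sketch of this pattern, assuming a module-level helper (the actual function names in the project are not shown in this README): the environment variables are set before any cache-reading library is imported, and the converter is created lazily.

```python
# At the very top of the entry point, before importing docling or any
# Hugging Face library, redirect caches to a writable location. Only a
# few variables are shown; the full set appears under "Environment
# Variables" below.
import os

os.environ.setdefault("HF_HOME", "/tmp/docling_temp/huggingface")
os.environ.setdefault("HF_HUB_CACHE", "/tmp/docling_temp/huggingface_cache")
os.environ.setdefault("XDG_CACHE_HOME", "/tmp/docling_temp/cache")

_converter = None

def get_converter():
    """Create the DocumentConverter on first use (lazy initialization),
    after the environment variables above are already in place."""
    global _converter
    if _converter is None:
        from docling.document_converter import DocumentConverter  # deferred import
        _converter = DocumentConverter()
    return _converter
```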
**Files Modified:**
- `src/streamlit_app.py`: Environment variables set at the very beginning
- `src/processing/document_processor.py`: Lazy initialization of `DocumentConverter`
- `Dockerfile`: Environment variables and startup script
- `test_permissions.py`: Environment verification script
**Testing:** Run the test script to verify the environment:

```bash
python test_permissions.py
```

**Expected Output:**

```
✅ ALL TESTS PASSED
🎉 All tests passed! The environment is ready for Docling.
```
### Other Common Issues

#### Memory Issues
- **Problem:** Large PDF files may cause memory issues
- **Solution:** The application includes automatic cleanup of temporary files and memory management

#### Azure OpenAI Configuration
- **Problem:** Missing or incorrect Azure OpenAI credentials
- **Solution:** Ensure the `.env` file contains (see the loading sketch below):

```
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_VERSION=your_version
AZURE_OPENAI_DEPLOYMENT=your_deployment
```
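
To show how these values would be consumed, a sketch assuming the app loads `.env` with python-dotenv (the loading mechanism is not confirmed by this README; the argument order matches the constructor example above):

```python
import os

from dotenv import load_dotenv

from src.processing.llm_extractor import AzureO1MedicationExtractor

load_dotenv()  # assumed: reads .env from the working directory
extractor = AzureO1MedicationExtractor(
    os.getenv("AZURE_OPENAI_ENDPOINT"),
    os.getenv("AZURE_OPENAI_KEY"),
    os.getenv("AZURE_OPENAI_VERSION"),
    os.getenv("AZURE_OPENAI_DEPLOYMENT"),
)
```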
#### File Upload Issues
- **Problem:** Files not uploading or processing
- **Solution:** Check the file size limits and ensure the PDF format is supported
## 🔧 Development and Deployment

### Local Development
1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Set up environment variables in `.env`
4. Run the test script: `python test_permissions.py`
5. Start the app: `streamlit run src/streamlit_app.py`

### Hugging Face Spaces Deployment
1. Push the code to the repository
2. Ensure the `Dockerfile` is present
3. Set environment variables in the Space settings
4. Deploy and monitor the logs for any issues
### Environment Variables

The application uses these environment variables to control cache directories:

```bash
# Core temp directory
TEMP_DIR=/tmp/docling_temp

# Hugging Face Hub
HF_HOME=/tmp/docling_temp/huggingface
HF_CACHE_HOME=/tmp/docling_temp/huggingface_cache
HF_HUB_CACHE=/tmp/docling_temp/huggingface_cache

# ML Libraries
TRANSFORMERS_CACHE=/tmp/docling_temp/transformers_cache
HF_DATASETS_CACHE=/tmp/docling_temp/datasets_cache
TORCH_HOME=/tmp/docling_temp/torch
TENSORFLOW_HOME=/tmp/docling_temp/tensorflow
KERAS_HOME=/tmp/docling_temp/keras

# XDG Directories
XDG_CACHE_HOME=/tmp/docling_temp/cache
XDG_CONFIG_HOME=/tmp/docling_temp/config
XDG_DATA_HOME=/tmp/docling_temp/data
```
## 📊 Performance and Monitoring

### Memory Management
- Automatic cleanup of temporary files
- Session state management
- File size monitoring

### Logging
- Comprehensive logging throughout the application
- In-memory log capture for UI display
- Error tracking and debugging information

### Caching
- Hugging Face model caching in temp directories
- Document processing result caching
- Session state persistence