---
title: Docling
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: Streamlit template space
---

# Welcome to Streamlit!

Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:

If you have any questions, check out our [documentation](https://docs.streamlit.io) and [community
forums](https://discuss.streamlit.io).

# Medical Document Parser & Redactor

This is a medical document processing application. It uses **Docling**, a structure-aware parser, to parse PDF medical documents, and then automatically redacts medication information identified by an Azure OpenAI model.

## 🎯 Overview

This application provides a Streamlit-based interface for uploading medical PDF documents, parsing them with Docling to extract structured content, and using Azure OpenAI to intelligently identify and redact formal medication lists while preserving clinical context.

## 🏗️ Project Structure

```
docling/
├── src/                          # Main source code
│   ├── processing/               # Core processing logic
│   │   ├── __init__.py
│   │   ├── document_processor.py # Main document processing pipeline
│   │   ├── llm_extractor.py      # Azure OpenAI integration for medication detection
│   │   └── sections.py           # Section extraction and redaction logic
│   ├── utils/                    # Utility functions
│   │   ├── __init__.py
│   │   └── logging_utils.py      # Logging configuration and handlers
│   └── streamlit_app.py          # Main Streamlit application interface
├── temp_files/                   # Temporary file storage (auto-created)
├── .env                          # Environment variables (Azure OpenAI credentials)
├── requirements.txt              # Python dependencies
├── pyproject.toml                # Project configuration
├── Dockerfile                    # Container configuration
└── README.md                     # This file
```

## 📁 File Responsibilities

### Core Processing Files

#### `src/processing/document_processor.py`
**Purpose**: Main document processing pipeline that orchestrates the entire workflow.

**Key Classes**:
- `DocumentResult`: Data class holding processed results
- `DocumentProcessor`: Main processing class

**Key Functions**:
- `process(file_path)`: Main processing method
- `_export_redacted_markdown()`: Generates redacted markdown
- `_reconstruct_markdown_from_filtered_texts()`: Reconstructs markdown from filtered content

**Responsibilities**:
- Document conversion using Docling
- Section redaction coordination
- Markdown generation and reconstruction
- File persistence and logging

#### `src/processing/llm_extractor.py`
**Purpose**: Azure OpenAI integration for intelligent medication detection.

**Key Classes**:
- `AzureO1MedicationExtractor`: LLM-based medication extractor

**Key Functions**:
- `extract_medication_sections(doc_json)`: Main extraction method
- `__init__()`: Azure OpenAI client initialization

**Responsibilities**:
- Azure OpenAI API communication
- Medication section identification
- Structured JSON response generation
- Error handling and logging

#### `src/processing/sections.py`
**Purpose**: Section extraction and redaction logic.

**Key Classes**:
- `ReasoningSectionExtractor`: AI-powered section extractor
- `SectionDefinition`: Section definition data class
- `SectionExtractor`: Traditional regex-based extractor

**Key Functions**:
- `remove_sections_from_json()`: JSON-based section removal
- `remove_sections()`: Text-based section removal (fallback)

**Responsibilities**:
- Section identification and removal
- JSON structure manipulation
- Text processing and redaction
- Reasoning logging and transparency

### Interface Files

#### `src/streamlit_app.py`
**Purpose**: Main Streamlit web application interface.

**Key Functions**:
- `save_uploaded_file()`: File upload handling
- `cleanup_temp_files()`: Temporary file management
- `create_diff_content()`: Diff view generation

**Responsibilities**:
- User interface and interaction
- File upload and management
- Visualization and diff display
- Session state management
- Download functionality
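
The diff view can be illustrated with the standard library. The following is a minimal sketch, not the app's actual `create_diff_content()`; the helper name `make_unified_diff` and the file labels are hypothetical, and it assumes the original and redacted markdown are available as plain strings.

```python
import difflib


def make_unified_diff(original: str, redacted: str) -> str:
    """Return a unified diff between the original and redacted markdown."""
    diff_lines = difflib.unified_diff(
        original.splitlines(),
        redacted.splitlines(),
        fromfile="original.md",  # hypothetical label
        tofile="redacted.md",    # hypothetical label
        lineterm="",
    )
    return "\n".join(diff_lines)


original = "# Visit\nMedications:\n- Drug A 10 mg\nPlan: follow up"
redacted = "# Visit\nPlan: follow up"
print(make_unified_diff(original, redacted))
```

Lines removed by redaction appear with a leading `-`, which makes them easy to highlight in a UI.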

### Utility Files

#### `src/utils/logging_utils.py`
**Purpose**: Logging configuration and management.

**Key Functions**:
- `get_log_handler()`: Creates in-memory log handlers
- Log buffer management for UI display

**Responsibilities**:
- Logging setup and configuration
- In-memory log capture
- Log display in UI
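
One way to capture logs in memory for later display is to attach a `logging.StreamHandler` to an `io.StringIO` buffer. This is a sketch under that assumption and may differ from how `get_log_handler()` is written; `make_memory_handler` and the logger name `docling_demo` are hypothetical.

```python
import io
import logging


def make_memory_handler(level: int = logging.INFO):
    """Create a handler that writes formatted records into an in-memory buffer."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    return handler, buffer


logger = logging.getLogger("docling_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger
handler, buffer = make_memory_handler()
logger.addHandler(handler)

logger.info("Converted document")
logger.debug("not captured at INFO level")

captured = buffer.getvalue()  # text that a UI could render
print(captured)
```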

## 🔧 Detailed Function Documentation

### Document Processing Pipeline

#### `DocumentProcessor.process(file_path: str) -> DocumentResult`
**Purpose**: Main entry point for document processing.

**Parameters**:
- `file_path`: Path to the PDF file to process

**Returns**:
- `DocumentResult`: Object containing all processing results

**Process Flow**:
1. Converts PDF using Docling
2. Exports structured markdown and JSON
3. Applies section redaction if extractor is provided
4. Persists results to temporary files
5. Returns comprehensive result object

**Example Usage**:
```python
processor = DocumentProcessor(section_extractor=extractor)
result = processor.process("document.pdf")
print(f"Original: {len(result.structured_markdown)} chars")
print(f"Redacted: {len(result.redacted_markdown)} chars")
```

#### `AzureO1MedicationExtractor.extract_medication_sections(doc_json: Dict) -> Dict`
**Purpose**: Uses Azure OpenAI to identify medication sections for redaction.

**Parameters**:
- `doc_json`: Docling-generated JSON structure

**Returns**:
- Dictionary with indices to remove and reasoning

**Process Flow**:
1. Analyzes document structure
2. Sends structured prompt to Azure OpenAI
3. Parses JSON response
4. Validates and limits results
5. Returns structured analysis

**Example Usage**:
```python
extractor = AzureO1MedicationExtractor(endpoint, api_key, version, deployment)
result = extractor.extract_medication_sections(doc_json)
print(f"Removing {len(result['indices_to_remove'])} elements")
```
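
Step 4 of the process flow ("Validates and limits results") might look like the sketch below. It assumes the model replies with a JSON object containing an `indices_to_remove` list, as the usage example suggests; the helper name `parse_indices` and the cap of 50 removals are hypothetical choices, not values taken from the project.

```python
import json


def parse_indices(raw_response: str, num_texts: int, max_removals: int = 50) -> list[int]:
    """Parse an LLM reply and keep only valid, unique, in-range indices."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # unparseable reply: remove nothing
    if not isinstance(payload, dict):
        return []
    candidates = payload.get("indices_to_remove", [])
    if not isinstance(candidates, list):
        return []
    valid = sorted(
        {
            i
            for i in candidates
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(i, int) and not isinstance(i, bool) and 0 <= i < num_texts
        }
    )
    return valid[:max_removals]  # cap how much a single reply can remove


print(parse_indices('{"indices_to_remove": [3, 1, 1, 99, "x", -2]}', num_texts=10))
```

Returning an empty list on malformed input means a bad model reply leaves the document unredacted rather than corrupting it; whether that is the right failure mode depends on the deployment.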

#### `ReasoningSectionExtractor.remove_sections_from_json(doc_json: Dict) -> Dict`
**Purpose**: Removes identified sections from JSON structure.

**Parameters**:
- `doc_json`: Original document JSON structure

**Returns**:
- Redacted JSON structure

**Process Flow**:
1. Calls LLM extractor for analysis
2. Logs detailed reasoning
3. Removes identified text elements
4. Updates document structure
5. Returns redacted JSON
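
Step 3 of this flow can be sketched as a pure function over the JSON dictionary. The sketch assumes the text elements live in a top-level `texts` list and that the indices refer to positions in that list; `drop_text_elements` is a hypothetical name. It does not update any cross-references between elements, which a real implementation of step 4 would also have to handle.

```python
import copy


def drop_text_elements(doc_json: dict, indices_to_remove: list[int]) -> dict:
    """Return a copy of doc_json without the text elements at the given indices."""
    redacted = copy.deepcopy(doc_json)  # leave the caller's document untouched
    drop = set(indices_to_remove)
    redacted["texts"] = [
        element
        for index, element in enumerate(redacted.get("texts", []))
        if index not in drop
    ]
    return redacted


doc = {"texts": [{"text": "History"}, {"text": "Drug A 10 mg"}, {"text": "Plan"}]}
result = drop_text_elements(doc, [1])
print([element["text"] for element in result["texts"]])
```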

## 🚨 Troubleshooting

### Permission Error: `[Errno 13] Permission denied: '/.cache'`

**Problem**: When deploying to Hugging Face Spaces, you may encounter a permission error where the application tries to create cache directories in the root filesystem (`/.cache`).

**Root Cause**: Hugging Face Hub and other ML libraries try to create cache directories in the root filesystem by default, but containers in Hugging Face Spaces don't have permission to write to the root directory.

**Solution**: This application includes comprehensive environment variable configuration to redirect all cache directories to writable locations:

1. **Environment Variables**: All cache directories are redirected to `/tmp/docling_temp/`
2. **Lazy Initialization**: DocumentConverter is initialized lazily to ensure environment variables are set first
3. **Startup Script**: Docker container uses a startup script that sets all necessary environment variables
4. **Test Script**: `test_permissions.py` verifies the environment setup

**Files Modified**:
- `src/streamlit_app.py`: Environment variables set at the very beginning
- `src/processing/document_processor.py`: Lazy initialization of DocumentConverter
- `Dockerfile`: Environment variables and startup script
- `test_permissions.py`: Environment verification script
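
The lazy-initialization idea can be shown generically. In this sketch the expensive object is built only on first access, so any environment variables set earlier are already in place. `LazyConverter` and `fake_factory` are hypothetical stand-ins; in the app the factory would construct the Docling converter.

```python
from typing import Any, Callable, Optional


class LazyConverter:
    """Defer building an expensive object until it is first requested."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._instance: Optional[Any] = None

    def get(self) -> Any:
        if self._instance is None:
            # Built on first use, after the environment has been configured
            self._instance = self._factory()
        return self._instance


calls = []


def fake_factory() -> object:
    """Stand-in for constructing the real converter."""
    calls.append("built")
    return object()


lazy = LazyConverter(fake_factory)
first = lazy.get()
second = lazy.get()
print(len(calls), first is second)
```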

**Testing**: Run the test script to verify the environment:
```bash
python test_permissions.py
```

**Expected Output**:
```
✅ ALL TESTS PASSED
🎉 All tests passed! The environment is ready for Docling.
```

### Other Common Issues

#### Memory Issues
- **Problem**: Large PDF files may cause memory issues
- **Solution**: The application includes automatic cleanup of temporary files and memory management

#### Azure OpenAI Configuration
- **Problem**: Missing or incorrect Azure OpenAI credentials
- **Solution**: Ensure `.env` file contains:
  ```
  AZURE_OPENAI_ENDPOINT=your_endpoint
  AZURE_OPENAI_KEY=your_key
  AZURE_OPENAI_VERSION=your_version
  AZURE_OPENAI_DEPLOYMENT=your_deployment
  ```
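
A quick way to catch a misconfigured `.env` is to check for the four variables at startup. This is a minimal sketch; `missing_azure_settings` is a hypothetical helper, and it assumes the variables have already been loaded into the process environment.

```python
import os
from typing import Mapping

REQUIRED_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
)


def missing_azure_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required settings that are absent or blank."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


example_env = {"AZURE_OPENAI_ENDPOINT": "your_endpoint", "AZURE_OPENAI_KEY": " "}
print(missing_azure_settings(example_env))
```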

#### File Upload Issues
- **Problem**: Files not uploading or processing
- **Solution**: Check file size limits and ensure PDF format is supported

## 🔧 Development and Deployment

### Local Development
1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Set up environment variables in `.env`
4. Run the test script: `python test_permissions.py`
5. Start the app: `streamlit run src/streamlit_app.py`

### Hugging Face Spaces Deployment
1. Push code to repository
2. Ensure `Dockerfile` is present
3. Set environment variables in Spaces settings
4. Deploy and monitor logs for any issues

### Environment Variables
The application uses these environment variables to control cache directories:

```bash
# Core temp directory
TEMP_DIR=/tmp/docling_temp

# Hugging Face Hub
HF_HOME=/tmp/docling_temp/huggingface
HF_CACHE_HOME=/tmp/docling_temp/huggingface_cache
HF_HUB_CACHE=/tmp/docling_temp/huggingface_cache

# ML Libraries
TRANSFORMERS_CACHE=/tmp/docling_temp/transformers_cache
HF_DATASETS_CACHE=/tmp/docling_temp/datasets_cache
TORCH_HOME=/tmp/docling_temp/torch
TENSORFLOW_HOME=/tmp/docling_temp/tensorflow
KERAS_HOME=/tmp/docling_temp/keras

# XDG Directories
XDG_CACHE_HOME=/tmp/docling_temp/cache
XDG_CONFIG_HOME=/tmp/docling_temp/config
XDG_DATA_HOME=/tmp/docling_temp/data
```
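
When these variables are set from Python rather than from the `Dockerfile`, they need to be in place before the ML libraries are imported. The sketch below covers a subset of the variables listed above and creates each directory so it is writable; `configure_cache_dirs` is a hypothetical helper, and the demo uses a throwaway directory instead of `/tmp/docling_temp`.

```python
import os
import tempfile
from pathlib import Path


def configure_cache_dirs(base_dir: str = "/tmp/docling_temp") -> dict[str, str]:
    """Create cache directories under base_dir and point env vars at them."""
    base = Path(base_dir)
    mapping = {
        "TEMP_DIR": base,
        "HF_HOME": base / "huggingface",
        "HF_HUB_CACHE": base / "huggingface_cache",
        "TRANSFORMERS_CACHE": base / "transformers_cache",
        "TORCH_HOME": base / "torch",
        "XDG_CACHE_HOME": base / "cache",
    }
    applied = {}
    for name, path in mapping.items():
        path.mkdir(parents=True, exist_ok=True)  # ensure the target exists
        os.environ[name] = str(path)
        applied[name] = str(path)
    return applied


# Demo against a throwaway directory so the example is safe to run anywhere.
demo_base = tempfile.mkdtemp()
applied = configure_cache_dirs(demo_base)
print(applied["HF_HOME"])
```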

## 📊 Performance and Monitoring

### Memory Management
- Automatic cleanup of temporary files
- Session state management
- File size monitoring

### Logging
- Comprehensive logging throughout the application
- In-memory log capture for UI display
- Error tracking and debugging information

### Caching
- Hugging Face model caching in temp directories
- Document processing result caching
- Session state persistence