Jupyter Notebook Usage

This document shows how to call process_document_with_redaction from a Jupyter notebook so that it can be integrated into larger processing pipelines.

Simple Usage

from processing.document_processor import process_document_with_redaction

# Process a single document
result = process_document_with_redaction(
    file_path="path/to/your/document.pdf",
    endpoint="your-azure-openai-endpoint",
    api_key="your-azure-openai-key",
    api_version="2024-02-15-preview",
    deployment="o3-mini"  # or "o4-mini", "o3", "o4"
)

# Access the results
original_md = result.original_document_md
redacted_md = result.redacted_document_md
input_tokens = result.input_tokens
output_tokens = result.output_tokens
cost = result.cost

print("Processing complete!")
print(f"Input tokens: {input_tokens:,}")
print(f"Output tokens: {output_tokens:,}")
print(f"Total cost: ${cost:.4f}")

Batch Processing

import os
from processing.document_processor import process_document_with_redaction

# Configuration
AZURE_OPENAI_ENDPOINT = "your-azure-openai-endpoint"
AZURE_OPENAI_KEY = "your-azure-openai-key"
AZURE_OPENAI_VERSION = "2024-02-15-preview"
AZURE_OPENAI_DEPLOYMENT = "o3-mini"

# Process multiple documents
pdf_directory = "path/to/pdf/files"
results = []

for filename in sorted(os.listdir(pdf_directory)):
    # Match the extension case-insensitively so files like "REPORT.PDF" are included
    if filename.lower().endswith('.pdf'):
        file_path = os.path.join(pdf_directory, filename)
        
        print(f"Processing {filename}...")
        
        try:
            result = process_document_with_redaction(
                file_path=file_path,
                endpoint=AZURE_OPENAI_ENDPOINT,
                api_key=AZURE_OPENAI_KEY,
                api_version=AZURE_OPENAI_VERSION,
                deployment=AZURE_OPENAI_DEPLOYMENT
            )
            
            results.append({
                'filename': filename,
                'original_md': result.original_document_md,
                'redacted_md': result.redacted_document_md,
                'input_tokens': result.input_tokens,
                'output_tokens': result.output_tokens,
                'cost': result.cost
            })
            
            print(f"  ✓ Completed - Cost: ${result.cost:.4f}")
            
        except Exception as e:
            print(f"  ✗ Error processing {filename}: {e}")

# Summary
total_cost = sum(r['cost'] for r in results)
total_input_tokens = sum(r['input_tokens'] for r in results)
total_output_tokens = sum(r['output_tokens'] for r in results)

print("\nBatch processing complete!")
print(f"Documents processed: {len(results)}")
print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
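
For a longer pipeline it can help to persist the per-document numbers instead of only printing them. The following is a minimal sketch, assuming the results list built in the loop above; the helper name write_batch_summary and the CSV layout are illustrative choices, not part of the library.

```python
import csv
from pathlib import Path


def write_batch_summary(results, csv_path):
    """Write one CSV row per processed document and return the row count.

    `results` is the list of dicts built in the batch loop. The Markdown
    fields are skipped so the summary stays small.
    """
    fieldnames = ["filename", "input_tokens", "output_tokens", "cost"]
    csv_path = Path(csv_path)
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        # extrasaction="ignore" drops keys such as 'original_md' and 'redacted_md'
        writer = csv.DictWriter(fh, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in results:
            writer.writerow(row)
    return len(results)
```

Calling write_batch_summary(results, "batch_summary.csv") after the loop leaves a file that can be loaded later without re-running the documents through the API.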

Environment Variables

To keep credentials out of the notebook, you can load the configuration from environment variables instead (here via a .env file read by python-dotenv):

import os
from dotenv import load_dotenv
from processing.document_processor import process_document_with_redaction

# Load environment variables
load_dotenv()

# Get configuration from environment
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_KEY = os.getenv("AZURE_OPENAI_KEY")
AZURE_OPENAI_VERSION = os.getenv("AZURE_OPENAI_VERSION")
AZURE_OPENAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT")

# Process document
result = process_document_with_redaction(
    file_path="document.pdf",
    endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_KEY,
    api_version=AZURE_OPENAI_VERSION,
    deployment=AZURE_OPENAI_DEPLOYMENT
)
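
os.getenv returns None for a variable that is not set, so a missing entry in the .env file would otherwise only surface once the API call fails. A small guard can catch this earlier. This is a sketch under the assumption that all four variables are required; load_azure_config is a hypothetical helper name, not something the library provides.

```python
import os

# The four variable names used in the example above.
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
)


def load_azure_config(env=None):
    """Return the settings as keyword arguments, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "endpoint": env["AZURE_OPENAI_ENDPOINT"],
        "api_key": env["AZURE_OPENAI_KEY"],
        "api_version": env["AZURE_OPENAI_VERSION"],
        "deployment": env["AZURE_OPENAI_DEPLOYMENT"],
    }
```

After load_dotenv() has run, the returned dict can be unpacked into the call: process_document_with_redaction(file_path="document.pdf", **load_azure_config()).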

Return Value

The function returns a ProcessingResult object with the following attributes:

  • original_document_md: Markdown version of the original document
  • redacted_document_md: Markdown version with medication sections removed
  • input_tokens: Number of input tokens used
  • output_tokens: Number of output tokens generated
  • cost: Total cost in USD
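
As an illustration of working with these attributes, the sketch below writes both Markdown versions to disk. It relies only on the two attribute names listed above; the function name save_markdown_outputs and the file naming scheme are assumptions made for this example.

```python
from pathlib import Path


def save_markdown_outputs(result, output_dir, stem):
    """Write the original and redacted Markdown to two files and return their paths.

    `result` is any object exposing `original_document_md` and
    `redacted_document_md`, such as the value returned by
    process_document_with_redaction.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    original_path = out / f"{stem}.original.md"
    redacted_path = out / f"{stem}.redacted.md"
    original_path.write_text(result.original_document_md, encoding="utf-8")
    redacted_path.write_text(result.redacted_document_md, encoding="utf-8")
    return original_path, redacted_path
```

In the batch loop, a call such as save_markdown_outputs(result, "output", Path(filename).stem) keeps each document's two versions side by side.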

Supported Models

The function supports the following Azure OpenAI deployment names:

  • o3-mini (GPT-4o Mini) - Cheapest option
  • o4-mini (GPT-4o Mini) - Same as o3-mini
  • o3 (GPT-3.5 Turbo) - Medium cost
  • o4 (GPT-4o) - Most expensive but most capable
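
A notebook-side check can reject a mistyped deployment name before any request is sent. This is only a sketch: the tuple below copies the four names from the list above, and the document does not say whether the function performs its own validation, so check_deployment is a hypothetical helper.

```python
# Deployment names copied from the list above.
SUPPORTED_DEPLOYMENTS = ("o3-mini", "o4-mini", "o3", "o4")


def check_deployment(name):
    """Return `name` unchanged if it is a listed deployment, otherwise raise ValueError."""
    if name not in SUPPORTED_DEPLOYMENTS:
        raise ValueError(
            f"Unsupported deployment {name!r}; expected one of: "
            + ", ".join(SUPPORTED_DEPLOYMENTS)
        )
    return name
```

It can wrap the argument directly, for example deployment=check_deployment(AZURE_OPENAI_DEPLOYMENT).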

Error Handling

The function will raise exceptions for:

  • File not found
  • Invalid Azure OpenAI credentials
  • API rate limits
  • Network errors

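Catch these in your pipeline so that one failing document does not stop the whole run. Rate-limit and network errors are usually worth retrying after a pause, while a missing file or invalid credentials will fail again until the input or configuration is fixed.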
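
One way to treat transient failures differently from permanent ones is a small retry wrapper with exponential backoff. The sketch below is generic: this document does not name the exception classes the function raises for rate limits or network errors, so the caller passes them in through retry_on, and call_with_retries is a hypothetical helper rather than part of the library.

```python
import time


def call_with_retries(func, *, retry_on, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry on the given exception types with exponential backoff.

    retry_on   -- tuple of exception classes considered transient
    attempts   -- total number of calls before giving up
    base_delay -- seconds to wait after the first failure; doubles each time
    sleep      -- injectable for testing; defaults to time.sleep
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Exceptions not listed in retry_on, such as FileNotFoundError, propagate immediately. To use it, wrap the call in a zero-argument function, for example call_with_retries(lambda: process_document_with_redaction(...), retry_on=(ConnectionError,)), replacing ConnectionError with whichever transient error types you actually observe.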