# Jupyter Notebook Usage
This document shows how to use the document processing function in Jupyter notebooks for integration into larger processing pipelines.
## Simple Usage

```python
from processing.document_processor import process_document_with_redaction

# Process a single document
result = process_document_with_redaction(
    file_path="path/to/your/document.pdf",
    endpoint="your-azure-openai-endpoint",
    api_key="your-azure-openai-key",
    api_version="2024-02-15-preview",
    deployment="o3-mini"  # or "o4-mini", "o3", "o4"
)

# Access the results
original_md = result.original_document_md
redacted_md = result.redacted_document_md
input_tokens = result.input_tokens
output_tokens = result.output_tokens
cost = result.cost

print("Processing complete!")
print(f"Input tokens: {input_tokens:,}")
print(f"Output tokens: {output_tokens:,}")
print(f"Total cost: ${cost:.4f}")
```
## Batch Processing

```python
import os

from processing.document_processor import process_document_with_redaction

# Configuration
AZURE_OPENAI_ENDPOINT = "your-azure-openai-endpoint"
AZURE_OPENAI_KEY = "your-azure-openai-key"
AZURE_OPENAI_VERSION = "2024-02-15-preview"
AZURE_OPENAI_DEPLOYMENT = "o3-mini"

# Process multiple documents
pdf_directory = "path/to/pdf/files"
results = []

for filename in os.listdir(pdf_directory):
    if filename.endswith('.pdf'):
        file_path = os.path.join(pdf_directory, filename)
        print(f"Processing {filename}...")
        try:
            result = process_document_with_redaction(
                file_path=file_path,
                endpoint=AZURE_OPENAI_ENDPOINT,
                api_key=AZURE_OPENAI_KEY,
                api_version=AZURE_OPENAI_VERSION,
                deployment=AZURE_OPENAI_DEPLOYMENT
            )
            results.append({
                'filename': filename,
                'original_md': result.original_document_md,
                'redacted_md': result.redacted_document_md,
                'input_tokens': result.input_tokens,
                'output_tokens': result.output_tokens,
                'cost': result.cost
            })
            print(f"  ✓ Completed - Cost: ${result.cost:.4f}")
        except Exception as e:
            print(f"  ✗ Error processing {filename}: {e}")

# Summary
total_cost = sum(r['cost'] for r in results)
total_input_tokens = sum(r['input_tokens'] for r in results)
total_output_tokens = sum(r['output_tokens'] for r in results)

print("\nBatch processing complete!")
print(f"Documents processed: {len(results)}")
print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```
## Environment Variables
You can also use environment variables for configuration:
```python
import os

from dotenv import load_dotenv

from processing.document_processor import process_document_with_redaction

# Load environment variables
load_dotenv()

# Get configuration from environment
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_KEY = os.getenv("AZURE_OPENAI_KEY")
AZURE_OPENAI_VERSION = os.getenv("AZURE_OPENAI_VERSION")
AZURE_OPENAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT")

# Process document
result = process_document_with_redaction(
    file_path="document.pdf",
    endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_KEY,
    api_version=AZURE_OPENAI_VERSION,
    deployment=AZURE_OPENAI_DEPLOYMENT
)
```
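A matching `.env` file would look like this (all values are placeholders; the endpoint format follows the usual Azure OpenAI resource URL convention):

```
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-openai-key
AZURE_OPENAI_VERSION=2024-02-15-preview
AZURE_OPENAI_DEPLOYMENT=o3-mini
```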
## Return Value

The function returns a `ProcessingResult` object with the following attributes:

- `original_document_md`: Markdown version of the original document
- `redacted_document_md`: Markdown version with medication sections removed
- `input_tokens`: Number of input tokens used
- `output_tokens`: Number of output tokens generated
- `cost`: Total cost in USD
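For type hints and autocompletion it helps to know the shape of this object. A minimal sketch based only on the attributes listed above (the real class lives in `processing.document_processor` and may differ):

```python
from dataclasses import dataclass

@dataclass
class ProcessingResult:
    """Sketch of the return type; the actual definition may differ."""
    original_document_md: str
    redacted_document_md: str
    input_tokens: int
    output_tokens: int
    cost: float
```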
## Supported Models

The function supports the following Azure OpenAI deployment names:

- `o3-mini` - Cheapest option
- `o4-mini` - Priced the same as `o3-mini`
- `o3` - Medium cost
- `o4` - Most expensive but most capable
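Any other deployment name is likely to fail at the API level, so it can be worth validating up front. An illustrative helper (`check_deployment` is not part of the library; the set simply mirrors the list above):

```python
SUPPORTED_DEPLOYMENTS = {"o3-mini", "o4-mini", "o3", "o4"}

def check_deployment(name: str) -> str:
    """Fail fast on deployment names this pipeline does not support."""
    if name not in SUPPORTED_DEPLOYMENTS:
        raise ValueError(
            f"Unsupported deployment {name!r}; expected one of {sorted(SUPPORTED_DEPLOYMENTS)}"
        )
    return name
```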
## Error Handling
The function will raise exceptions for:
- File not found
- Invalid Azure OpenAI credentials
- API rate limits
- Network errors
Make sure to handle these appropriately in your pipeline.
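The exact exception types depend on the underlying SDK, so a defensive pattern is to treat a missing file as fatal and retry everything else with backoff. A minimal sketch (the retry count and delays are arbitrary choices, and the configuration variables are assumed to be set as in the earlier examples):

```python
import time

max_attempts = 3
for attempt in range(1, max_attempts + 1):
    try:
        result = process_document_with_redaction(
            file_path="document.pdf",
            endpoint=AZURE_OPENAI_ENDPOINT,
            api_key=AZURE_OPENAI_KEY,
            api_version=AZURE_OPENAI_VERSION,
            deployment=AZURE_OPENAI_DEPLOYMENT
        )
        break
    except FileNotFoundError:
        raise  # A missing file will not fix itself; do not retry
    except Exception as e:
        # Rate limits and transient network errors: back off and retry
        if attempt == max_attempts:
            raise
        wait = 2 ** attempt
        print(f"Attempt {attempt} failed ({e}); retrying in {wait}s...")
        time.sleep(wait)
```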