# Jupyter Notebook Usage
This document shows how to call the document processing function, `process_document_with_redaction`, from a Jupyter notebook so that it can be integrated into a larger processing pipeline.
## Simple Usage
```python
from processing.document_processor import process_document_with_redaction

# Process a single document
result = process_document_with_redaction(
    file_path="path/to/your/document.pdf",
    endpoint="your-azure-openai-endpoint",
    api_key="your-azure-openai-key",
    api_version="2024-02-15-preview",
    deployment="o3-mini",  # or "o4-mini", "o3", "o4"
)

# Access the results
original_md = result.original_document_md
redacted_md = result.redacted_document_md
input_tokens = result.input_tokens
output_tokens = result.output_tokens
cost = result.cost

# Report token usage and cost
print("Processing complete!")
print(f"Input tokens: {input_tokens:,}")
print(f"Output tokens: {output_tokens:,}")
print(f"Total cost: ${cost:.4f}")
```
## Batch Processing
```python
import os
from processing.document_processor import process_document_with_redaction

# Configuration
AZURE_OPENAI_ENDPOINT = "your-azure-openai-endpoint"
AZURE_OPENAI_KEY = "your-azure-openai-key"
AZURE_OPENAI_VERSION = "2024-02-15-preview"
AZURE_OPENAI_DEPLOYMENT = "o3-mini"

# Process multiple documents
pdf_directory = "path/to/pdf/files"
results = []

for filename in os.listdir(pdf_directory):
    if filename.endswith('.pdf'):
        file_path = os.path.join(pdf_directory, filename)
        print(f"Processing {filename}...")
        try:
            result = process_document_with_redaction(
                file_path=file_path,
                endpoint=AZURE_OPENAI_ENDPOINT,
                api_key=AZURE_OPENAI_KEY,
                api_version=AZURE_OPENAI_VERSION,
                deployment=AZURE_OPENAI_DEPLOYMENT
            )
            # Keep the outputs and usage figures for the summary below
            results.append({
                'filename': filename,
                'original_md': result.original_document_md,
                'redacted_md': result.redacted_document_md,
                'input_tokens': result.input_tokens,
                'output_tokens': result.output_tokens,
                'cost': result.cost
            })
            print(f"  ✓ Completed - Cost: ${result.cost:.4f}")
        except Exception as e:
            # A failed document is reported and skipped; the loop continues
            print(f"  ✗ Error processing {filename}: {e}")

# Summary (only successfully processed documents are counted)
total_cost = sum(r['cost'] for r in results)
total_input_tokens = sum(r['input_tokens'] for r in results)
total_output_tokens = sum(r['output_tokens'] for r in results)

print("\nBatch processing complete!")
print(f"Documents processed: {len(results)}")
print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```
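In a pipeline, the `results` list built by the batch loop usually needs to be written somewhere. The following is a minimal sketch of one way to do that, assuming the dictionary keys used in the loop above. The helper name `save_batch_results`, the `.redacted.md` suffix, and the `summary.json` file name are illustrative choices, not part of the `processing` package.

```python
import json
from pathlib import Path


def save_batch_results(results, output_dir):
    """Write each redacted document and a JSON usage summary to output_dir.

    `results` is a list of dicts with the keys used in the batch loop:
    filename, redacted_md, input_tokens, output_tokens, cost.
    Returns the path of the summary file.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    summary = []
    for r in results:
        # report.pdf -> report.redacted.md (hypothetical naming scheme)
        stem = Path(r["filename"]).stem
        (out / f"{stem}.redacted.md").write_text(r["redacted_md"], encoding="utf-8")
        summary.append({
            "filename": r["filename"],
            "input_tokens": r["input_tokens"],
            "output_tokens": r["output_tokens"],
            "cost": r["cost"],
        })

    summary_path = out / "summary.json"
    summary_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return summary_path
```

Calling `save_batch_results(results, "output")` after the loop would leave one Markdown file per processed PDF plus a single usage summary.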
## Environment Variables
Configuration can also be read from environment variables, for example from a `.env` file loaded with `python-dotenv`:
```python
import os
from dotenv import load_dotenv
from processing.document_processor import process_document_with_redaction

# Load environment variables from a .env file, if one exists
load_dotenv()

# Get configuration from environment (os.getenv returns None for unset variables)
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_KEY = os.getenv("AZURE_OPENAI_KEY")
AZURE_OPENAI_VERSION = os.getenv("AZURE_OPENAI_VERSION")
AZURE_OPENAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT")

# Process document
result = process_document_with_redaction(
    file_path="document.pdf",
    endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_KEY,
    api_version=AZURE_OPENAI_VERSION,
    deployment=AZURE_OPENAI_DEPLOYMENT
)
```
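Because `os.getenv` returns `None` for a variable that is not set, a missing value would only surface later, inside the API call. A small check up front can make that failure easier to diagnose. The sketch below assumes the four variable names used above; `load_azure_config` and `REQUIRED_VARS` are hypothetical names introduced here for illustration.

```python
import os

# The four variables read in the example above
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
)


def load_azure_config(env=None):
    """Return the keyword arguments for the processing call, or raise if any are unset.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    # Treat empty strings the same as unset variables
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "endpoint": env["AZURE_OPENAI_ENDPOINT"],
        "api_key": env["AZURE_OPENAI_KEY"],
        "api_version": env["AZURE_OPENAI_VERSION"],
        "deployment": env["AZURE_OPENAI_DEPLOYMENT"],
    }
```

The returned dictionary uses the same keyword names as the calls in this document, so it could be passed as `process_document_with_redaction(file_path="document.pdf", **load_azure_config())`.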
## Return Value
The function returns a `ProcessingResult` object with the following attributes:
- `original_document_md`: Markdown version of the original document
- `redacted_document_md`: Markdown version with medication sections removed
- `input_tokens`: Number of input tokens used
- `output_tokens`: Number of output tokens generated
- `cost`: Total cost in USD
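For testing notebook code without calling the API, it can help to have a stand-in object with the same attributes. The dataclass below is a hypothetical mirror of the attributes listed above, with types inferred from how the examples format them. It is not the actual class definition from the `processing` package, and `summarize` is an illustrative helper.

```python
from dataclasses import dataclass


@dataclass
class ProcessingResult:
    """Stand-in with the documented attributes (assumed types)."""
    original_document_md: str
    redacted_document_md: str
    input_tokens: int
    output_tokens: int
    cost: float  # USD


def summarize(result):
    """Format token usage and cost on one line, using the formats from the examples."""
    return (
        f"{result.input_tokens:,} in / "
        f"{result.output_tokens:,} out / "
        f"${result.cost:.4f}"
    )


fake = ProcessingResult("# Doc", "# Doc (redacted)", 1200, 300, 0.0123)
print(summarize(fake))
```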
## Supported Models
The function supports the following Azure OpenAI deployment names:
- `o3-mini` (GPT-4o Mini) - Cheapest option
- `o4-mini` (GPT-4o Mini) - Same as o3-mini
- `o3` (GPT-3.5 Turbo) - Medium cost
- `o4` (GPT-4o) - Most expensive but most capable
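Checking the deployment name before a long batch run avoids discovering a typo only after the first API call. This is a minimal sketch that assumes the four names listed above are the complete set. `SUPPORTED_DEPLOYMENTS` and `check_deployment` are hypothetical names, and the function itself may perform its own validation.

```python
# Deployment names as listed in this document
SUPPORTED_DEPLOYMENTS = ("o3-mini", "o4-mini", "o3", "o4")


def check_deployment(name):
    """Return the name unchanged if it is listed, otherwise raise ValueError."""
    if name not in SUPPORTED_DEPLOYMENTS:
        raise ValueError(
            f"Unsupported deployment {name!r}; expected one of: "
            + ", ".join(SUPPORTED_DEPLOYMENTS)
        )
    return name
```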
## Error Handling
The function will raise exceptions for:
- File not found
- Invalid Azure OpenAI credentials
- API rate limits
- Network errors

Catch these in your pipeline so that one failed document does not stop the whole run.
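Rate limits and network errors are often transient, so retrying with a growing delay is a common way to handle them, while a missing file will not fix itself. The sketch below shows that pattern under stated assumptions: this document does not name the exception classes the function raises, so the code assumes a missing file surfaces as Python's built-in `FileNotFoundError` and treats every other exception as retryable. Credential errors would also be retried here, and should be excluded once their exception type is known. `call_with_retry` is a hypothetical helper.

```python
import time


def call_with_retry(func, *args, max_attempts=3, base_delay=1.0,
                    sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying failures with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    The last failure is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except FileNotFoundError:
            # Assumption: a missing file raises FileNotFoundError; retrying cannot help
            raise
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

In a notebook this could wrap the processing call, for example `call_with_retry(process_document_with_redaction, file_path=file_path, endpoint=..., api_key=..., api_version=..., deployment=...)`, with the elided values filled in from your own configuration.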