# Architecture Documentation
## System Overview
The Deep-Research PDF Field Extractor is a multi-agent system designed to extract structured data from biotech-related PDFs. The system uses Azure Document Intelligence for document processing and Azure OpenAI for intelligent field extraction.
## Core Architecture
### Multi-Agent Design
The system follows a multi-agent architecture where each agent has a specific responsibility:
```
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│    PDFAgent      │    │   TableAgent     │    │   IndexAgent     │
│                  │    │                  │    │                  │
│ • PDF Text       │───▶│ • Table          │───▶│ • Semantic       │
│   Extraction     │    │   Processing     │    │   Indexing       │
└──────────────────┘    └──────────────────┘    └──────────────────┘
                                                         │
                                                         ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ UniqueIndices    │    │ UniqueIndices    │    │ FieldMapper      │
│ Combinator       │    │ LoopAgent        │    │ Agent            │
│                  │    │                  │    │                  │
│ • Extract        │───▶│ • Loop through   │    │ • Extract        │
│   combinations   │    │   combinations   │    │   individual     │
│                  │    │ • Add fields     │    │   fields         │
└──────────────────┘    └──────────────────┘    └──────────────────┘
```
### Execution Flow
#### Original Strategy Flow
```
1. PDFAgent → Extract text from PDF
2. TableAgent → Process tables with Azure DI
3. IndexAgent → Create semantic search index
4. ForEachField → Iterate through fields
5. FieldMapperAgent → Extract each field value
```
#### Unique Indices Strategy Flow
```
1. PDFAgent → Extract text from PDF
2. TableAgent → Process tables with Azure DI
3. UniqueIndicesCombinator → Extract unique combinations
4. UniqueIndicesLoopAgent → Extract additional fields for each combination
```
## Agent Details
### PDFAgent
- **Purpose**: Extract text content from PDF files
- **Technology**: PyMuPDF (fitz)
- **Output**: Raw text content
- **Error Handling**: Graceful handling of corrupted PDFs
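As a rough illustration of this step, a minimal sketch using PyMuPDF follows; the helper name and the bytes-based input are assumptions for illustration, not the agent's actual interface.
```python
import fitz  # PyMuPDF


def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Hypothetical helper: return the concatenated text of every page."""
    try:
        doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    except Exception as exc:  # corrupted or unreadable PDF
        raise ValueError(f"Could not open PDF: {exc}") from exc
    # Join page texts with form feeds so page boundaries remain visible downstream
    return "\f".join(page.get_text() for page in doc)
```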
### TableAgent
- **Purpose**: Process tables using Azure Document Intelligence
- **Technology**: Azure DI Layout Analysis
- **Features**:
  - Table structure preservation
  - Rowspan/colspan handling
  - HTML table generation for debugging
- **Output**: Processed table data
### UniqueIndicesCombinator
- **Purpose**: Extract unique combinations of specified indices
- **Input**: Document text, unique indices descriptions
- **LLM Prompt**: Structured prompt for combination extraction
- **Output**: JSON array of unique combinations
- **Cost Tracking**: Tracks input/output tokens
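A minimal sketch of how a combinator step like this could be wired, assuming a hypothetical `llm_client.responses()` call that returns raw JSON text; the real prompt wording and parsing rules live in the agent itself.
```python
import json


def extract_unique_combinations(llm_client, text: str, indices: dict[str, str]) -> list[dict]:
    # indices maps an index name to its description, e.g. {"Sample ID": "unique sample identifier"}
    described = "\n".join(f"- {name}: {desc}" for name, desc in indices.items())
    prompt = (
        "From the document below, list every unique combination of these indices "
        "as a JSON array of objects.\n"
        f"Indices:\n{described}\n\nDocument:\n{text}"
    )
    raw = llm_client.responses(prompt)
    return json.loads(raw)  # expected shape: [{"<index>": "<value>", ...}, ...]
```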
### UniqueIndicesLoopAgent
- **Purpose**: Extract additional fields for each unique combination
- **Input**: Unique combinations, field descriptions
- **Process**: Loops through each combination
- **LLM Calls**: One call per combination
- **Error Handling**: Continues with partial failures
- **Output**: Complete data with all fields
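The per-combination loop and its graceful handling of partial failures could look roughly like this; `extract_fields_for_combination` is a hypothetical stand-in for the agent's LLM call.
```python
import logging

logger = logging.getLogger(__name__)


def fill_fields(combinations: list[dict], field_names: list[str], extract_fields_for_combination) -> list[dict]:
    results = []
    for combo in combinations:
        try:
            # One LLM call per combination returns the remaining field values
            fields = extract_fields_for_combination(combo, field_names)
        except Exception:
            logger.exception("Extraction failed for combination %s", combo)
            fields = {name: None for name in field_names}  # continue with nulls
        results.append({**combo, **fields})
    return results
```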
### FieldMapperAgent
- **Purpose**: Extract individual field values
- **Strategies**:
  - Page-by-page analysis
  - Semantic search fallback
  - Unique indices strategy
- **Features**: Context-aware extraction
- **Output**: Field values with confidence scores
### IndexAgent
- **Purpose**: Create semantic search indices
- **Technology**: Azure OpenAI Embeddings
- **Features**: Chunk-based indexing
- **Output**: Searchable document index
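A rough sketch of chunk-based indexing against Azure OpenAI embeddings; the chunk size, deployment name, and in-memory index structure are assumptions for illustration.
```python
from openai import AzureOpenAI


def build_index(client: AzureOpenAI, text: str, deployment: str, chunk_size: int = 1000):
    # Naive fixed-size character chunking; the real agent may chunk differently
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    response = client.embeddings.create(model=deployment, input=chunks)
    # Pair each chunk with its embedding vector for later similarity search
    return [(chunk, item.embedding) for chunk, item in zip(chunks, response.data)]
```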
## Services
### LLMClient
```python
class LLMClient:
    def __init__(self, settings):
        # Azure OpenAI configuration
        self._deployment = settings.AZURE_OPENAI_DEPLOYMENT
        self._max_retries = settings.LLM_MAX_RETRIES
        self._base_delay = settings.LLM_BASE_DELAY

    def responses(self, prompt, **kwargs):
        # Retry logic with exponential backoff,
        # cost tracking integration, and error handling
        ...
```
**Key Features:**
- Retry logic with exponential backoff
- Cost tracking integration
- Error classification (retryable vs. non-retryable)
- Jitter to prevent thundering herd
### CostTracker
```python
from typing import List

class CostTracker:
    def __init__(self):
        self.llm_calls: List[LLMCall] = []
        self.current_file_costs = {}
        self.total_costs = {}

    def add_llm_tokens(self, input_tokens, output_tokens, description):
        # Track individual LLM calls, calculate costs,
        # and store detailed information
        ...
```
**Key Features:**
- Individual call tracking
- Cost calculation based on Azure pricing
- Detailed breakdown by operation
- Session and total cost tracking
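A minimal sketch of the token-to-cost bookkeeping; the per-1K-token rates below are placeholders, not actual Azure pricing.
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LLMCall:
    description: str
    input_tokens: int
    output_tokens: int
    cost: float


@dataclass
class SimpleCostTracker:
    # Placeholder prices per 1K tokens -- substitute the rates for your deployment
    input_price_per_1k: float = 0.005
    output_price_per_1k: float = 0.015
    llm_calls: List[LLMCall] = field(default_factory=list)

    def add_llm_tokens(self, input_tokens: int, output_tokens: int, description: str) -> float:
        cost = (input_tokens / 1000) * self.input_price_per_1k \
            + (output_tokens / 1000) * self.output_price_per_1k
        self.llm_calls.append(LLMCall(description, input_tokens, output_tokens, cost))
        return cost

    @property
    def total_cost(self) -> float:
        return sum(call.cost for call in self.llm_calls)
```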
### AzureDIService
```python
class AzureDIService:
    def extract_tables(self, pdf_bytes):
        # Azure DI Layout Analysis,
        # table structure preservation,
        # and HTML debugging output
        ...
```
**Key Features:**
- Layout analysis for complex documents
- Table structure preservation
- Debug output generation
- Error handling for DI operations
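A hedged sketch of the layout-analysis call, assuming the GA `azure-ai-documentintelligence` SDK; exact parameter names vary between SDK versions, so treat this as illustrative rather than the service's actual code.
```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential


def extract_tables(endpoint: str, key: str, pdf_bytes: bytes) -> list[dict]:
    client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))
    poller = client.begin_analyze_document(
        "prebuilt-layout", pdf_bytes, content_type="application/pdf"
    )
    result = poller.result()
    tables = []
    for table in result.tables or []:
        # Keep row/column indices and spans so the table structure is preserved
        cells = [
            {
                "row": cell.row_index,
                "col": cell.column_index,
                "row_span": cell.row_span or 1,
                "col_span": cell.column_span or 1,
                "content": cell.content,
            }
            for cell in table.cells
        ]
        tables.append({"rows": table.row_count, "cols": table.column_count, "cells": cells})
    return tables
```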
## Data Flow
### Context Management
The system uses a context dictionary to pass data between agents:
```python
ctx = {
    "pdf_file": pdf_file,
    "text": extracted_text,
    "fields": field_list,
    "unique_indices": unique_indices,
    "field_descriptions": field_descriptions,
    "cost_tracker": cost_tracker,
    "results": [],
    "strategy": strategy,
}
```
### Result Processing
Results are processed through multiple stages:
1. **Raw Extraction**: LLM responses in JSON format
2. **Validation**: JSON parsing and structure validation
3. **Flattening**: Convert to tabular format
4. **DataFrame**: Final structured output
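Stages 2 through 4 amount to parsing the raw LLM responses and flattening them into rows; a sketch with pandas, assuming each response is a JSON object or array of objects:
```python
import json

import pandas as pd


def results_to_dataframe(raw_results: list[str]) -> pd.DataFrame:
    rows = []
    for raw in raw_results:
        parsed = json.loads(raw)   # stage 2: validate JSON structure
        if isinstance(parsed, dict):
            parsed = [parsed]
        rows.extend(parsed)        # stage 3: flatten to one dict per row
    return pd.DataFrame(rows)      # stage 4: tabular output
```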
## Error Handling Strategy
### Retry Logic
```python
def _should_retry(self, exception) -> bool:
    # Retry on 5xx errors
    if hasattr(exception, 'status_code'):
        return exception.status_code >= 500
    # Retry on connection errors
    return any(error in str(exception) for error in ['Timeout', 'Connection'])
```
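`_should_retry` plugs into a backoff loop along these lines; the sketch reuses the retry settings shown under Configuration Management and is illustrative, not the client's exact code.
```python
import random
import time


def call_with_retries(func, should_retry, max_retries: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries - 1 or not should_retry(exc):
                raise
            # Exponential backoff with jitter to avoid thundering-herd retries
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))
```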
### Graceful Degradation
- Continue processing with partial failures
- Return null values for failed extractions
- Log detailed error information
- Maintain cost tracking during failures
### Error Classification
- **Retryable**: 5xx server errors (500, 503), connection timeouts
- **Non-retryable**: 4xx client errors (400, 401), validation errors
- **Fatal**: Configuration errors, missing dependencies
## Performance Considerations
### Optimization Strategies
1. **Parallel Processing**: Independent field extraction (see the sketch after this list)
2. **Caching**: Session state for field descriptions
3. **Batching**: Group similar operations
4. **Early Termination**: Stop on critical failures
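For item 1, field extractions that do not depend on each other can be fanned out over a thread pool, since the work is I/O-bound on the LLM API; `extract_field` below is a hypothetical per-field callable.
```python
from concurrent.futures import ThreadPoolExecutor


def extract_fields_parallel(fields: list[str], extract_field, max_workers: int = 4) -> dict:
    # Each field extraction is independent, so they can run concurrently;
    # keep max_workers modest to respect Azure rate limits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = pool.map(extract_field, fields)
    return dict(zip(fields, values))
```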
### Resource Management
- **Memory**: Efficient text processing
- **API Limits**: Respect Azure rate limits
- **Cost Control**: Detailed tracking and alerts
- **Timeout Handling**: Configurable timeouts
## Security
### Data Protection
- No persistent storage of sensitive data
- Secure API key management
- Session-based data handling
- Log sanitization
### Access Control
- Environment variable configuration
- API key validation
- Error message sanitization
## Monitoring and Observability
### Logging Strategy
```python
# Structured logging with levels
logger.info(f"Processing {len(combinations)} combinations")
logger.debug(f"LLM response: {response[:200]}...")
logger.error(f"Failed to extract field: {field}")
```
### Metrics Collection
- LLM call counts and durations
- Token usage and costs
- Success/failure rates
- Processing times
### Debug Information
- Detailed execution traces
- Cost breakdown tables
- Error context and stack traces
- Performance metrics
## Configuration Management
### Settings Structure
```python
class Settings(BaseSettings):
    # Azure OpenAI
    AZURE_OPENAI_ENDPOINT: str
    AZURE_OPENAI_API_KEY: str
    AZURE_OPENAI_DEPLOYMENT: str

    # Azure Document Intelligence
    AZURE_DI_ENDPOINT: str
    AZURE_DI_KEY: str

    # Retry Configuration
    LLM_MAX_RETRIES: int = 5
    LLM_BASE_DELAY: float = 1.0
    LLM_MAX_DELAY: float = 60.0
```
### Environment Variables
- `.env` file support
- Environment variable override
- Validation and defaults
- Secure key management
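With pydantic-settings, `.env` support and environment-variable override come from the model config; a minimal sketch, assuming pydantic-settings v2:
```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Values come from the process environment first, then from .env
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    AZURE_OPENAI_ENDPOINT: str
    AZURE_OPENAI_API_KEY: str
    AZURE_OPENAI_DEPLOYMENT: str
    LLM_MAX_RETRIES: int = 5


settings = Settings()  # raises a validation error if required keys are missing
```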
## Testing Strategy
### Unit Tests
- Individual agent testing
- Service layer testing
- Mock external dependencies
- Cost tracking validation
### Integration Tests
- End-to-end workflows
- Error scenario testing
- Performance benchmarking
- Cost accuracy validation
### Test Coverage
- Core functionality: 90%+
- Error handling: 100%
- Cost tracking: 100%
- Retry logic: 100%
## Deployment
### Requirements
- Python 3.9+
- Azure OpenAI access
- Azure Document Intelligence access
- Streamlit for UI
### Dependencies
```
azure-ai-documentintelligence
openai
streamlit
pandas
pymupdf
pydantic-settings
```
### Environment Setup
1. Install dependencies
2. Configure environment variables
3. Set up Azure resources
4. Test connectivity
5. Deploy application
## Future Enhancements
### Planned Features
- **Batch Processing**: Multiple document processing
- **Custom Models**: Domain-specific extraction
- **Advanced Caching**: Redis-based caching
- **API Endpoints**: REST API for integration
- **Real-time Processing**: Streaming document processing
### Scalability Improvements
- **Microservices**: Agent separation
- **Queue System**: Asynchronous processing
- **Load Balancing**: Multiple instances
- **Database Integration**: Persistent storage
### Performance Optimizations
- **Vector Search**: Enhanced semantic search
- **Model Optimization**: Smaller, faster models
- **Parallel Processing**: Multi-threaded extraction
- **Memory Optimization**: Efficient data structures