# Architecture Documentation

## System Overview

The Deep-Research PDF Field Extractor is a multi-agent system that extracts structured data from biotech-related PDFs. It uses Azure Document Intelligence for document and table processing and Azure OpenAI for LLM-based field extraction.

## Core Architecture

### Multi-Agent Design

The system follows a multi-agent architecture where each agent has a specific responsibility:

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   PDFAgent      │    │  TableAgent     │    │  IndexAgent     │
│                 │    │                 │    │                 │
│ • PDF Text      │───▶│ • Table         │───▶│ • Semantic      │
│   Extraction    │    │   Processing    │    │   Indexing      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│UniqueIndices    │    │UniqueIndices    │    │FieldMapper      │
│Combinator       │    │LoopAgent        │    │Agent            │
│                 │    │                 │    │                 │
│ • Extract       │───▶│ • Loop through  │    │ • Extract       │
│   combinations  │    │   combinations  │    │   individual    │
│                 │    │ • Add fields    │    │   fields        │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

### Execution Flow

#### Original Strategy Flow
```
1. PDFAgent → Extract text from PDF
2. TableAgent → Process tables with Azure DI
3. IndexAgent → Create semantic search index
4. ForEachField → Iterate through fields
5. FieldMapperAgent → Extract each field value
```

#### Unique Indices Strategy Flow
```
1. PDFAgent → Extract text from PDF
2. TableAgent → Process tables with Azure DI
3. UniqueIndicesCombinator → Extract unique combinations
4. UniqueIndicesLoopAgent → Extract additional fields for each combination
```
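The two flows share their first two steps and differ only in what follows. A minimal sketch of that dispatch, assuming each agent is a callable that takes and returns the shared context dictionary; the strategy keys `"original"` and `"unique_indices"`, the `FLOWS` table, and `run_flow` are illustrative names, not the project's actual API:

```python
from typing import Callable, Dict, List

# An agent here is any callable that takes the shared context dict and
# returns it; the real agent classes are not shown in this document.
Agent = Callable[[dict], dict]

# Step order per strategy, taken from the two flows above. The strategy
# keys are assumed names. The ForEachField step is folded into
# FieldMapperAgent here for brevity.
FLOWS: Dict[str, List[str]] = {
    "original": ["PDFAgent", "TableAgent", "IndexAgent", "FieldMapperAgent"],
    "unique_indices": [
        "PDFAgent",
        "TableAgent",
        "UniqueIndicesCombinator",
        "UniqueIndicesLoopAgent",
    ],
}


def run_flow(strategy: str, agents: Dict[str, Agent], ctx: dict) -> dict:
    """Run the agents for `strategy` in order, threading `ctx` through."""
    if strategy not in FLOWS:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    for name in FLOWS[strategy]:
        ctx = agents[name](ctx)
    return ctx
```

Keeping the step order in a table makes it easy to see that the strategies diverge only after `TableAgent`.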

## Agent Details

### PDFAgent
- **Purpose**: Extract text content from PDF files
- **Technology**: PyMuPDF (fitz)
- **Output**: Raw text content
- **Error Handling**: Graceful handling of corrupted PDFs

### TableAgent
- **Purpose**: Process tables using Azure Document Intelligence
- **Technology**: Azure DI Layout Analysis
- **Features**: 
  - Table structure preservation
  - Rowspan/colspan handling
  - HTML table generation for debugging
- **Output**: Processed table data

### UniqueIndicesCombinator
- **Purpose**: Extract unique combinations of specified indices
- **Input**: Document text, unique indices descriptions
- **LLM Prompt**: Structured prompt for combination extraction
- **Output**: JSON array of unique combinations
- **Cost Tracking**: Tracks input/output tokens

### UniqueIndicesLoopAgent
- **Purpose**: Extract additional fields for each unique combination
- **Input**: Unique combinations, field descriptions
- **Process**: Loops through each combination
- **LLM Calls**: One call per combination
- **Error Handling**: Continues with partial failures
- **Output**: Complete data with all fields
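The per-combination loop with partial-failure handling could look roughly like the sketch below. The function name `extract_for_combinations` and the injected `extract_one` callable (standing in for the single LLM call) are assumptions for illustration, not the agent's real interface:

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)

Combination = Dict[str, Any]


def extract_for_combinations(
    combinations: List[Combination],
    fields: List[str],
    extract_one: Callable[[Combination], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Call `extract_one` once per combination; failed calls yield None fields."""
    rows = []
    for combo in combinations:
        try:
            extracted = extract_one(combo)  # one (hypothetical) LLM call per combination
        except Exception:
            # Log with stack trace and keep going instead of aborting the run.
            logger.exception("Extraction failed for combination %r", combo)
            extracted = {}
        row = dict(combo)
        for name in fields:
            row[name] = extracted.get(name)  # None when missing or failed
        rows.append(row)
    return rows
```

Every combination produces a row even when its call fails, so the output keeps one entry per combination and failures show up as null values.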

### FieldMapperAgent
- **Purpose**: Extract individual field values
- **Strategies**: 
  - Page-by-page analysis
  - Semantic search fallback
  - Unique indices strategy
- **Features**: Context-aware extraction
- **Output**: Field values with confidence scores

### IndexAgent
- **Purpose**: Create semantic search indices
- **Technology**: Azure OpenAI Embeddings
- **Features**: Chunk-based indexing
- **Output**: Searchable document index
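To illustrate what chunk-based indexing involves, here is a small sketch that splits text into overlapping chunks and ranks them by cosine similarity. The chunk size, overlap, and the injected `embed` function are assumptions; in the real system the embedding would come from Azure OpenAI, which is not called here:

```python
import math
from typing import Callable, List, Sequence, Tuple


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split `text` into chunks of up to `size` chars sharing `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query: str,
    chunks: List[str],
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the `top_k` chunks most similar to `query`, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```

A production index would embed each chunk once and store the vectors rather than re-embedding on every query.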

## Services

### LLMClient
```python
class LLMClient:
    """Thin wrapper around Azure OpenAI with retries and cost tracking."""

    def __init__(self, settings):
        # Azure OpenAI configuration
        self._deployment = settings.AZURE_OPENAI_DEPLOYMENT
        self._max_retries = settings.LLM_MAX_RETRIES
        self._base_delay = settings.LLM_BASE_DELAY

    def responses(self, prompt, **kwargs):
        # Retry logic with exponential backoff
        # Cost tracking integration
        # Error handling
        ...  # body omitted in this overview
```

**Key Features:**
- Retry logic with exponential backoff
- Cost tracking integration
- Error classification (retryable vs non-retryable)
- Jitter to prevent thundering herd
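A minimal sketch of exponential backoff with jitter follows. It uses the "full jitter" variant (a uniform random delay up to the capped exponential), which is an assumption; the document does not say which variant `LLMClient` uses. The names `backoff_delay` and `call_with_retry` are illustrative, and the defaults mirror the `LLM_MAX_RETRIES`, `LLM_BASE_DELAY`, and `LLM_MAX_DELAY` settings:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Delay before retry `attempt` (0-based): capped exponential, full jitter."""
    capped = min(max_delay, base_delay * (2 ** attempt))
    # Scale by a random factor in [0, 1) so clients do not retry in lockstep.
    return capped * rng()


def call_with_retry(
    fn: Callable[[], T],
    should_retry: Callable[[Exception], bool],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying retryable failures up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on a non-retryable error.
            if attempt == max_retries or not should_retry(exc):
                raise
            sleep(backoff_delay(attempt, base_delay, max_delay))
```

Passing `sleep` and `rng` as parameters keeps the retry loop testable without real waiting.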

### CostTracker
```python
from typing import List


class CostTracker:
    def __init__(self):
        # LLMCall is the per-call record type (definition not shown here)
        self.llm_calls: List["LLMCall"] = []
        self.current_file_costs = {}
        self.total_costs = {}

    def add_llm_tokens(self, input_tokens, output_tokens, description):
        # Track individual LLM calls
        # Calculate costs
        # Store detailed information
        ...  # body omitted in this overview
```

**Key Features:**
- Individual call tracking
- Cost calculation based on Azure pricing
- Detailed breakdown by operation
- Session and total cost tracking
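As a rough illustration of per-call tracking and cost calculation, the sketch below records token counts and derives costs from per-1,000-token prices. The prices are placeholders, not actual Azure pricing, and `SketchCostTracker` and the `LLMCall` fields are assumed shapes rather than the project's real classes:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Placeholder prices in USD per 1,000 tokens; NOT real Azure pricing.
INPUT_PRICE_PER_1K = 0.01
OUTPUT_PRICE_PER_1K = 0.03


@dataclass
class LLMCall:
    description: str
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        """Cost of this call under the placeholder prices above."""
        return (
            self.input_tokens / 1000 * INPUT_PRICE_PER_1K
            + self.output_tokens / 1000 * OUTPUT_PRICE_PER_1K
        )


@dataclass
class SketchCostTracker:
    llm_calls: List[LLMCall] = field(default_factory=list)

    def add_llm_tokens(self, input_tokens: int, output_tokens: int, description: str) -> None:
        """Record one LLM call."""
        self.llm_calls.append(LLMCall(description, input_tokens, output_tokens))

    def total_cost(self) -> float:
        return sum(call.cost for call in self.llm_calls)

    def breakdown(self) -> Dict[str, float]:
        """Total cost grouped by the call description."""
        totals: Dict[str, float] = {}
        for call in self.llm_calls:
            totals[call.description] = totals.get(call.description, 0.0) + call.cost
        return totals
```

Storing raw token counts and computing cost on demand means a price change only touches the two constants.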

### AzureDIService
```python
class AzureDIService:
    def extract_tables(self, pdf_bytes):
        # Azure DI Layout Analysis
        # Table structure preservation
        # HTML debugging output
        ...  # body omitted in this overview
```

**Key Features:**
- Layout analysis for complex documents
- Table structure preservation
- Debug output generation
- Error handling for DI operations

## Data Flow

### Context Management
The system uses a context dictionary to pass data between agents:

```python
ctx = {
    "pdf_file": pdf_file,
    "text": extracted_text,
    "fields": field_list,
    "unique_indices": unique_indices,
    "field_descriptions": field_descriptions,
    "cost_tracker": cost_tracker,
    "results": [],
    "strategy": strategy
}
```

### Result Processing
Results are processed through multiple stages:

1. **Raw Extraction**: LLM responses in JSON format
2. **Validation**: JSON parsing and structure validation
3. **Flattening**: Convert to tabular format
4. **DataFrame**: Final structured output
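The first three stages can be sketched with the standard library alone. This assumes the LLM returns a JSON array of objects that may contain nested objects; the function names and the dotted-key flattening convention are illustrative choices, not the project's confirmed behavior:

```python
import json
from typing import Any, Dict, List


def parse_llm_json(raw: str) -> List[Dict[str, Any]]:
    """Stages 1-2: parse the raw response and check it is an array of objects."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, list) or not all(isinstance(item, dict) for item in data):
        raise ValueError("expected a JSON array of objects")
    return data


def flatten(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Stage 3: turn nested objects into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def to_rows(raw: str) -> List[Dict[str, Any]]:
    """Run stages 1-3 and return one flat dict per record."""
    return [flatten(record) for record in parse_llm_json(raw)]
```

The resulting list of flat dictionaries can be passed directly to `pandas.DataFrame(rows)` for the final stage.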

## Error Handling Strategy

### Retry Logic
```python
def _should_retry(self, exception) -> bool:
    # Retry on 5xx errors
    if hasattr(exception, 'status_code'):
        return exception.status_code >= 500
    # Retry on connection errors
    return any(error in str(exception) for error in ['Timeout', 'Connection'])
```

### Graceful Degradation
- Continue processing with partial failures
- Return null values for failed extractions
- Log detailed error information
- Maintain cost tracking during failures

### Error Classification
- **Retryable**: 5xx responses (e.g. 500, 503), timeouts, and connection errors
- **Non-retryable**: 4xx responses (e.g. 400, 401), validation errors
- **Fatal**: Configuration errors, missing dependencies
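The three classes above could be expressed as a small classifier. This is a sketch under assumptions: `ErrorClass`, `ConfigurationError`, and `classify_error` are hypothetical names, and mapping `ImportError` to "missing dependency" is one plausible reading of the fatal category:

```python
from enum import Enum


class ErrorClass(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"
    FATAL = "fatal"


class ConfigurationError(Exception):
    """Hypothetical exception for invalid or missing configuration."""


def classify_error(exc: Exception) -> ErrorClass:
    """Map an exception to one of the three classes."""
    # Fatal: configuration problems and missing dependencies.
    if isinstance(exc, (ConfigurationError, ImportError)):
        return ErrorClass.FATAL
    # HTTP-style errors: 5xx is retryable, everything else is not.
    status = getattr(exc, "status_code", None)
    if isinstance(status, int):
        return ErrorClass.RETRYABLE if status >= 500 else ErrorClass.NON_RETRYABLE
    # Built-in timeout and connection errors are transient.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.RETRYABLE
    # Anything else (e.g. validation errors) is not worth retrying.
    return ErrorClass.NON_RETRYABLE
```

Checking exception types instead of matching on message strings avoids misclassifying errors whose text merely mentions "Timeout" or "Connection".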

## Performance Considerations

### Optimization Strategies
1. **Parallel Processing**: Independent field extraction
2. **Caching**: Session state for field descriptions
3. **Batching**: Group similar operations
4. **Early Termination**: Stop on critical failures

### Resource Management
- **Memory**: Efficient text processing
- **API Limits**: Respect Azure rate limits
- **Cost Control**: Detailed tracking and alerts
- **Timeout Handling**: Configurable timeouts

## Security

### Data Protection
- No persistent storage of sensitive data
- Secure API key management
- Session-based data handling
- Log sanitization
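Log sanitization can be done with a `logging.Filter` that redacts secrets before a record is emitted. The sketch below is illustrative only: the regular expression covers just `api_key`-style assignments and is an assumed pattern, not the project's actual rule set:

```python
import logging
import re

# Assumed pattern: matches "api_key=...", "api-key: ...", "API_KEY=..." etc.
_SECRET_RE = re.compile(r"(?i)(api[-_]?key)(\s*[=:]\s*)(\S+)")


def sanitize(message: str) -> str:
    """Replace the value of any api-key style assignment with a marker."""
    return _SECRET_RE.sub(r"\1\2[REDACTED]", message)


class RedactingFilter(logging.Filter):
    """Logging filter that sanitizes the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as arguments are caught too.
        record.msg = sanitize(record.getMessage())
        record.args = None
        return True  # never drop the record, only rewrite it
```

Attaching the filter to a handler with `handler.addFilter(RedactingFilter())` applies it to every record that handler emits.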

### Access Control
- Environment variable configuration
- API key validation
- Error message sanitization

## Monitoring and Observability

### Logging Strategy
```python
import logging

logger = logging.getLogger(__name__)

# Structured logging with levels (lazy %-formatting defers string building)
logger.info("Processing %d combinations", len(combinations))
logger.debug("LLM response: %s...", response[:200])
logger.error("Failed to extract field: %s", field)
```

### Metrics Collection
- LLM call counts and durations
- Token usage and costs
- Success/failure rates
- Processing times

### Debug Information
- Detailed execution traces
- Cost breakdown tables
- Error context and stack traces
- Performance metrics

## Configuration Management

### Settings Structure
```python
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Azure OpenAI
    AZURE_OPENAI_ENDPOINT: str
    AZURE_OPENAI_API_KEY: str
    AZURE_OPENAI_DEPLOYMENT: str
    
    # Azure Document Intelligence
    AZURE_DI_ENDPOINT: str
    AZURE_DI_KEY: str
    
    # Retry Configuration
    LLM_MAX_RETRIES: int = 5
    LLM_BASE_DELAY: float = 1.0
    LLM_MAX_DELAY: float = 60.0
```

### Environment Variables
- `.env` file support
- Environment variable override
- Validation and defaults
- Secure key management

## Testing Strategy

### Unit Tests
- Individual agent testing
- Service layer testing
- Mock external dependencies
- Cost tracking validation

### Integration Tests
- End-to-end workflows
- Error scenario testing
- Performance benchmarking
- Cost accuracy validation

### Test Coverage
- Core functionality: 90%+
- Error handling: 100%
- Cost tracking: 100%
- Retry logic: 100%

## Deployment

### Requirements
- Python 3.9+
- Azure OpenAI access
- Azure Document Intelligence access
- Streamlit for UI

### Dependencies
```
azure-ai-documentintelligence
openai
streamlit
pandas
pymupdf
pydantic-settings
```

### Environment Setup
1. Install dependencies
2. Configure environment variables
3. Set up Azure resources
4. Test connectivity
5. Deploy application

## Future Enhancements

### Planned Features
- **Batch Processing**: Multiple document processing
- **Custom Models**: Domain-specific extraction
- **Advanced Caching**: Redis-based caching
- **API Endpoints**: REST API for integration
- **Real-time Processing**: Streaming document processing

### Scalability Improvements
- **Microservices**: Agent separation
- **Queue System**: Asynchronous processing
- **Load Balancing**: Multiple instances
- **Database Integration**: Persistent storage

### Performance Optimizations
- **Vector Search**: Enhanced semantic search
- **Model Optimization**: Smaller, faster models
- **Parallel Processing**: Multi-threaded extraction
- **Memory Optimization**: Efficient data structures