# Architecture Documentation
## System Overview
The Deep-Research PDF Field Extractor is a multi-agent system designed to extract structured data from biotech-related PDFs. The system uses Azure Document Intelligence for document processing and Azure OpenAI for intelligent field extraction.
## Core Architecture
### Multi-Agent Design
The system follows a multi-agent architecture where each agent has a specific responsibility:
```
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│    PDFAgent      │    │   TableAgent     │    │   IndexAgent     │
│                  │    │                  │    │                  │
│ • PDF Text       │───▶│ • Table          │───▶│ • Semantic       │
│   Extraction     │    │   Processing     │    │   Indexing       │
└──────────────────┘    └──────────────────┘    └──────────────────┘
                                                         │
                                                         ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ UniqueIndices    │    │ UniqueIndices    │    │ FieldMapper      │
│ Combinator       │    │ LoopAgent        │    │ Agent            │
│                  │    │                  │    │                  │
│ • Extract        │───▶│ • Loop through   │    │ • Extract        │
│   combinations   │    │   combinations   │    │   individual     │
│                  │    │ • Add fields     │    │   fields         │
└──────────────────┘    └──────────────────┘    └──────────────────┘
```
### Execution Flow
#### Original Strategy Flow
```
1. PDFAgent → Extract text from PDF
2. TableAgent → Process tables with Azure DI
3. IndexAgent → Create semantic search index
4. ForEachField → Iterate through fields
5. FieldMapperAgent → Extract each field value
```
#### Unique Indices Strategy Flow
```
1. PDFAgent → Extract text from PDF
2. TableAgent → Process tables with Azure DI
3. UniqueIndicesCombinator → Extract unique combinations
4. UniqueIndicesLoopAgent → Extract additional fields for each combination
```
## Agent Details
### PDFAgent
- **Purpose**: Extract text content from PDF files
- **Technology**: PyMuPDF (fitz)
- **Output**: Raw text content
- **Error Handling**: Graceful handling of corrupted PDFs
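As a rough illustration of this step, a minimal sketch using PyMuPDF follows; the helper name and the bytes-based input are assumptions for illustration, not the agent's actual interface.
```python
import fitz  # PyMuPDF


def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Hypothetical helper: return the concatenated text of every page."""
    try:
        doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    except Exception as exc:  # corrupted or unreadable PDF
        raise ValueError(f"Could not open PDF: {exc}") from exc
    # Join page texts with form feeds so page boundaries remain visible downstream
    return "\f".join(page.get_text() for page in doc)
```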
### TableAgent
- **Purpose**: Process tables using Azure Document Intelligence
- **Technology**: Azure DI Layout Analysis
- **Features**:
  - Table structure preservation
  - Rowspan/colspan handling
  - HTML table generation for debugging
- **Output**: Processed table data
### UniqueIndicesCombinator
- **Purpose**: Extract unique combinations of specified indices
- **Input**: Document text, unique indices descriptions
- **LLM Prompt**: Structured prompt for combination extraction
- **Output**: JSON array of unique combinations
- **Cost Tracking**: Tracks input/output tokens
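A minimal sketch of how a combinator step like this could be wired, assuming a hypothetical `llm_client.responses()` call that returns raw JSON text; the real prompt wording and parsing rules live in the agent itself.
```python
import json


def extract_unique_combinations(llm_client, text: str, indices: dict[str, str]) -> list[dict]:
    # indices maps an index name to its description, e.g. {"Sample ID": "unique sample identifier"}
    described = "\n".join(f"- {name}: {desc}" for name, desc in indices.items())
    prompt = (
        "From the document below, list every unique combination of these indices "
        "as a JSON array of objects.\n"
        f"Indices:\n{described}\n\nDocument:\n{text}"
    )
    raw = llm_client.responses(prompt)
    return json.loads(raw)  # expected shape: [{"<index>": "<value>", ...}, ...]
```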
### UniqueIndicesLoopAgent
- **Purpose**: Extract additional fields for each unique combination
- **Input**: Unique combinations, field descriptions
- **Process**: Loops through each combination
- **LLM Calls**: One call per combination
- **Error Handling**: Continues with partial failures
- **Output**: Complete data with all fields
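The per-combination loop and its graceful handling of partial failures could look roughly like this; `extract_fields_for_combination` is a hypothetical stand-in for the agent's LLM call.
```python
import logging

logger = logging.getLogger(__name__)


def fill_fields(combinations: list[dict], field_names: list[str], extract_fields_for_combination) -> list[dict]:
    results = []
    for combo in combinations:
        try:
            # One LLM call per combination returns the remaining field values
            fields = extract_fields_for_combination(combo, field_names)
        except Exception:
            logger.exception("Extraction failed for combination %s", combo)
            fields = {name: None for name in field_names}  # continue with nulls
        results.append({**combo, **fields})
    return results
```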
### FieldMapperAgent
- **Purpose**: Extract individual field values
- **Strategies**:
  - Page-by-page analysis
  - Semantic search fallback
  - Unique indices strategy
- **Features**: Context-aware extraction
- **Output**: Field values with confidence scores
### IndexAgent
- **Purpose**: Create semantic search indices
- **Technology**: Azure OpenAI Embeddings
- **Features**: Chunk-based indexing
- **Output**: Searchable document index
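A rough sketch of chunk-based indexing against Azure OpenAI embeddings; the chunk size, deployment name, and in-memory index structure are assumptions for illustration.
```python
from openai import AzureOpenAI


def build_index(client: AzureOpenAI, text: str, deployment: str, chunk_size: int = 1000):
    # Naive fixed-size character chunking; the real agent may chunk differently
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    response = client.embeddings.create(model=deployment, input=chunks)
    # Pair each chunk with its embedding vector for later similarity search
    return [(chunk, item.embedding) for chunk, item in zip(chunks, response.data)]
```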
## Services
### LLMClient
```python
class LLMClient:
    def __init__(self, settings):
        # Azure OpenAI configuration
        self._deployment = settings.AZURE_OPENAI_DEPLOYMENT
        self._max_retries = settings.LLM_MAX_RETRIES
        self._base_delay = settings.LLM_BASE_DELAY

    def responses(self, prompt, **kwargs):
        # Retry logic with exponential backoff,
        # cost tracking integration, and error handling
        ...
```
**Key Features:**
- Retry logic with exponential backoff
- Cost tracking integration
- Error classification (retryable vs. non-retryable)
- Jitter to prevent thundering herd
### CostTracker
```python
from typing import List

class CostTracker:
    def __init__(self):
        self.llm_calls: List[LLMCall] = []
        self.current_file_costs = {}
        self.total_costs = {}

    def add_llm_tokens(self, input_tokens, output_tokens, description):
        # Track individual LLM calls, calculate costs,
        # and store detailed information
        ...
```
**Key Features:**
- Individual call tracking
- Cost calculation based on Azure pricing
- Detailed breakdown by operation
- Session and total cost tracking
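A minimal sketch of the token-to-cost bookkeeping; the per-1K-token rates below are placeholders, not actual Azure pricing.
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LLMCall:
    description: str
    input_tokens: int
    output_tokens: int
    cost: float


@dataclass
class SimpleCostTracker:
    # Placeholder prices per 1K tokens -- substitute the rates for your deployment
    input_price_per_1k: float = 0.005
    output_price_per_1k: float = 0.015
    llm_calls: List[LLMCall] = field(default_factory=list)

    def add_llm_tokens(self, input_tokens: int, output_tokens: int, description: str) -> float:
        cost = (input_tokens / 1000) * self.input_price_per_1k \
            + (output_tokens / 1000) * self.output_price_per_1k
        self.llm_calls.append(LLMCall(description, input_tokens, output_tokens, cost))
        return cost

    @property
    def total_cost(self) -> float:
        return sum(call.cost for call in self.llm_calls)
```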
### AzureDIService
```python
class AzureDIService:
    def extract_tables(self, pdf_bytes):
        # Azure DI Layout Analysis,
        # table structure preservation,
        # and HTML debugging output
        ...
```
**Key Features:**
- Layout analysis for complex documents
- Table structure preservation
- Debug output generation
- Error handling for DI operations
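A hedged sketch of the layout-analysis call, assuming the GA `azure-ai-documentintelligence` SDK; exact parameter names vary between SDK versions, so treat this as illustrative rather than the service's actual code.
```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential


def extract_tables(endpoint: str, key: str, pdf_bytes: bytes) -> list[dict]:
    client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))
    poller = client.begin_analyze_document(
        "prebuilt-layout", pdf_bytes, content_type="application/pdf"
    )
    result = poller.result()
    tables = []
    for table in result.tables or []:
        # Keep row/column indices and spans so the table structure is preserved
        cells = [
            {
                "row": cell.row_index,
                "col": cell.column_index,
                "row_span": cell.row_span or 1,
                "col_span": cell.column_span or 1,
                "content": cell.content,
            }
            for cell in table.cells
        ]
        tables.append({"rows": table.row_count, "cols": table.column_count, "cells": cells})
    return tables
```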
## Data Flow
### Context Management
The system uses a context dictionary to pass data between agents:
```python
ctx = {
    "pdf_file": pdf_file,
    "text": extracted_text,
    "fields": field_list,
    "unique_indices": unique_indices,
    "field_descriptions": field_descriptions,
    "cost_tracker": cost_tracker,
    "results": [],
    "strategy": strategy,
}
```
### Result Processing
Results are processed through multiple stages:
1. **Raw Extraction**: LLM responses in JSON format
2. **Validation**: JSON parsing and structure validation
3. **Flattening**: Convert to tabular format
4. **DataFrame**: Final structured output
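Stages 2 through 4 amount to parsing the raw LLM responses and flattening them into rows; a sketch with pandas, assuming each response is a JSON object or array of objects:
```python
import json

import pandas as pd


def results_to_dataframe(raw_results: list[str]) -> pd.DataFrame:
    rows = []
    for raw in raw_results:
        parsed = json.loads(raw)   # stage 2: validate JSON structure
        if isinstance(parsed, dict):
            parsed = [parsed]
        rows.extend(parsed)        # stage 3: flatten to one dict per row
    return pd.DataFrame(rows)      # stage 4: tabular output
```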
## Error Handling Strategy
### Retry Logic
```python
def _should_retry(self, exception) -> bool:
    # Retry on 5xx errors
    if hasattr(exception, 'status_code'):
        return exception.status_code >= 500
    # Retry on connection errors
    return any(error in str(exception) for error in ['Timeout', 'Connection'])
```
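`_should_retry` plugs into a backoff loop along these lines; the sketch reuses the retry settings shown under Configuration Management and is illustrative, not the client's exact code.
```python
import random
import time


def call_with_retries(func, should_retry, max_retries: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries - 1 or not should_retry(exc):
                raise
            # Exponential backoff with jitter to avoid thundering-herd retries
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))
```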
### Graceful Degradation
- Continue processing with partial failures
- Return null values for failed extractions
- Log detailed error information
- Maintain cost tracking during failures
### Error Classification
- **Retryable**: 5xx server errors (500, 503), connection timeouts
- **Non-retryable**: 4xx client errors (400, 401), validation errors
- **Fatal**: Configuration errors, missing dependencies
## Performance Considerations
### Optimization Strategies
1. **Parallel Processing**: Independent field extraction (see the sketch after this list)
2. **Caching**: Session state for field descriptions
3. **Batching**: Group similar operations
4. **Early Termination**: Stop on critical failures
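For item 1, field extractions that do not depend on each other can be fanned out over a thread pool, since the work is I/O-bound on the LLM API; `extract_field` below is a hypothetical per-field callable.
```python
from concurrent.futures import ThreadPoolExecutor


def extract_fields_parallel(fields: list[str], extract_field, max_workers: int = 4) -> dict:
    # Each field extraction is independent, so they can run concurrently;
    # keep max_workers modest to respect Azure rate limits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = pool.map(extract_field, fields)
    return dict(zip(fields, values))
```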
### Resource Management
- **Memory**: Efficient text processing
- **API Limits**: Respect Azure rate limits
- **Cost Control**: Detailed tracking and alerts
- **Timeout Handling**: Configurable timeouts
## Security
### Data Protection
- No persistent storage of sensitive data
- Secure API key management
- Session-based data handling
- Log sanitization
### Access Control
- Environment variable configuration
- API key validation
- Error message sanitization
## Monitoring and Observability
### Logging Strategy
```python
# Structured logging with levels
logger.info(f"Processing {len(combinations)} combinations")
logger.debug(f"LLM response: {response[:200]}...")
logger.error(f"Failed to extract field: {field}")
```
### Metrics Collection
- LLM call counts and durations
- Token usage and costs
- Success/failure rates
- Processing times
### Debug Information
- Detailed execution traces
- Cost breakdown tables
- Error context and stack traces
- Performance metrics
## Configuration Management
### Settings Structure
```python
class Settings(BaseSettings):
    # Azure OpenAI
    AZURE_OPENAI_ENDPOINT: str
    AZURE_OPENAI_API_KEY: str
    AZURE_OPENAI_DEPLOYMENT: str

    # Azure Document Intelligence
    AZURE_DI_ENDPOINT: str
    AZURE_DI_KEY: str

    # Retry Configuration
    LLM_MAX_RETRIES: int = 5
    LLM_BASE_DELAY: float = 1.0
    LLM_MAX_DELAY: float = 60.0
```
### Environment Variables
- `.env` file support
- Environment variable override
- Validation and defaults
- Secure key management
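With pydantic-settings, `.env` support and environment-variable override come from the model config; a minimal sketch, assuming pydantic-settings v2:
```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Values come from the process environment first, then from .env
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    AZURE_OPENAI_ENDPOINT: str
    AZURE_OPENAI_API_KEY: str
    AZURE_OPENAI_DEPLOYMENT: str
    LLM_MAX_RETRIES: int = 5


settings = Settings()  # raises a validation error if required keys are missing
```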
## Testing Strategy
### Unit Tests
- Individual agent testing
- Service layer testing
- Mock external dependencies
- Cost tracking validation
### Integration Tests
- End-to-end workflows
- Error scenario testing
- Performance benchmarking
- Cost accuracy validation
### Test Coverage
- Core functionality: 90%+
- Error handling: 100%
- Cost tracking: 100%
- Retry logic: 100%
## Deployment
### Requirements
- Python 3.9+
- Azure OpenAI access
- Azure Document Intelligence access
- Streamlit for UI
### Dependencies
```
azure-ai-documentintelligence
openai
streamlit
pandas
pymupdf
pydantic-settings
```
### Environment Setup
1. Install dependencies
2. Configure environment variables
3. Set up Azure resources
4. Test connectivity
5. Deploy application
## Future Enhancements
### Planned Features
- **Batch Processing**: Multiple document processing
- **Custom Models**: Domain-specific extraction
- **Advanced Caching**: Redis-based caching
- **API Endpoints**: REST API for integration
- **Real-time Processing**: Streaming document processing
### Scalability Improvements
- **Microservices**: Agent separation
- **Queue System**: Asynchronous processing
- **Load Balancing**: Multiple instances
- **Database Integration**: Persistent storage
### Performance Optimizations
- **Vector Search**: Enhanced semantic search
- **Model Optimization**: Smaller, faster models
- **Parallel Processing**: Multi-threaded extraction
- **Memory Optimization**: Efficient data structures