# Deep-Research PDF Field Extractor

A tool for extracting structured data from PDF documents. It handles a range of document types and extracts the fields you specify.

## For End Users

### Overview
The PDF Field Extractor helps you extract specific information from PDF documents. It can extract any fields you specify, such as dates, names, values, locations, and more. The tool is particularly useful for converting unstructured PDF data into structured, analyzable formats.

### How to Use

1. **Upload Your PDF**
   - Click the "Upload PDF" button
   - Select your PDF file from your computer

2. **Specify Fields to Extract**
   - Enter the fields you want to extract, separated by commas
   - Example: `Date, Name, Value, Location, Page, FileName`

3. **Optional: Add Field Descriptions**
   - Provide additional context about the fields
   - This helps the system better understand what to look for

4. **Run Extraction**
   - Click the "Run extraction" button
   - Wait for the process to complete
   - View your results in a table format

5. **Download Results**
   - Download your extracted data as a CSV file
   - View execution traces and logs if needed

### Features
- Automatic document type detection
- Smart field extraction
- Support for tables and text
- Detailed execution traces
- Downloadable results and logs

## For Developers

### Architecture Overview

The application is built using a multi-agent architecture with the following components:

#### Core Components

1. **Planner (`orchestrator/planner.py`)**
   - Generates execution plans using Azure OpenAI

2. **Executor (`orchestrator/executor.py`)**
   - Executes the generated plan
   - Manages agent execution flow
   - Handles context and result management

3. **Agents**
   - `PDFAgent`: Extracts text from PDFs
   - `TableAgent`: Extracts tables from PDFs
   - `FieldMapper`: Maps fields to values
   - `ForEachField`: Control flow for field iteration
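
The way a plan, the executor, and the agents fit together can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the `Agent` type alias, the `run_plan` function, the step format with `tool` and `output` keys, and the stand-in agents are all hypothetical.

```python
from typing import Any, Callable, Dict, List

# Hypothetical: an "agent" here is any callable that reads the shared
# context and returns a result. The real agent classes may differ.
Agent = Callable[[Dict[str, Any]], Any]


def run_plan(plan: List[Dict[str, str]], agents: Dict[str, Agent],
             ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Run each plan step in order, storing every result in the context."""
    for step in plan:
        agent = agents[step["tool"]]      # look up the agent by name
        ctx[step["output"]] = agent(ctx)  # store the result under the step's key
    return ctx


# Stand-in agents that mimic the roles listed above (no real PDF parsing).
agents: Dict[str, Agent] = {
    "PDFAgent": lambda ctx: "Date: 2024-01-05\nName: Sample Report",
    "TableAgent": lambda ctx: [],
    "FieldMapper": lambda ctx: {
        f: next((line.split(": ", 1)[1] for line in ctx["text"].splitlines()
                 if line.startswith(f + ": ")), None)
        for f in ctx["fields"]
    },
}

# A plan is an ordered list of steps; in the real tool the planner generates it.
plan = [
    {"tool": "PDFAgent", "output": "text"},
    {"tool": "TableAgent", "output": "tables"},
    {"tool": "FieldMapper", "output": "values"},
]

result = run_plan(plan, agents, {"fields": ["Date", "Name", "Location"]})
print(result["values"])
```

Keeping all intermediate results in one context dictionary lets later steps (here the field mapper) read what earlier steps produced without the executor knowing anything about their contents.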

### Agent Pipeline

1. **Document Processing** (in stages)
   1. PDF text extraction
   2. Table extraction
   3. Field mapping

2. **Field Extraction Process**
   - Document type inference
   - User profile determination
   - Page-by-page scanning
   - Value extraction and validation

3. **Context Building**
   - Document metadata
   - Field descriptions
   - User context
   - Execution history
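
One way to picture the context described above is a small container that gathers the four parts and flattens them into prompt text. The sketch below is an assumption for illustration only: the `ExtractionContext` class, its attribute names, and the `to_prompt` method are hypothetical and do not come from the project.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExtractionContext:
    """Hypothetical container for the four context parts listed above."""
    document_metadata: Dict[str, str]
    field_descriptions: Dict[str, str]
    user_context: str = ""
    execution_history: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Flatten the context into plain text suitable for an LLM prompt."""
        lines = [f"{k}: {v}" for k, v in self.document_metadata.items()]
        lines += [f"Field '{n}': {d}" for n, d in self.field_descriptions.items()]
        if self.user_context:
            lines.append(f"User context: {self.user_context}")
        lines += [f"Previous step: {h}" for h in self.execution_history]
        return "\n".join(lines)


ctx = ExtractionContext(
    document_metadata={"Document type": "Analytical report"},
    field_descriptions={"Date": "Date the report was issued"},
    user_context="Data analyst",
)
ctx.execution_history.append("PDFAgent: ok")
print(ctx.to_prompt())
```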

### Key Features

#### Document Type Inference
The system automatically infers the document type and the likely user profile. Example inference:
```
Document type: Analytical report
User profile: Data analysts or researchers working with document analysis
```

#### Field Mapping
The `FieldMapper` agent works in four steps:
1. Document context analysis
2. Page-by-page scanning
3. Value extraction using an LLM
4. Result validation
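
The scan-extract-validate loop can be sketched as below. To keep the example runnable without an API key, a regular expression stands in for the LLM call; the `Extractor` signature, `regex_extractor`, and `map_field` are hypothetical names, not the project's API.

```python
import re
from typing import Callable, List, Optional

# Hypothetical signature: (field name, page text) -> extracted value or None.
# In the real tool this role is played by an LLM call.
Extractor = Callable[[str, str], Optional[str]]


def regex_extractor(field_name: str, page_text: str) -> Optional[str]:
    """Stand-in for the LLM: find 'Field: value' on a line of the page."""
    match = re.search(rf"^{re.escape(field_name)}:\s*(.+)$", page_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def map_field(field_name: str, pages: List[str], extract: Extractor,
              is_valid: Callable[[str], bool] = bool) -> Optional[dict]:
    """Scan pages in order and return the first value that passes validation."""
    for page_number, text in enumerate(pages, start=1):
        value = extract(field_name, text)
        if value is not None and is_valid(value):
            return {"field": field_name, "value": value, "page": page_number}
    return None  # field not found on any page


pages = ["Title page\nName: Sample Report", "Date: 2024-01-05\nLocation: Lab 3"]
print(map_field("Date", pages, regex_extractor))
print(map_field("Value", pages, regex_extractor))
```

Passing the extractor in as a parameter means the page loop and validation can be tested without a network call, with the LLM-backed extractor swapped in at runtime.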

#### Execution Traces
The system maintains detailed execution traces:
- Tool execution history
- Success/failure status
- Detailed logs
- Result storage
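
A trace with the properties listed above might look like the following sketch. `TraceEntry` and `ExecutionTrace` are hypothetical names chosen for the example; the actual trace format used by the application is not documented here.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class TraceEntry:
    """One tool invocation: what ran, whether it worked, and what it returned."""
    tool: str
    success: bool
    duration_s: float
    result: Any = None
    error: Optional[str] = None


@dataclass
class ExecutionTrace:
    entries: List[TraceEntry] = field(default_factory=list)

    def run(self, tool: str, fn: Callable[[], Any]) -> Any:
        """Call fn, record the outcome, and return its result (None on failure)."""
        start = time.perf_counter()
        try:
            result = fn()
        except Exception as exc:
            elapsed = time.perf_counter() - start
            self.entries.append(TraceEntry(tool, False, elapsed, error=str(exc)))
            return None
        elapsed = time.perf_counter() - start
        self.entries.append(TraceEntry(tool, True, elapsed, result=result))
        return result


trace = ExecutionTrace()
trace.run("PDFAgent", lambda: "page text")
trace.run("TableAgent", lambda: 1 / 0)  # fails; the error is recorded, not raised
for entry in trace.entries:
    print(entry.tool, "ok" if entry.success else f"failed: {entry.error}")
```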

### Technical Setup

1. **Dependencies**
   - `streamlit`
   - `pandas`
   - PyMuPDF (`fitz`)
   - Azure OpenAI
   - Azure Document Intelligence

2. **Configuration**
   - Environment variables for API keys
   - Prompt templates in `config/prompts.yaml`
   - Settings in `config/settings.py`

3. **Logging System**
   - Custom `LogCaptureHandler` for UI display
   - Structured logging format
   - Execution history storage
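
A handler that captures log output for display could be built on the standard `logging.Handler` class, as sketched below. Only the name `LogCaptureHandler` comes from this README; the implementation, the `records` attribute, the format string, and the logger name are assumptions.

```python
import logging
from typing import List


class LogCaptureHandler(logging.Handler):
    """Collect formatted log records in memory so a UI can display them."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        # format() applies the handler's formatter to the record
        self.records.append(self.format(record))


handler = LogCaptureHandler()
handler.setFormatter(logging.Formatter("%(levelname)s | %(name)s | %(message)s"))

logger = logging.getLogger("extractor.demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info("Extracted %d fields", 3)
logger.debug("Below the INFO threshold, so not captured")
print(handler.records)
```

In a Streamlit app, the captured lines could then be shown with a text widget after each run.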

### Development Guidelines

1. **Adding New Agents**
   - Inherit from the base agent class
   - Implement the required methods
   - Add the agent to the planner configuration

2. **Modifying Extraction Logic**
   - Update prompt templates
   - Modify field mapping logic
   - Adjust validation rules

3. **Extending Functionality**
   - Add new field types
   - Implement custom validators
   - Create new output formats
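
As an illustration of the first guideline, the sketch below adds a small agent that validates and normalizes date values. Everything in it is hypothetical: the `BaseAgent` class, the `execute` method, the `name` attribute, and the `REGISTRY` dictionary are stand-ins, since the project's real base class and planner configuration are not shown in this README.

```python
from abc import ABC, abstractmethod
from datetime import datetime
from typing import Any, Dict, Optional


class BaseAgent(ABC):
    """Hypothetical base class; the project's real base class may differ."""
    name: str = "base"

    @abstractmethod
    def execute(self, ctx: Dict[str, Any]) -> Any:
        ...


class DateNormalizer(BaseAgent):
    """Example new agent: validate a date value and rewrite it as ISO 8601."""
    name = "DateNormalizer"
    formats = ("%Y-%m-%d", "%d/%m/%Y")

    def execute(self, ctx: Dict[str, Any]) -> Optional[str]:
        raw = ctx.get("values", {}).get("Date")
        for fmt in self.formats:
            try:
                return datetime.strptime(raw, fmt).date().isoformat()
            except (ValueError, TypeError):
                continue  # wrong format or missing value: try the next format
        return None  # no format matched, so the value fails validation


# Hypothetical registry standing in for the planner configuration.
REGISTRY = {agent.name: agent for agent in (DateNormalizer(),)}
print(REGISTRY["DateNormalizer"].execute({"values": {"Date": "05/01/2024"}}))
```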

### Testing
- Unit tests for agents
- Integration tests for pipeline
- End-to-end testing with sample PDFs
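
A unit test at the smallest scale might look like the sketch below, which uses the standard `unittest` module. The function under test, `parse_field_list`, is a hypothetical helper that mirrors the comma-separated field input described in the user guide; it is not taken from the codebase.

```python
import unittest
from typing import List


def parse_field_list(raw: str) -> List[str]:
    """Split the comma-separated field input into clean field names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


class ParseFieldListTest(unittest.TestCase):
    def test_strips_whitespace(self):
        self.assertEqual(parse_field_list("Date, Name ,Value"),
                         ["Date", "Name", "Value"])

    def test_ignores_empty_entries(self):
        self.assertEqual(parse_field_list("Date,, ,Name,"), ["Date", "Name"])


# Run the tests programmatically so the example works outside a test runner.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseFieldListTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```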

### Deployment
- Streamlit app deployment
- Environment configuration
- API key management
- Logging setup
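
For environment configuration and API key management, one common pattern is to read the keys from environment variables at startup and fail early if any are missing. The sketch below assumes two variable names, `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT`; these names, the `Settings` class, and `load_settings` are illustrative and may not match the project's `config/settings.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; use whatever names the deployment defines.
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")


@dataclass(frozen=True)
class Settings:
    api_key: str
    endpoint: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required settings, failing early with a list of what is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(api_key=env["AZURE_OPENAI_API_KEY"],
                    endpoint=env["AZURE_OPENAI_ENDPOINT"])


# Demo with an explicit mapping so the example runs without real credentials.
demo = load_settings({
    "AZURE_OPENAI_API_KEY": "dummy-key",
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid",
})
print(demo.endpoint)
```

Reading keys only from the environment keeps them out of the repository and lets each deployment supply its own values.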

### Future Improvements
- Enhanced error handling
- Additional field types
- Improved validation
- Performance optimization
- Extended documentation