# Deep-Research PDF Field Extractor
A tool for extracting structured data from PDF documents. It handles a range of document types and extracts the fields you specify.
## For End Users
### Overview
The PDF Field Extractor helps you extract specific information from PDF documents. It can extract any fields you specify, such as dates, names, values, locations, and more. The tool is particularly useful for converting unstructured PDF data into structured, analyzable formats.
### How to Use
1. **Upload Your PDF**
- Click the "Upload PDF" button
- Select your PDF file from your computer
2. **Specify Fields to Extract**
- Enter the fields you want to extract, separated by commas
- Example: `Date, Name, Value, Location, Page, FileName`
3. **Optional: Add Field Descriptions**
- Provide additional context about the fields
- This helps the system better understand what to look for
4. **Run Extraction**
- Click the "Run extraction" button
- Wait for the process to complete
- View your results in a table format
5. **Download Results**
- Download your extracted data as a CSV file
- View execution traces and logs if needed
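As an illustration of steps 2 and 5, the sketch below shows one way a comma-separated field list could be parsed and the extracted rows written as CSV. The helper names `parse_field_list` and `rows_to_csv` and the sample values are hypothetical, and the standard `csv` module stands in for whatever the app actually uses to build its download.

```python
import csv
import io


def parse_field_list(raw: str) -> list[str]:
    """Split a comma-separated field string, dropping blanks and duplicates."""
    fields: list[str] = []
    for part in raw.split(","):
        name = part.strip()
        if name and name not in fields:
            fields.append(name)
    return fields


def rows_to_csv(fields: list[str], rows: list[dict[str, str]]) -> str:
    """Serialize extracted rows to CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        restval="",             # value used when a row lacks a field
        extrasaction="ignore",  # silently skip keys that were not requested
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample input, only to show the shape of the data.
fields = parse_field_list("Date, Name, Value, , Name")
csv_text = rows_to_csv(fields, [{"Date": "2024-01-05", "Name": "Sample"}])
```

Leaving missing values as empty cells keeps every row aligned with the requested field order, which makes the CSV easier to load into a spreadsheet or DataFrame.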
### Features
- Automatic document type detection
- Smart field extraction
- Support for tables and text
- Detailed execution traces
- Downloadable results and logs
## For Developers
### Architecture Overview
The application is built using a multi-agent architecture with the following components:
#### Core Components
1. **Planner (`orchestrator/planner.py`)**
- Generates execution plans using Azure OpenAI
2. **Executor (`orchestrator/executor.py`)**
- Executes the generated plan
- Manages agent execution flow
- Handles context and result management
3. **Agents**
- `PDFAgent`: Extracts text from PDFs
- `TableAgent`: Extracts tables from PDFs
- `FieldMapper`: Maps fields to values
- `ForEachField`: Control flow for field iteration
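To make the planner/executor split concrete, here is a minimal sketch of an executor that walks a plan and dispatches each step to an agent. The plan format (`tool`, `save_as`), the `AGENTS` registry, and the stub agent bodies are assumptions for illustration; the real `orchestrator/executor.py` may be organized differently.

```python
from typing import Any, Callable

# Hypothetical agent signature: read the shared context, return a result.
Agent = Callable[[dict[str, Any]], Any]


def pdf_agent(ctx: dict[str, Any]) -> str:
    # Stub standing in for real PDF text extraction.
    return f"text of {ctx['pdf_path']}"


def field_mapper(ctx: dict[str, Any]) -> dict[str, Any]:
    # Stub standing in for LLM-based mapping: every field starts unresolved.
    return {name: None for name in ctx["fields"]}


AGENTS: dict[str, Agent] = {"PDFAgent": pdf_agent, "FieldMapper": field_mapper}


def execute(plan: list[dict[str, str]], ctx: dict[str, Any]) -> dict[str, Any]:
    """Run each plan step in order, storing its result in the shared context."""
    for step in plan:
        agent = AGENTS[step["tool"]]
        ctx[step["save_as"]] = agent(ctx)
    return ctx


# A plan like this would normally come from the planner.
plan = [
    {"tool": "PDFAgent", "save_as": "text"},
    {"tool": "FieldMapper", "save_as": "values"},
]
result = execute(plan, {"pdf_path": "report.pdf", "fields": ["Date", "Name"]})
```

Keeping agents behind a name-to-callable registry means the planner only needs to emit tool names, and new agents can be added without changing the executor loop.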
### Agent Pipeline
1. **Document Processing**
```text
The document is processed in stages:
1. PDF text extraction
2. Table extraction
3. Field mapping
```
2. **Field Extraction Process**
- Document type inference
- User profile determination
- Page-by-page scanning
- Value extraction and validation
3. **Context Building**
- Document metadata
- Field descriptions
- User context
- Execution history
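The context-building step could be represented by a small container like the one below. This is a sketch under assumptions: the `ExtractionContext` class, its attribute names, and the `to_prompt` rendering are hypothetical and only mirror the four items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionContext:
    """Hypothetical container for the context passed along to the agents."""

    document_metadata: dict[str, str]
    field_descriptions: dict[str, str]
    user_context: str = ""
    execution_history: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the context as plain text, one item per line."""
        lines = [f"{key}: {value}" for key, value in self.document_metadata.items()]
        lines += [f"Field {name}: {desc}" for name, desc in self.field_descriptions.items()]
        if self.user_context:
            lines.append(f"User context: {self.user_context}")
        lines += [f"Previous step: {step}" for step in self.execution_history]
        return "\n".join(lines)


ctx = ExtractionContext(
    document_metadata={"filename": "report.pdf"},
    field_descriptions={"Date": "Publication date"},
)
ctx.execution_history.append("PDFAgent ok")
prompt_text = ctx.to_prompt()
```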
### Key Features
#### Document Type Inference
The system automatically infers the document type and user profile. An example inference:
```text
Document type: Analytical report
User profile: Data analysts or researchers working with document analysis
```
#### Field Mapping
The `FieldMapper` agent works in four steps:
1. Document context analysis
2. Page-by-page scanning
3. Value extraction using LLM
4. Result validation
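The scanning and validation steps above can be sketched as a loop over pages. In this illustration a regular expression stands in for the LLM call, and the names `map_field` and `regex_extractor` are hypothetical; the actual `FieldMapper` prompts and validation rules are not shown in this README.

```python
import re
from typing import Callable, Optional

# An extractor takes (field name, page text) and returns a value or None.
Extractor = Callable[[str, str], Optional[str]]


def regex_extractor(field_name: str, page_text: str) -> Optional[str]:
    """Stand-in for the LLM call: match a 'Field: value' line on the page."""
    pattern = rf"^{re.escape(field_name)}:[ \t]*(.+)$"
    match = re.search(pattern, page_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def map_field(
    field_name: str,
    pages: list[str],
    extract: Extractor = regex_extractor,
) -> dict[str, object]:
    """Scan pages in order and return the first non-empty value found."""
    for page_number, text in enumerate(pages, start=1):
        value = extract(field_name, text)
        if value:  # minimal validation: reject None and empty strings
            return {"field": field_name, "value": value, "page": page_number}
    return {"field": field_name, "value": None, "page": None}


# Made-up page texts for illustration.
pages = ["Cover page", "Date: 2024-01-05\nName: Sample"]
found = map_field("Date", pages)
missing = map_field("Location", pages)
```

Passing the extractor as a parameter keeps the page loop testable without network access; an LLM-backed callable with the same signature could be swapped in.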
#### Execution Traces
The system maintains detailed execution traces:
- Tool execution history
- Success/failure status
- Detailed logs
- Result storage
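One possible shape for such a trace is a list of per-tool records holding status, result, and error text. The `TraceEntry` and `ExecutionTrace` classes below are hypothetical and only illustrate the idea; they are not taken from the project's code.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class TraceEntry:
    tool: str
    success: bool
    result: Any = None
    error: str = ""
    duration_s: float = 0.0


@dataclass
class ExecutionTrace:
    entries: list[TraceEntry] = field(default_factory=list)

    def run(self, tool: str, func: Callable[[], Any]) -> Any:
        """Call func, record success or failure, and return its result (or None)."""
        start = time.perf_counter()
        try:
            result = func()
        except Exception as exc:  # record the failure instead of aborting the run
            elapsed = time.perf_counter() - start
            self.entries.append(TraceEntry(tool, False, error=str(exc), duration_s=elapsed))
            return None
        elapsed = time.perf_counter() - start
        self.entries.append(TraceEntry(tool, True, result=result, duration_s=elapsed))
        return result


def failing_tool() -> str:
    raise RuntimeError("no tables found")


trace = ExecutionTrace()
trace.run("PDFAgent", lambda: "page text")
trace.run("TableAgent", failing_tool)
```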
### Technical Setup
1. **Dependencies**
```text
Key dependencies:
- streamlit
- pandas
- PyMuPDF (fitz)
- Azure OpenAI
- Azure Document Intelligence
```
2. **Configuration**
- Environment variables for API keys
- Prompt templates in `config/prompts.yaml`
- Settings in `config/settings.py`
3. **Logging System**
```text
Custom logging setup:
- LogCaptureHandler for UI display
- Structured logging format
- Execution history storage
```
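The configuration and logging items above might look roughly like the following sketch. The environment variable names, the `Settings` class, and the log format are assumptions; only the name `LogCaptureHandler` comes from this README, and its body here is a guess built on the standard `logging.Handler` API.

```python
import logging
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    openai_endpoint: str


# Hypothetical variable names; the project's real names may differ.
REQUIRED = {
    "openai_api_key": "AZURE_OPENAI_API_KEY",
    "openai_endpoint": "AZURE_OPENAI_ENDPOINT",
}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required settings from the environment, failing early if any is unset."""
    missing = [var for var in REQUIRED.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(**{attr: env[var] for attr, var in REQUIRED.items()})


class LogCaptureHandler(logging.Handler):
    """Keep formatted log lines in memory so a UI can display them."""

    def __init__(self, level: int = logging.INFO) -> None:
        super().__init__(level)
        self.records: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(self.format(record))


logger = logging.getLogger("extractor_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger
handler = LogCaptureHandler()
handler.setFormatter(logging.Formatter("%(levelname)s | %(name)s | %(message)s"))
logger.addHandler(handler)

logger.info("Processing page %d", 1)
logger.debug("below the INFO threshold, so not captured")
```

In a Streamlit app, the captured `handler.records` list could then be rendered in the page after each run.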
### Development Guidelines
1. **Adding New Agents**
- Inherit from base agent class
- Implement required methods
- Add to planner configuration
2. **Modifying Extraction Logic**
- Update prompt templates
- Modify field mapping logic
- Adjust validation rules
3. **Extending Functionality**
- Add new field types
- Implement custom validators
- Create new output formats
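As one way to approach custom validators, the sketch below registers a validator per field type and applies it to extracted values. The registry, the decorator name `register_validator`, and the `"date"` field type are hypothetical; the project's own validation hooks may look different.

```python
from datetime import datetime
from typing import Callable

Validator = Callable[[str], bool]

# Hypothetical registry mapping a field type to its validator.
VALIDATORS: dict[str, Validator] = {}


def register_validator(field_type: str) -> Callable[[Validator], Validator]:
    """Decorator that registers a validator function for a field type."""
    def decorator(func: Validator) -> Validator:
        VALIDATORS[field_type] = func
        return func
    return decorator


@register_validator("date")
def is_iso_date(value: str) -> bool:
    """Accept values that strptime can parse as year-month-day."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def validate(field_type: str, value: str) -> bool:
    """Apply the registered validator; field types without one pass through."""
    validator = VALIDATORS.get(field_type)
    return True if validator is None else validator(value)
```

For example, `validate("date", "2024-01-05")` passes while `validate("date", "05/01/2024")` is rejected, and a field type with no registered validator is accepted unchanged.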
### Testing
- Unit tests for agents
- Integration tests for pipeline
- End-to-end testing with sample PDFs
### Deployment
- Streamlit app deployment
- Environment configuration
- API key management
- Logging setup
### Future Improvements
- Enhanced error handling
- Additional field types
- Improved validation
- Performance optimization
- Extended documentation