# Hospital Customization Evaluation System
This directory contains a comprehensive evaluation framework for analyzing the performance of hospital customization in the OnCall.ai RAG system. The system provides detailed metrics, visualizations, and insights specifically focused on hospital-only retrieval performance.
## Overview
The Hospital Customization Evaluation System evaluates three key performance metrics:
- **Metric 1 (Latency)**: Total execution time and hospital customization overhead
- **Metric 3 (Relevance)**: Average similarity scores from hospital content
- **Metric 4 (Coverage)**: Keyword overlap between generated advice and hospital content
## System Components
### Core Modules (`modules/`)
#### 1. `metrics_calculator.py`
The `HospitalCustomizationMetrics` class calculates comprehensive performance metrics:
- **Latency Analysis**: Execution time breakdown, customization overhead percentage
- **Relevance Analysis**: Hospital content similarity scores, relevance distribution
- **Coverage Analysis**: Keyword overlap, advice completeness, medical concept coverage
Key Features:
- Modular metric calculation for each performance dimension
- Statistical analysis (mean, median, std dev, min/max)
- Query type breakdown (broad/medium/specific)
- Comprehensive medical keyword dictionary for coverage analysis
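For orientation, here is a minimal sketch of the kind of statistical summary the calculator produces for latency. The `total_time` key is an assumption used for illustration, not the module's actual result schema:

```python
# Sketch only: summarize per-query execution times (seconds).
# "total_time" is an illustrative field name, not the real schema.
from statistics import mean, median, stdev

def summarize_latency(query_results):
    """Return mean/median/std/min/max over per-query execution times."""
    times = [r["total_time"] for r in query_results]
    return {
        "mean": mean(times),
        "median": median(times),
        "std_dev": stdev(times) if len(times) > 1 else 0.0,
        "min": min(times),
        "max": max(times),
    }
```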
#### 2. `chart_generator.py`
The `HospitalCustomizationChartGenerator` class creates publication-ready visualizations:
- **Latency Charts**: Bar charts by query type, customization breakdown pie charts
- **Relevance Charts**: Scatter plots, hospital vs general comparison charts
- **Coverage Charts**: Coverage percentage bars, keyword overlap heatmaps
- **Comprehensive Dashboard**: Multi-panel overview with key insights
Key Features:
- High-resolution PNG output with consistent styling
- Interactive color schemes and professional formatting
- Comprehensive dashboard combining all metrics
- Automatic chart organization and file management
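As a rough illustration of what each chart function does under the hood, the sketch below renders a latency-by-query-type bar chart and saves it as a high-resolution PNG. The function name and input shape are assumptions, not the class's actual API:

```python
# Illustrative only: average latency per query type, saved as high-res PNG.
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

def plot_latency_by_query_type(avg_latency_by_type, output_path):
    """avg_latency_by_type: e.g. {"broad": 42.1, "medium": 35.6, "specific": 28.9}"""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.bar(list(avg_latency_by_type.keys()), list(avg_latency_by_type.values()))
    ax.set_xlabel("Query type")
    ax.set_ylabel("Average execution time (s)")
    ax.set_title("Latency by Query Type (Hospital Only)")
    fig.savefig(output_path, dpi=300, bbox_inches="tight")
    plt.close(fig)
    return output_path
```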
#### 3. `query_executor.py`
Enhanced query execution with hospital-specific focus:
- **Hospital Only Mode**: Executes queries using only hospital customization
- **Detailed Logging**: Comprehensive execution metadata and timing
- **Error Handling**: Robust error management with detailed reporting
- **Batch Processing**: Efficient handling of multiple queries
### Evaluation Scripts
#### 1. `hospital_customization_evaluator.py`
Main evaluation orchestrator that:
- Coordinates all evaluation components
- Executes 6 test queries in Hospital Only mode
- Calculates comprehensive metrics
- Generates visualization charts
- Saves detailed results and reports
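Conceptually, the orchestration boils down to wiring the modules together. The condensed sketch below uses the module APIs shown under Advanced Usage; the real script adds logging, summaries, and insight generation:

```python
# Condensed sketch of the orchestration flow (not the evaluator's actual code).
import json
from datetime import datetime
from evaluation.modules.query_executor import QueryExecutor
from evaluation.modules.metrics_calculator import HospitalCustomizationMetrics
from evaluation.modules.chart_generator import HospitalCustomizationChartGenerator

def run_evaluation():
    executor = QueryExecutor()
    queries = executor.load_queries("evaluation/queries/test_queries.json")
    results = executor.execute_batch(queries, retrieval_mode="Hospital Only")

    metrics = HospitalCustomizationMetrics().calculate_comprehensive_metrics(results)
    charts = HospitalCustomizationChartGenerator("evaluation/results/charts").generate_latency_charts(metrics)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_file = f"evaluation/results/hospital_customization_evaluation_{timestamp}.json"
    with open(out_file, "w") as f:
        json.dump({"metrics": metrics, "charts": charts}, f, indent=2, default=str)
    return out_file
```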
#### 2. `test_hospital_customization_pipeline.py`
Standalone testing script that:
- Tests core modules without full system dependencies
- Uses sample data to validate functionality
- Generates test charts and metrics
- Verifies pipeline integrity
#### 3. `run_hospital_evaluation.py`
Simple runner script for easy evaluation execution:
- User-friendly interface for running evaluations
- Clear error messages and troubleshooting tips
- Result summary and next steps guidance
## Usage Instructions
### Quick Start
1. **Basic Evaluation**:
```bash
python evaluation/run_hospital_evaluation.py
```
2. **Component Testing**:
```bash
python evaluation/test_hospital_customization_pipeline.py
```
### Advanced Usage
#### Direct Module Usage
```python
from evaluation.modules.metrics_calculator import HospitalCustomizationMetrics
from evaluation.modules.chart_generator import HospitalCustomizationChartGenerator
# Calculate metrics
calculator = HospitalCustomizationMetrics()
metrics = calculator.calculate_comprehensive_metrics(query_results)
# Generate charts
chart_gen = HospitalCustomizationChartGenerator("output/charts")
chart_files = chart_gen.generate_latency_charts(metrics)
```
#### Custom Query Execution
```python
from evaluation.modules.query_executor import QueryExecutor
executor = QueryExecutor()
queries = executor.load_queries("evaluation/queries/test_queries.json")
results = executor.execute_batch(queries, retrieval_mode="Hospital Only")
```
### Prerequisites
1. **System Requirements**:
- Python 3.8+
- OnCall.ai RAG system properly configured
- Hospital customization pipeline functional
2. **Dependencies**:
- matplotlib, seaborn (for chart generation)
- numpy (for statistical calculations)
- Standard Python libraries (json, pathlib, datetime, etc.)
3. **Environment Setup**:
```bash
source rag_env/bin/activate # Activate virtual environment
pip install matplotlib seaborn numpy # Install visualization dependencies
```
## Output Structure
### Results Directory (`results/`)
After running an evaluation, the following files are generated:
```
results/
├── hospital_customization_evaluation_YYYYMMDD_HHMMSS.json   # Complete results
├── hospital_customization_summary_YYYYMMDD_HHMMSS.txt       # Human-readable summary
└── charts/
    ├── latency_by_query_type_YYYYMMDD_HHMMSS.png
    ├── customization_breakdown_YYYYMMDD_HHMMSS.png
    ├── relevance_scatter_plot_YYYYMMDD_HHMMSS.png
    ├── hospital_vs_general_comparison_YYYYMMDD_HHMMSS.png
    ├── coverage_percentage_YYYYMMDD_HHMMSS.png
    └── hospital_customization_dashboard_YYYYMMDD_HHMMSS.png
```
### Results File Structure
The comprehensive results JSON contains:
```json
{
"evaluation_metadata": {
"timestamp": "2025-08-05T15:30:00.000000",
"evaluation_type": "hospital_customization",
"retrieval_mode": "Hospital Only",
"total_queries": 6,
"successful_queries": 6
},
"query_execution_results": {
"raw_results": [...],
"execution_summary": {...}
},
"hospital_customization_metrics": {
"metric_1_latency": {...},
"metric_3_relevance": {...},
"metric_4_coverage": {...},
"summary": {...}
},
"visualization_charts": {...},
"evaluation_insights": [...],
"recommendations": [...]
}
```
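To work with a results file programmatically, load the JSON and walk the keys shown above. The filename below is an example; leaf fields inside each metrics block depend on the calculator's output:

```python
# Load a results file and pull a few headline numbers (example filename).
import json

with open("evaluation/results/hospital_customization_evaluation_20250805_153000.json") as f:
    results = json.load(f)

meta = results["evaluation_metadata"]
print(f"{meta['successful_queries']}/{meta['total_queries']} queries succeeded "
      f"in {meta['retrieval_mode']} mode")

latency = results["hospital_customization_metrics"]["metric_1_latency"]
coverage = results["hospital_customization_metrics"]["metric_4_coverage"]
```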
## Key Metrics Explained
### Metric 1: Latency Analysis
- **Total Execution Time**: Complete query processing duration
- **Customization Time**: Time spent on hospital-specific processing
- **Customization Percentage**: Hospital processing as % of total time
- **Query Type Breakdown**: Performance by query specificity
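In other words, the customization percentage is the hospital-specific share of total query time (a sketch with illustrative names, not the calculator's exact code):

```python
def customization_percentage(customization_time_s, total_time_s):
    """Hospital-specific processing time as a share of total query time."""
    return 100.0 * customization_time_s / total_time_s if total_time_s else 0.0
```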
### Metric 3: Relevance Analysis
- **Hospital Content Relevance**: Average similarity scores for hospital guidelines
- **Relevance Distribution**: Low/Medium/High relevance score breakdown
- **Hospital vs General**: Comparison between content types
- **Quality Assessment**: Overall relevance quality rating
### Metric 4: Coverage Analysis
- **Keyword Overlap**: Percentage of medical keywords covered in advice
- **Advice Completeness**: Structural completeness assessment
- **Medical Concept Coverage**: Coverage of key medical concepts
- **Coverage Patterns**: Analysis of coverage effectiveness
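A minimal reading of keyword overlap is sketched below, assuming a list of expected medical keywords per query; the real calculator uses a richer keyword dictionary and tokenization:

```python
def keyword_coverage(advice_text, expected_keywords):
    """Percentage of expected keywords that appear in the generated advice."""
    advice_lower = advice_text.lower()
    hits = [kw for kw in expected_keywords if kw.lower() in advice_lower]
    return 100.0 * len(hits) / len(expected_keywords) if expected_keywords else 0.0
```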
## Performance Benchmarks
### Latency Performance Levels
- **Excellent**: < 30 seconds average execution time
- **Good**: 30-60 seconds average execution time
- **Needs Improvement**: > 60 seconds average execution time
### Relevance Quality Levels
- **High**: > 0.7 average relevance score
- **Medium**: 0.4-0.7 average relevance score
- **Low**: < 0.4 average relevance score
### Coverage Effectiveness Levels
- **Comprehensive**: > 70% keyword coverage
- **Adequate**: 40-70% keyword coverage
- **Limited**: < 40% keyword coverage
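The thresholds above map directly onto labels; a sketch of that mapping (the report's exact wording may differ):

```python
def latency_level(avg_seconds):
    if avg_seconds < 30:
        return "Excellent"
    return "Good" if avg_seconds <= 60 else "Needs Improvement"

def relevance_level(avg_score):
    if avg_score > 0.7:
        return "High"
    return "Medium" if avg_score >= 0.4 else "Low"

def coverage_level(pct):
    if pct > 70:
        return "Comprehensive"
    return "Adequate" if pct >= 40 else "Limited"
```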
## Troubleshooting
### Common Issues
1. **Import Errors**:
- Ensure virtual environment is activated
- Install missing dependencies
- Check Python path configuration
2. **OnCall.ai System Not Available**:
- Use `test_hospital_customization_pipeline.py` for testing
- Verify system initialization
- Check configuration files
3. **Chart Generation Failures**:
- Install matplotlib and seaborn
- Check output directory permissions
- Verify data format integrity
4. **Missing Hospital Guidelines**:
- Verify customization pipeline is configured
- Check hospital document processing
- Ensure ANNOY indices are built
### Error Messages
- `ModuleNotFoundError: No module named 'gradio'`: Use test script instead of full system
- `Interface not initialized`: OnCall.ai system needs proper setup
- `No data available`: Check query execution results format
- `Chart generation failed`: Install visualization dependencies
## Extending the System
### Adding New Metrics
1. **Extend Metrics Calculator**:
```python
def calculate_custom_metric(self, query_results):
    custom_metrics = {}
    # ... your custom metric calculation over query_results ...
    return custom_metrics
```
2. **Add Visualization**:
```python
def generate_custom_chart(self, metrics, timestamp):
    chart_file_path = f"custom_chart_{timestamp}.png"
    # ... build and save the figure to chart_file_path ...
    return chart_file_path
```
3. **Update Evaluator**:
- Include new metric in comprehensive calculation
- Add chart generation to pipeline
- Update result structure
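Putting the pattern together, a hypothetical custom metric might look like the following (the `success` field on each query result is an assumption for illustration):

```python
def calculate_success_rate(self, query_results):
    """Hypothetical custom metric: share of queries that completed successfully."""
    total = len(query_results)
    successes = sum(1 for r in query_results if r.get("success"))
    return {
        "total_queries": total,
        "successful_queries": successes,
        "success_rate_percent": 100.0 * successes / total if total else 0.0,
    }
```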
### Custom Query Sets
1. Create new query JSON file following the existing format
2. Modify evaluator to use custom queries:
```python
queries = evaluator.load_test_queries("path/to/custom_queries.json")
```
### Integration with Other Systems
The evaluation system is designed to be modular and can be integrated with:
- Continuous integration pipelines
- Performance monitoring systems
- A/B testing frameworks
- Quality assurance workflows
## Best Practices
1. **Regular Evaluation**: Run evaluations after system changes
2. **Baseline Comparison**: Track performance changes over time
3. **Query Diversity**: Use diverse query sets for comprehensive testing
4. **Result Analysis**: Review both metrics and visualizations
5. **Action on Insights**: Use recommendations for system improvements
## Support and Maintenance
For issues, improvements, or questions:
1. Check the troubleshooting section above
2. Review error messages and logs
3. Test with the standalone pipeline tester
4. Consult the OnCall.ai system documentation
The evaluation system is designed to be self-contained and robust, providing comprehensive insights into hospital customization performance with minimal setup requirements.