Hospital Customization Evaluation System
This directory contains a comprehensive evaluation framework for analyzing the performance of hospital customization in the OnCall.ai RAG system. The system provides detailed metrics, visualizations, and insights specifically focused on hospital-only retrieval performance.
Overview
The Hospital Customization Evaluation System evaluates three key performance metrics:
- Metric 1 (Latency): Total execution time and hospital customization overhead
- Metric 3 (Relevance): Average similarity scores from hospital content
- Metric 4 (Coverage): Keyword overlap between generated advice and hospital content
System Components
Core Modules (modules/)
1. metrics_calculator.py
The HospitalCustomizationMetrics class calculates comprehensive performance metrics:
- Latency Analysis: Execution time breakdown, customization overhead percentage
- Relevance Analysis: Hospital content similarity scores, relevance distribution
- Coverage Analysis: Keyword overlap, advice completeness, medical concept coverage
Key Features:
- Modular metric calculation for each performance dimension
- Statistical analysis (mean, median, std dev, min/max)
- Query type breakdown (broad/medium/specific)
- Comprehensive medical keyword dictionary for coverage analysis
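As a rough sketch of the statistical summary this module produces (the variable and key names here are illustrative, not the class's actual API), the aggregates can be computed with the standard library:
import statistics

# Sample per-query latencies in seconds (illustrative data only)
latencies = [28.4, 41.2, 33.7, 55.1, 29.8, 47.3]
summary = {
    "mean": statistics.mean(latencies),
    "median": statistics.median(latencies),
    "std_dev": statistics.stdev(latencies),
    "min": min(latencies),
    "max": max(latencies),
}
print(summary)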
2. chart_generator.py
The HospitalCustomizationChartGenerator class creates publication-ready visualizations:
- Latency Charts: Bar charts by query type, customization breakdown pie charts
- Relevance Charts: Scatter plots, hospital vs general comparison charts
- Coverage Charts: Coverage percentage bars, keyword overlap heatmaps
- Comprehensive Dashboard: Multi-panel overview with key insights
Key Features:
- High-resolution PNG output with consistent styling
- Cohesive color schemes and professional formatting
- Comprehensive dashboard combining all metrics
- Automatic chart organization and file management
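As an illustration of the kind of high-resolution PNG output described above (a minimal matplotlib sketch, not the generator's actual code; data and filename are placeholders):
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

query_types = ["broad", "medium", "specific"]
avg_latency = [52.3, 41.8, 33.5]  # seconds, sample data

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(query_types, avg_latency, color="#4C72B0")
ax.set_xlabel("Query type")
ax.set_ylabel("Average execution time (s)")
ax.set_title("Latency by Query Type")
fig.savefig("latency_by_query_type.png", dpi=300, bbox_inches="tight")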
3. query_executor.py
Enhanced query execution with hospital-specific focus:
- Hospital Only Mode: Executes queries using only hospital customization
- Detailed Logging: Comprehensive execution metadata and timing
- Error Handling: Robust error management with detailed reporting
- Batch Processing: Efficient handling of multiple queries
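A minimal sketch of the batch-with-error-handling pattern described above; execute_single and the result fields are illustrative assumptions, not the module's verified API:
import time

def execute_single(query, retrieval_mode):
    # Placeholder for the real hospital-only retrieval call
    return f"advice for: {query}"

def execute_batch(queries, retrieval_mode="Hospital Only"):
    results = []
    for query in queries:
        start = time.perf_counter()
        try:
            answer = execute_single(query, retrieval_mode)
            results.append({"query": query, "answer": answer,
                            "elapsed_s": time.perf_counter() - start,
                            "success": True})
        except Exception as exc:
            # Record the failure but keep processing the batch
            results.append({"query": query, "error": str(exc),
                            "elapsed_s": time.perf_counter() - start,
                            "success": False})
    return results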
Evaluation Scripts
1. hospital_customization_evaluator.py
Main evaluation orchestrator that:
- Coordinates all evaluation components
- Executes 6 test queries in Hospital Only mode
- Calculates comprehensive metrics
- Generates visualization charts
- Saves detailed results and reports
2. test_hospital_customization_pipeline.py
Standalone testing script that:
- Tests core modules without full system dependencies
- Uses sample data to validate functionality
- Generates test charts and metrics
- Verifies pipeline integrity
3. run_hospital_evaluation.py
Simple runner script for easy evaluation execution:
- User-friendly interface for running evaluations
- Clear error messages and troubleshooting tips
- Result summary and next steps guidance
Usage Instructions
Quick Start
Basic Evaluation:
python evaluation/run_hospital_evaluation.py
Component Testing:
python evaluation/test_hospital_customization_pipeline.py
Advanced Usage
Direct Module Usage
from evaluation.modules.metrics_calculator import HospitalCustomizationMetrics
from evaluation.modules.chart_generator import HospitalCustomizationChartGenerator

# Calculate metrics (query_results comes from a prior query execution run)
calculator = HospitalCustomizationMetrics()
metrics = calculator.calculate_comprehensive_metrics(query_results)

# Generate charts
chart_gen = HospitalCustomizationChartGenerator("output/charts")
chart_files = chart_gen.generate_latency_charts(metrics)
Custom Query Execution
from evaluation.modules.query_executor import QueryExecutor
executor = QueryExecutor()
queries = executor.load_queries("evaluation/queries/test_queries.json")
results = executor.execute_batch(queries, retrieval_mode="Hospital Only")
Prerequisites
System Requirements:
- Python 3.8+
- OnCall.ai RAG system properly configured
- Hospital customization pipeline functional
Dependencies:
- matplotlib, seaborn (for chart generation)
- numpy (for statistical calculations)
- Standard Python libraries (json, pathlib, datetime, etc.)
Environment Setup:
source rag_env/bin/activate           # Activate virtual environment
pip install matplotlib seaborn numpy  # Install visualization dependencies
Output Structure
Results Directory (results/)
After running an evaluation, the following files are generated:
results/
├── hospital_customization_evaluation_YYYYMMDD_HHMMSS.json  # Complete results
├── hospital_customization_summary_YYYYMMDD_HHMMSS.txt      # Human-readable summary
└── charts/
    ├── latency_by_query_type_YYYYMMDD_HHMMSS.png
    ├── customization_breakdown_YYYYMMDD_HHMMSS.png
    ├── relevance_scatter_plot_YYYYMMDD_HHMMSS.png
    ├── hospital_vs_general_comparison_YYYYMMDD_HHMMSS.png
    ├── coverage_percentage_YYYYMMDD_HHMMSS.png
    └── hospital_customization_dashboard_YYYYMMDD_HHMMSS.png
Results File Structure
The comprehensive results JSON contains:
{
"evaluation_metadata": {
"timestamp": "2025-08-05T15:30:00.000000",
"evaluation_type": "hospital_customization",
"retrieval_mode": "Hospital Only",
"total_queries": 6,
"successful_queries": 6
},
"query_execution_results": {
"raw_results": [...],
"execution_summary": {...}
},
"hospital_customization_metrics": {
"metric_1_latency": {...},
"metric_3_relevance": {...},
"metric_4_coverage": {...},
"summary": {...}
},
"visualization_charts": {...},
"evaluation_insights": [...],
"recommendations": [...]
}
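For example, a saved results file can be inspected with the standard library (replace the placeholder filename with your run's timestamped output):
import json

with open("results/hospital_customization_evaluation_YYYYMMDD_HHMMSS.json") as f:
    results = json.load(f)

meta = results["evaluation_metadata"]
print(f"{meta['successful_queries']}/{meta['total_queries']} queries succeeded")
print(results["hospital_customization_metrics"]["summary"])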
Key Metrics Explained
Metric 1: Latency Analysis
- Total Execution Time: Complete query processing duration
- Customization Time: Time spent on hospital-specific processing
- Customization Percentage: Hospital processing as % of total time
- Query Type Breakdown: Performance by query specificity
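The customization percentage is a simple ratio of customization time to total execution time, for example:
total_time = 45.0          # seconds, sample value
customization_time = 9.0   # seconds, sample value
customization_pct = customization_time / total_time * 100  # 20.0%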
Metric 3: Relevance Analysis
- Hospital Content Relevance: Average similarity scores for hospital guidelines
- Relevance Distribution: Low/Medium/High relevance score breakdown
- Hospital vs General: Comparison between content types
- Quality Assessment: Overall relevance quality rating
Metric 4: Coverage Analysis
- Keyword Overlap: Percentage of medical keywords covered in advice
- Advice Completeness: Structural completeness assessment
- Medical Concept Coverage: Coverage of key medical concepts
- Coverage Patterns: Analysis of coverage effectiveness
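As a simplified sketch of the keyword-overlap idea (the module uses its own comprehensive medical keyword dictionary; the set below is a tiny sample):
# Illustrative keyword-overlap computation on sample data
hospital_keywords = {"aspirin", "troponin", "ecg", "nitroglycerin", "statin"}
advice_text = "Obtain an ECG and serial troponin; give aspirin unless contraindicated."

advice_words = {w.strip(".,;").lower() for w in advice_text.split()}
covered = hospital_keywords & advice_words
coverage_pct = len(covered) / len(hospital_keywords) * 100  # 60.0%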
Performance Benchmarks
Latency Performance Levels
- Excellent: < 30 seconds average execution time
- Good: 30-60 seconds average execution time
- Needs Improvement: > 60 seconds average execution time
Relevance Quality Levels
- High: > 0.7 average relevance score
- Medium: 0.4-0.7 average relevance score
- Low: < 0.4 average relevance score
Coverage Effectiveness Levels
- Comprehensive: > 70% keyword coverage
- Adequate: 40-70% keyword coverage
- Limited: < 40% keyword coverage
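These thresholds can be applied mechanically to an evaluation summary; a small sketch with illustrative function names (threshold values taken directly from the tables above):
def latency_level(avg_seconds):
    if avg_seconds < 30:
        return "Excellent"
    return "Good" if avg_seconds <= 60 else "Needs Improvement"

def relevance_level(avg_score):
    if avg_score > 0.7:
        return "High"
    return "Medium" if avg_score >= 0.4 else "Low"

def coverage_level(pct):
    if pct > 70:
        return "Comprehensive"
    return "Adequate" if pct >= 40 else "Limited"

print(latency_level(42.5), relevance_level(0.63), coverage_level(55.0))
# -> Good Medium Adequate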
Troubleshooting
Common Issues
Import Errors:
- Ensure virtual environment is activated
- Install missing dependencies
- Check Python path configuration
OnCall.ai System Not Available:
- Use test_hospital_customization_pipeline.py for testing
- Verify system initialization
- Check configuration files
Chart Generation Failures:
- Install matplotlib and seaborn
- Check output directory permissions
- Verify data format integrity
Missing Hospital Guidelines:
- Verify customization pipeline is configured
- Check hospital document processing
- Ensure ANNOY indices are built
Error Messages
- ModuleNotFoundError: No module named 'gradio': Use the test script instead of the full system
- Interface not initialized: The OnCall.ai system needs proper setup
- No data available: Check the query execution results format
- Chart generation failed: Install visualization dependencies
Extending the System
Adding New Metrics
Extend Metrics Calculator:
def calculate_custom_metric(self, query_results):
    # Your custom metric calculation
    return custom_metrics
Add Visualization:
def generate_custom_chart(self, metrics, timestamp):
    # Your custom chart generation
    return chart_file_path
Update Evaluator:
- Include new metric in comprehensive calculation
- Add chart generation to pipeline
- Update result structure
Custom Query Sets
- Create new query JSON file following the existing format
- Modify evaluator to use custom queries:
queries = evaluator.load_test_queries("path/to/custom_queries.json")
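Since the authoritative schema lives in evaluation/queries/test_queries.json, the snippet below is only an illustration; the field names are assumptions, so mirror the fields of the existing file rather than this sketch:
[
  {
    "id": "custom_1",
    "query_type": "specific",
    "text": "What is this hospital's door-to-needle target for stroke thrombolysis?"
  }
]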
Integration with Other Systems
The evaluation system is designed to be modular and can be integrated with:
- Continuous integration pipelines
- Performance monitoring systems
- A/B testing frameworks
- Quality assurance workflows
Best Practices
- Regular Evaluation: Run evaluations after system changes
- Baseline Comparison: Track performance changes over time
- Query Diversity: Use diverse query sets for comprehensive testing
- Result Analysis: Review both metrics and visualizations
- Action on Insights: Use recommendations for system improvements
Support and Maintenance
For issues, improvements, or questions:
- Check the troubleshooting section above
- Review error messages and logs
- Test with the standalone pipeline tester
- Consult the OnCall.ai system documentation
The evaluation system is designed to be self-contained and robust, providing comprehensive insights into hospital customization performance with minimal setup requirements.