Hospital Customization Evaluation System

This directory contains a comprehensive evaluation framework for analyzing the performance of hospital customization in the OnCall.ai RAG system. The system provides detailed metrics, visualizations, and insights specifically focused on hospital-only retrieval performance.

Overview

The Hospital Customization Evaluation System evaluates three key performance metrics:

  • Metric 1 (Latency): Total execution time and hospital customization overhead
  • Metric 3 (Relevance): Average similarity scores from hospital content
  • Metric 4 (Coverage): Keyword overlap between generated advice and hospital content

System Components

Core Modules (modules/)

1. metrics_calculator.py

The HospitalCustomizationMetrics class calculates comprehensive performance metrics:

  • Latency Analysis: Execution time breakdown, customization overhead percentage
  • Relevance Analysis: Hospital content similarity scores, relevance distribution
  • Coverage Analysis: Keyword overlap, advice completeness, medical concept coverage

Key Features:

  • Modular metric calculation for each performance dimension
  • Statistical analysis (mean, median, std dev, min/max)
  • Query type breakdown (broad/medium/specific)
  • Comprehensive medical keyword dictionary for coverage analysis
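
For illustration, here is a minimal sketch of the kind of statistical summary the calculator produces for latency. The field names (total_execution_time, query_type) are assumptions about the per-query records, not the module's actual schema.

from statistics import mean, median, stdev

def summarize_latency(query_results):
    # query_results: list of dicts; the field names here are illustrative assumptions
    times = [r["total_execution_time"] for r in query_results]
    summary = {
        "mean": mean(times),
        "median": median(times),
        "std_dev": stdev(times) if len(times) > 1 else 0.0,
        "min": min(times),
        "max": max(times),
    }
    # Break the same statistic down by query type (broad / medium / specific)
    by_type = {}
    for r in query_results:
        by_type.setdefault(r["query_type"], []).append(r["total_execution_time"])
    summary["by_query_type"] = {t: mean(v) for t, v in by_type.items()}
    return summary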

2. chart_generator.py

The HospitalCustomizationChartGenerator class creates publication-ready visualizations:

  • Latency Charts: Bar charts by query type, customization breakdown pie charts
  • Relevance Charts: Scatter plots, hospital vs general comparison charts
  • Coverage Charts: Coverage percentage bars, keyword overlap heatmaps
  • Comprehensive Dashboard: Multi-panel overview with key insights

Key Features:

  • High-resolution PNG output with consistent styling
  • Consistent color schemes and professional formatting
  • Comprehensive dashboard combining all metrics
  • Automatic chart organization and file management
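
A minimal sketch of the charting pattern used here, assuming a dict of average latencies keyed by query type; the real HospitalCustomizationChartGenerator wraps this kind of matplotlib code with consistent styling and file management.

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

def plot_latency_by_query_type(avg_latency, output_path):
    # avg_latency: e.g. {"broad": 42.1, "medium": 35.7, "specific": 28.9} (illustrative values)
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.bar(list(avg_latency.keys()), list(avg_latency.values()), color="#4C72B0")
    ax.set_xlabel("Query type")
    ax.set_ylabel("Average execution time (s)")
    ax.set_title("Latency by Query Type")
    fig.tight_layout()
    fig.savefig(output_path, dpi=300)  # high-resolution PNG output
    plt.close(fig)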

3. query_executor.py

Enhanced query execution with hospital-specific focus:

  • Hospital Only Mode: Executes queries using only hospital customization
  • Detailed Logging: Comprehensive execution metadata and timing
  • Error Handling: Robust error management with detailed reporting
  • Batch Processing: Efficient handling of multiple queries
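
The execution pattern is roughly the following; run_single_query is a hypothetical stand-in for the call the executor actually makes into the OnCall.ai system, shown only to illustrate the timing and error-handling structure.

import time

def execute_batch(queries, run_single_query, retrieval_mode="Hospital Only"):
    # run_single_query: hypothetical callable(query, retrieval_mode) -> result dict
    results = []
    for query in queries:
        start = time.time()
        try:
            result = run_single_query(query, retrieval_mode)
            result["success"] = True
        except Exception as exc:  # capture failures with detailed reporting
            result = {"success": False, "error": str(exc)}
        result["query"] = query
        result["total_execution_time"] = time.time() - start
        results.append(result)
    return results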

Evaluation Scripts

1. hospital_customization_evaluator.py

Main evaluation orchestrator that:

  • Coordinates all evaluation components
  • Executes 6 test queries in Hospital Only mode
  • Calculates comprehensive metrics
  • Generates visualization charts
  • Saves detailed results and reports
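
Conceptually, the orchestration looks like the sketch below. It reuses the module entry points shown later in this README; the output paths are assumptions, and the evaluator's actual method names may differ.

import json
from datetime import datetime
from evaluation.modules.query_executor import QueryExecutor
from evaluation.modules.metrics_calculator import HospitalCustomizationMetrics
from evaluation.modules.chart_generator import HospitalCustomizationChartGenerator

def run_evaluation():
    # 1. Execute the test queries in Hospital Only mode
    executor = QueryExecutor()
    queries = executor.load_queries("evaluation/queries/test_queries.json")
    results = executor.execute_batch(queries, retrieval_mode="Hospital Only")

    # 2. Calculate latency / relevance / coverage metrics
    metrics = HospitalCustomizationMetrics().calculate_comprehensive_metrics(results)

    # 3. Generate visualization charts
    chart_gen = HospitalCustomizationChartGenerator("evaluation/results/charts")
    charts = chart_gen.generate_latency_charts(metrics)

    # 4. Save the combined results under a timestamped filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output = f"evaluation/results/hospital_customization_evaluation_{timestamp}.json"
    with open(output, "w") as f:
        json.dump({"metrics": metrics, "charts": charts}, f, indent=2, default=str)
    return output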

2. test_hospital_customization_pipeline.py

Standalone testing script that:

  • Tests core modules without full system dependencies
  • Uses sample data to validate functionality
  • Generates test charts and metrics
  • Verifies pipeline integrity

3. run_hospital_evaluation.py

Simple runner script for easy evaluation execution:

  • User-friendly interface for running evaluations
  • Clear error messages and troubleshooting tips
  • Result summary and next steps guidance

Usage Instructions

Quick Start

  1. Basic Evaluation:

    python evaluation/run_hospital_evaluation.py
    
  2. Component Testing:

    python evaluation/test_hospital_customization_pipeline.py
    

Advanced Usage

Direct Module Usage

from evaluation.modules.metrics_calculator import HospitalCustomizationMetrics
from evaluation.modules.chart_generator import HospitalCustomizationChartGenerator

# Calculate metrics (query_results is the list of per-query records
# produced by the query executor, as shown in the next example)
calculator = HospitalCustomizationMetrics()
metrics = calculator.calculate_comprehensive_metrics(query_results)

# Generate latency charts into the chosen output directory
chart_gen = HospitalCustomizationChartGenerator("output/charts")
chart_files = chart_gen.generate_latency_charts(metrics)

Custom Query Execution

from evaluation.modules.query_executor import QueryExecutor

executor = QueryExecutor()
queries = executor.load_queries("evaluation/queries/test_queries.json")
results = executor.execute_batch(queries, retrieval_mode="Hospital Only")

Prerequisites

  1. System Requirements:

    • Python 3.8+
    • OnCall.ai RAG system properly configured
    • Hospital customization pipeline functional
  2. Dependencies:

    • matplotlib, seaborn (for chart generation)
    • numpy (for statistical calculations)
    • Standard Python libraries (json, pathlib, datetime, etc.)
  3. Environment Setup:

    source rag_env/bin/activate  # Activate virtual environment
    pip install matplotlib seaborn numpy  # Install visualization dependencies
    

Output Structure

Results Directory (results/)

After running an evaluation, the following files are generated:

results/
├── hospital_customization_evaluation_YYYYMMDD_HHMMSS.json  # Complete results
├── hospital_customization_summary_YYYYMMDD_HHMMSS.txt      # Human-readable summary
└── charts/
    ├── latency_by_query_type_YYYYMMDD_HHMMSS.png
    ├── customization_breakdown_YYYYMMDD_HHMMSS.png
    ├── relevance_scatter_plot_YYYYMMDD_HHMMSS.png
    ├── hospital_vs_general_comparison_YYYYMMDD_HHMMSS.png
    ├── coverage_percentage_YYYYMMDD_HHMMSS.png
    └── hospital_customization_dashboard_YYYYMMDD_HHMMSS.png

Results File Structure

The comprehensive results JSON contains:

{
  "evaluation_metadata": {
    "timestamp": "2025-08-05T15:30:00.000000",
    "evaluation_type": "hospital_customization",
    "retrieval_mode": "Hospital Only",
    "total_queries": 6,
    "successful_queries": 6
  },
  "query_execution_results": {
    "raw_results": [...],
    "execution_summary": {...}
  },
  "hospital_customization_metrics": {
    "metric_1_latency": {...},
    "metric_3_relevance": {...},
    "metric_4_coverage": {...},
    "summary": {...}
  },
  "visualization_charts": {...},
  "evaluation_insights": [...],
  "recommendations": [...]
}
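
To inspect a saved results file programmatically, a short sketch like the following works; adjust results_dir to wherever your run wrote its output.

import json
from pathlib import Path

# Pick the newest results file (timestamped names sort chronologically)
results_dir = Path("evaluation/results")
latest = max(results_dir.glob("hospital_customization_evaluation_*.json"))
data = json.loads(latest.read_text())

meta = data["evaluation_metadata"]
print(f"{meta['successful_queries']}/{meta['total_queries']} queries succeeded")
print("Latency metrics:", data["hospital_customization_metrics"]["metric_1_latency"])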

Key Metrics Explained

Metric 1: Latency Analysis

  • Total Execution Time: Complete query processing duration
  • Customization Time: Time spent on hospital-specific processing
  • Customization Percentage: Hospital processing as % of total time
  • Query Type Breakdown: Performance by query specificity

Metric 3: Relevance Analysis

  • Hospital Content Relevance: Average similarity scores for hospital guidelines
  • Relevance Distribution: Low/Medium/High relevance score breakdown
  • Hospital vs General: Comparison between content types
  • Quality Assessment: Overall relevance quality rating

Metric 4: Coverage Analysis

  • Keyword Overlap: Percentage of medical keywords covered in advice
  • Advice Completeness: Structural completeness assessment
  • Medical Concept Coverage: Coverage of key medical concepts
  • Coverage Patterns: Analysis of coverage effectiveness
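
As a concrete example of the keyword-overlap idea: coverage is essentially the number of expected medical keywords found in the generated advice, divided by the number expected. The keyword list below is illustrative, not the module's actual medical dictionary.

def keyword_coverage(advice_text, expected_keywords):
    # expected_keywords: medical terms the advice should mention (illustrative)
    text = advice_text.lower()
    matched = [kw for kw in expected_keywords if kw.lower() in text]
    pct = 100.0 * len(matched) / len(expected_keywords) if expected_keywords else 0.0
    return {"matched_keywords": matched, "coverage_pct": pct}

# Example: 3 of 4 keywords found -> 75% coverage ("Comprehensive" per the benchmarks below)
example = keyword_coverage(
    "Administer aspirin, obtain an ECG, and start heparin per protocol.",
    ["aspirin", "ECG", "heparin", "troponin"],
)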

Performance Benchmarks

Latency Performance Levels

  • Excellent: < 30 seconds average execution time
  • Good: 30-60 seconds average execution time
  • Needs Improvement: > 60 seconds average execution time

Relevance Quality Levels

  • High: > 0.7 average relevance score
  • Medium: 0.4-0.7 average relevance score
  • Low: < 0.4 average relevance score

Coverage Effectiveness Levels

  • Comprehensive: > 70% keyword coverage
  • Adequate: 40-70% keyword coverage
  • Limited: < 40% keyword coverage
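
A small helper that maps averaged results onto these levels (a sketch; the evaluator's own quality ratings may be computed differently):

def classify_performance(avg_latency_s, avg_relevance, coverage_pct):
    # Thresholds taken from the benchmark tables above
    if avg_latency_s < 30:
        latency = "Excellent"
    elif avg_latency_s <= 60:
        latency = "Good"
    else:
        latency = "Needs Improvement"

    if avg_relevance > 0.7:
        relevance = "High"
    elif avg_relevance >= 0.4:
        relevance = "Medium"
    else:
        relevance = "Low"

    if coverage_pct > 70:
        coverage = "Comprehensive"
    elif coverage_pct >= 40:
        coverage = "Adequate"
    else:
        coverage = "Limited"

    return {"latency": latency, "relevance": relevance, "coverage": coverage}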

Troubleshooting

Common Issues

  1. Import Errors:

    • Ensure virtual environment is activated
    • Install missing dependencies
    • Check Python path configuration
  2. OnCall.ai System Not Available:

    • Use test_hospital_customization_pipeline.py for testing
    • Verify system initialization
    • Check configuration files
  3. Chart Generation Failures:

    • Install matplotlib and seaborn
    • Check output directory permissions
    • Verify data format integrity
  4. Missing Hospital Guidelines:

    • Verify customization pipeline is configured
    • Check hospital document processing
    • Ensure ANNOY indices are built

Error Messages

  • ModuleNotFoundError: No module named 'gradio': Use test script instead of full system
  • Interface not initialized: OnCall.ai system needs proper setup
  • No data available: Check query execution results format
  • Chart generation failed: Install visualization dependencies

Extending the System

Adding New Metrics

  1. Extend Metrics Calculator:

    def calculate_custom_metric(self, query_results):
        # Compute the custom metric from the raw query results
        custom_metrics = {}
        return custom_metrics
    
  2. Add Visualization:

    def generate_custom_chart(self, metrics, timestamp):
        # Build the custom chart and save it under a timestamped filename
        chart_file_path = f"custom_chart_{timestamp}.png"
        return chart_file_path
    
  3. Update Evaluator:

    • Include new metric in comprehensive calculation
    • Add chart generation to pipeline
    • Update result structure
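
A sketch of how the new metric and chart might be wired into the evaluator's result structure. The surrounding variables (metrics_calculator, chart_generator, query_results, timestamp, results) and the "metric_5_custom" key are assumptions for illustration.

# Inside the evaluator's pipeline (sketch; surrounding objects are assumed to exist)
custom_metrics = metrics_calculator.calculate_custom_metric(query_results)
custom_chart = chart_generator.generate_custom_chart(custom_metrics, timestamp)

results["hospital_customization_metrics"]["metric_5_custom"] = custom_metrics
results["visualization_charts"]["custom_chart"] = custom_chart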

Custom Query Sets

  1. Create new query JSON file following the existing format
  2. Modify evaluator to use custom queries:
    queries = evaluator.load_test_queries("path/to/custom_queries.json")
    
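
The authoritative schema is whatever evaluation/queries/test_queries.json already uses; purely as an illustration of the idea (field names here are hypothetical), a custom query file might look like:

[
  {
    "id": "custom_001",
    "query_type": "specific",
    "query": "What is this hospital's protocol for suspected acute ischemic stroke?"
  }
]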

Integration with Other Systems

The evaluation system is designed to be modular and can be integrated with:

  • Continuous integration pipelines
  • Performance monitoring systems
  • A/B testing frameworks
  • Quality assurance workflows

Best Practices

  1. Regular Evaluation: Run evaluations after system changes
  2. Baseline Comparison: Track performance changes over time
  3. Query Diversity: Use diverse query sets for comprehensive testing
  4. Result Analysis: Review both metrics and visualizations
  5. Action on Insights: Use recommendations for system improvements

Support and Maintenance

For issues, improvements, or questions:

  1. Check the troubleshooting section above
  2. Review error messages and logs
  3. Test with the standalone pipeline tester
  4. Consult the OnCall.ai system documentation

The evaluation system is designed to be self-contained and robust, providing comprehensive insights into hospital customization performance with minimal setup requirements.