oncall-guide-ai / HOSPITAL_CUSTOMIZATION_ANALYSIS_SUMMARY.md
VanKee's picture
feat(evaluation): add comprehensive hospital customization evaluation system
550df1b

A newer version of the Gradio SDK is available: 5.44.0

Upgrade

Hospital Customization System - Tag Structure & Keyword Analysis

Executive Summary

The hospital customization system implements a sophisticated two-stage retrieval architecture with 21 medical PDFs, 134 unique medical tags, and 4,764 text chunks processed through BGE-Large-Medical embeddings and ANNOY indices.

System Architecture

Core Components

  • Embedding Model: BGE-Large-Medical (1024 dimensions)
  • Search Method: Two-stage ANNOY retrieval with angular similarity
  • Document Processing: 256-character chunks with 25-character overlap
  • Tag Structure: 134 medical concepts (symptoms + diagnoses + treatments)

Processing Pipeline

  1. Stage 1: Tag-based document filtering using medical concept embeddings
  2. Stage 2: Chunk-level retrieval within relevant documents
  3. Filtering: Top-P (0.6) + minimum similarity (0.25) thresholds

Tag Structure Analysis

Keyword Distribution

Category Count Examples
Symptoms 45 tags palpitations, dyspnea, syncope, chest pain
Diagnoses 44 tags meningitis, acute coronary syndrome, heart failure
Ambiguous/Mixed 45 tags Complex medical terms spanning categories

Frequency Patterns

  • High Frequency (3+ occurrences): palpitations, dyspnea, syncope
  • Medium Frequency (2 occurrences): chest pain, emotional distress, fever, meningitis
  • Low Frequency (1 occurrence): 121 specific medical terms

Document Coverage Analysis

Top Documents by Content Volume

  1. Chest Pain Guidelines (1,053 chunks) - Comprehensive cardiac evaluation
  2. Atrial Fibrillation Guidelines (1,047 chunks) - Complete arrhythmia management
  3. Stroke Management (703 chunks) - Acute neurological emergencies
  4. Wilson's Disease (415 chunks) - Specialized genetic condition
  5. Hereditary Angioedema (272 chunks) - Rare immune disorder

Dual Coverage (Symptoms + Diagnoses)

All 21 PDFs contain both symptom and diagnosis keywords, with top documents having:

  • Spinal Cord Emergencies: 5 symptoms, 7 diagnoses (12 total)
  • Dizziness Approach: 4 symptoms, 8 diagnoses (12 total)
  • Headache Management: 3 symptoms, 6 diagnoses (9 total)

Recommended Test Query Strategy

1. Broad Query Testing (High-Frequency Keywords)

• "palpitations" - Expected: 3 documents
• "dyspnea" - Expected: 3 documents  
• "syncope" - Expected: 3 documents
• "meningitis" - Expected: 2 documents
• "acute coronary syndrome" - Expected: 2 documents

2. Medium Specificity Testing

• "chest pain" - Expected: 2 documents
• "heart failure" - Expected: 2 documents
• "fever" - Expected: 2 documents

3. Specific Query Testing (Low-Frequency)

• "back pain" - Expected: 1 document (Spinal Cord Emergencies)
• "spinal cord compression" - Expected: 1 document
• "vertebral fracture" - Expected: 1 document

4. Combined Query Testing

• "palpitations chest pain" - Expected: Multiple documents
• "dyspnea heart failure" - Expected: Cardiac-focused results
• "fever meningitis" - Expected: Infection-focused results

5. Semantic Similarity Testing

• "emergency cardiac arrest" - Tests semantic matching beyond exact keywords
• "patient presenting with acute symptoms" - Tests broad medical query handling
• "rare genetic disorder" - Tests specialized condition retrieval

System Performance Characteristics

Expected Behavior

  • Stage 1 Filtering: Should identify 5-20 relevant tags per query
  • Document Selection: Should narrow to 2-8 relevant documents
  • Stage 2 Retrieval: Should return 3-10 high-quality chunks
  • Similarity Thresholds: 25% minimum, Top-P filtering at 60%

Quality Indicators

  • High Precision: Specific queries should return 1-2 documents
  • Good Recall: Broad queries should find all relevant documents
  • Semantic Matching: Related terms should retrieve appropriate content
  • Fallback Robustness: System should handle edge cases gracefully

Key Insights for Testing

1. Frequency-Based Test Coverage

  • Use high-frequency terms to test broad retrieval capabilities
  • Use medium-frequency terms to validate balanced precision/recall
  • Use low-frequency terms to test specific document targeting

2. Medical Domain Validation

  • BGE-Large-Medical embeddings should excel at medical concept similarity
  • System should handle medical terminology variations and synonyms
  • Diagnostic reasoning chains should be retrievable through symptom queries

3. Two-Stage Architecture Benefits

  • Tag-based filtering reduces search space efficiently
  • Chunk-level retrieval provides precise content extraction
  • Fallback mechanisms ensure robustness for edge cases

Recommendations for Query Testing

  1. Start with high-frequency keywords to validate basic system functionality
  2. Test symptom→diagnosis pathways using medically coherent combinations
  3. Validate edge cases with non-exact but semantically related queries
  4. Monitor performance metrics including precision, recall, and response times
  5. Test fallback behavior when primary retrieval fails

This analysis provides a comprehensive foundation for understanding and testing the hospital customization system's tag structure and retrieval capabilities.