# Metric 5-6 LLM Judge Evaluator Manual
## Overview
`metric5_6_llm_judge_evaluator.py` is a multi-system evaluation tool that uses Llama3-70B as a third-party judge to assess medical advice quality across different AI systems. It supports both single-system evaluation and multi-system comparison, batching all judgments into a single LLM call for maximum consistency.
## Metrics Evaluated
**Metric 5: Clinical Actionability**
- Scale: 1-10 (normalized to 0.0-1.0)
- Question: "Can healthcare providers immediately act on this advice?"
- Target: ≥7.0/10 for acceptable actionability
**Metric 6: Clinical Evidence Quality**
- Scale: 1-10 (normalized to 0.0-1.0)
- Question: "Is the advice evidence-based, and does it follow medical standards?"
- Target: ≥7.5/10 for acceptable evidence quality
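These scoring conventions map directly onto code. Below is a minimal sketch of the normalization and target checks; the function and constant names are illustrative, not taken from the evaluator itself.
```python
# Minimal sketch of score normalization and target checks.
# Names are illustrative; they are not taken from the evaluator itself.

ACTIONABILITY_TARGET = 7.0  # Metric 5 threshold on the 1-10 scale
EVIDENCE_TARGET = 7.5       # Metric 6 threshold on the 1-10 scale

def normalize(score: float) -> float:
    """Map a 1-10 judge score to the 0.0-1.0 range used in the reports."""
    return score / 10.0

def targets_met(actionability: float, evidence: float) -> dict:
    """Check both raw (1-10) scores against their acceptance thresholds."""
    return {
        "actionability_target_met": actionability >= ACTIONABILITY_TARGET,
        "evidence_target_met": evidence >= EVIDENCE_TARGET,
    }

print(normalize(8.5), targets_met(8.5, 7.8))
# 0.85 {'actionability_target_met': True, 'evidence_target_met': True}
```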
## System Architecture
### Multi-System Support
The evaluator supports flexible system combinations:
- **Single System**: `rag` or `direct`
- **Two-System Comparison**: `rag,direct`
- **Future Extension**: `rag,direct,claude,gpt4` (any combination)
### Judge LLM
- **Model**: Llama3-70B-Instruct via Hugging Face API
- **Strategy**: Single batch call for all evaluations
- **Temperature**: 0.1 (low for consistent evaluation)
- **Max Tokens**: 2048 (sufficient for evaluation responses)
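For orientation, here is a hedged sketch of what the judge call might look like through `huggingface_hub`. The model ID and the exact client usage are assumptions for illustration, not confirmed details of the script.
```python
# Sketch of a single batched judge call with the settings listed above.
# The model ID and client usage are assumptions made for illustration.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(token=os.environ["HF_TOKEN"])

def call_judge(comparison_prompt: str) -> str:
    """Send the full multi-system comparison prompt in one request."""
    response = client.chat_completion(
        model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed model ID
        messages=[{"role": "user", "content": comparison_prompt}],
        temperature=0.1,   # low temperature for consistent scoring
        max_tokens=2048,   # room for the per-query score lines
    )
    return response.choices[0].message.content
```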
## Prerequisites
### 1. Environment Setup
```bash
# Ensure HF_TOKEN is set in your environment
export HF_TOKEN="your_huggingface_token"
# Or add to .env file
echo "HF_TOKEN=your_huggingface_token" >> .env
```
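If the token lives in `.env`, loading it from Python usually looks like the sketch below (whether the evaluator uses `python-dotenv` is an assumption on my part):
```python
# Sketch of token loading; the use of python-dotenv here is an assumption.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory, if present
if not os.getenv("HF_TOKEN"):
    raise SystemExit("HF_TOKEN is missing from environment variables")
```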
### 2. Required Data Files
Before running the judge evaluator, you must have medical outputs from your systems:
**For RAG System**:
```bash
python latency_evaluator.py single_test_query.txt
# Generates: results/medical_outputs_YYYYMMDD_HHMMSS.json
```
**For Direct LLM System**:
```bash
python direct_llm_evaluator.py single_test_query.txt
# Generates: results/medical_outputs_direct_YYYYMMDD_HHMMSS.json
```
## Usage
### Command Line Interface
#### Single System Evaluation
```bash
# Evaluate RAG system only
python metric5_6_llm_judge_evaluator.py rag
# Evaluate Direct LLM system only
python metric5_6_llm_judge_evaluator.py direct
```
#### Multi-System Comparison (Recommended)
```bash
# Compare RAG vs Direct systems
python metric5_6_llm_judge_evaluator.py rag,direct
# Future: Compare multiple systems
python metric5_6_llm_judge_evaluator.py rag,direct,claude
```
### Complete Workflow Example
```bash
# Step 1: Navigate to evaluation directory
cd /path/to/GenAI-OnCallAssistant/evaluation
# Step 2: Generate medical outputs from both systems
python latency_evaluator.py single_test_query.txt
python direct_llm_evaluator.py single_test_query.txt
# Step 3: Run comparative evaluation
python metric5_6_llm_judge_evaluator.py rag,direct
```
## Output Files
### Generated Files
- **Statistics**: `results/judge_evaluation_comparison_rag_vs_direct_YYYYMMDD_HHMMSS.json`
- **Detailed Results**: Stored in the evaluator's internal results array
### File Structure
```json
{
"comparison_metadata": {
"systems_compared": ["rag", "direct"],
"comparison_type": "multi_system",
"timestamp": "2025-08-04T22:00:00"
},
"category_results": {
"diagnosis": {
"average_actionability": 0.850,
"average_evidence": 0.780,
"query_count": 1,
"actionability_target_met": true,
"evidence_target_met": true
}
},
"overall_results": {
"average_actionability": 0.850,
"average_evidence": 0.780,
"successful_evaluations": 2,
"total_queries": 2,
"actionability_target_met": true,
"evidence_target_met": true
}
}
```
## Evaluation Process
### 1. File Discovery
The evaluator automatically finds the latest medical output files:
- **RAG**: `medical_outputs_*.json`
- **Direct**: `medical_outputs_direct_*.json`
- **Custom**: `medical_outputs_{system}_*.json`
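The discovery step above amounts to a glob over `results/` that keeps the newest match per system. A sketch under those assumptions (the helper name is illustrative):
```python
# Sketch of latest-output discovery; the helper name is illustrative.
from pathlib import Path

def find_latest_outputs(system: str, results_dir: str = "results") -> Path:
    """Return the newest medical_outputs file for the given system."""
    pattern = ("medical_outputs_*.json" if system == "rag"
               else f"medical_outputs_{system}_*.json")
    candidates = list(Path(results_dir).glob(pattern))
    if system == "rag":
        # the broad RAG pattern also matches other systems' files; keep only
        # names whose segment after "medical_outputs_" is a timestamp
        candidates = [p for p in candidates
                      if p.name.removeprefix("medical_outputs_")[:1].isdigit()]
    if not candidates:
        raise FileNotFoundError(f"No medical outputs files found for {system} system")
    return max(candidates, key=lambda p: p.stat().st_mtime)  # newest by mtime
```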
### 2. Prompt Generation
For multi-system comparison, the evaluator creates a structured prompt:
```
You are a medical expert evaluating and comparing AI systems...
SYSTEM 1 (RAG): Uses medical guidelines + LLM for evidence-based advice
SYSTEM 2 (Direct): Uses LLM only without external guidelines
QUERY 1 (DIAGNOSIS):
Patient Query: 60-year-old patient with hypertension history...
SYSTEM 1 Response: For a 60-year-old patient with...
SYSTEM 2 Response: Based on the symptoms described...
RESPONSE FORMAT:
Query 1 System 1: Actionability=X, Evidence=Y
Query 1 System 2: Actionability=X, Evidence=Y
```
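Roughly, such a prompt could be assembled from the loaded output files as sketched below. The JSON field names (`category`, `query`, `medical_advice`) and the list-of-records layout are assumptions about the file schema, not confirmed details.
```python
# Sketch of comparison-prompt assembly; the field names and the assumption
# that each outputs file is a list of per-query records are illustrative.
import json
from pathlib import Path

def build_comparison_prompt(output_files: dict) -> str:
    """output_files maps system name -> path to its medical_outputs JSON."""
    outputs = {name: json.loads(Path(path).read_text())
               for name, path in output_files.items()}
    systems = list(outputs)
    lines = ["You are a medical expert evaluating and comparing AI systems..."]
    for i, name in enumerate(systems, 1):
        lines.append(f"SYSTEM {i} ({name.upper()}): ...")
    for q_idx, record in enumerate(outputs[systems[0]], 1):
        lines.append(f"QUERY {q_idx} ({record['category'].upper()}):")
        lines.append(f"Patient Query: {record['query']}")
        for i, name in enumerate(systems, 1):
            lines.append(f"SYSTEM {i} Response: {outputs[name][q_idx - 1]['medical_advice']}")
    lines.append("RESPONSE FORMAT:")
    for q_idx in range(1, len(outputs[systems[0]]) + 1):
        for i in range(1, len(systems) + 1):
            lines.append(f"Query {q_idx} System {i}: Actionability=X, Evidence=Y")
    return "\n".join(lines)
```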
### 3. LLM Judge Evaluation
- **Single API Call**: All systems evaluated in one request for consistency
- **Response Parsing**: Automatic extraction of numerical scores
- **Error Handling**: Graceful handling of parsing failures
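Score extraction from the judge response can be done with a regular expression over the expected `Query N System M: Actionability=X, Evidence=Y` lines; a minimal sketch:
```python
# Sketch of score parsing for the expected response format;
# lines that do not match are simply skipped (graceful degradation).
import re

LINE_RE = re.compile(
    r"Query\s*(\d+)\s*System\s*(\d+)\s*:\s*"
    r"Actionability\s*=\s*([\d.]+)\s*,\s*Evidence\s*=\s*([\d.]+)",
    re.IGNORECASE,
)

def parse_judge_response(text: str) -> list[dict]:
    """Return one record per parsed score line, with scores normalized."""
    return [
        {
            "query": int(q),
            "system": int(s),
            "actionability": float(a) / 10.0,  # normalize 1-10 to 0.0-1.0
            "evidence": float(e) / 10.0,
        }
        for q, s, a, e in LINE_RE.findall(text)
    ]
```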
### 4. Results Analysis
- **System-Specific Statistics**: Individual performance metrics
- **Comparative Analysis**: Direct system-to-system comparison
- **Target Compliance**: Automatic threshold checking
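Aggregation then reduces to per-system averages of the normalized scores plus the threshold checks from the metric definitions. A sketch that builds on the parsed records above:
```python
# Sketch of per-system aggregation over the parsed, normalized records.
from statistics import mean

def summarize_by_system(records: list[dict]) -> dict:
    """Average normalized scores per system and check the 7.0/7.5 targets."""
    summary = {}
    for system in sorted({r["system"] for r in records}):
        rows = [r for r in records if r["system"] == system]
        avg_act = mean(r["actionability"] for r in rows)
        avg_evi = mean(r["evidence"] for r in rows)
        summary[system] = {
            "average_actionability": round(avg_act, 3),
            "average_evidence": round(avg_evi, 3),
            "query_count": len(rows),
            "actionability_target_met": avg_act >= 0.70,  # 7.0/10 normalized
            "evidence_target_met": avg_evi >= 0.75,       # 7.5/10 normalized
        }
    return summary
```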
## Expected Output
### Console Output Example
```
OnCall.ai LLM Judge Evaluator - Metrics 5-6 Multi-System Evaluation
Multi-System Comparison: RAG vs DIRECT
Found rag outputs: results/medical_outputs_20250804_215917.json
Found direct outputs: results/medical_outputs_direct_20250804_220000.json
Comparing 2 systems with 1 queries each
Metrics: 5 (Actionability) + 6 (Evidence Quality)
Strategy: Single comparison call for maximum consistency
Multi-system comparison: rag, direct
Evaluating 1 queries across 2 systems...
Comparison prompt created (2150 characters)
Calling judge LLM for multi-system comparison...
✅ Judge LLM completed comparison evaluation in 45.3s
Response length: 145 characters
RAG: 1 evaluations parsed
DIRECT: 1 evaluations parsed
=== LLM JUDGE EVALUATION SUMMARY ===
Systems Compared: RAG vs DIRECT
Overall Performance:
  Average Actionability: 0.850 (8.5/10)
  Average Evidence Quality: 0.780 (7.8/10)
  Actionability Target (≥7.0): ✅ Met
  Evidence Target (≥7.5): ✅ Met
System Breakdown:
  RAG: Actionability=0.900, Evidence=0.850 [1 queries]
  DIRECT: Actionability=0.800, Evidence=0.710 [1 queries]
✅ LLM judge evaluation complete!
Statistics: results/judge_evaluation_comparison_rag_vs_direct_20250804_220000.json
Efficiency: 2 evaluations in 1 LLM call
```
## Key Features
### 1. Scientific Comparison Design
- **Single Judge Call**: All systems evaluated simultaneously for consistency
- **Eliminates Temporal Bias**: Same judge, same context, same standards
- **Direct System Comparison**: Side-by-side evaluation format
### 2. Flexible Architecture
- **Backward Compatible**: Single system evaluation still supported
- **Future Extensible**: Easy to add new systems (`claude`, `gpt4`, etc.)
- **Modular Design**: Clean separation of concerns
### 3. Robust Error Handling
- **File Validation**: Automatic detection of missing input files
- **Query Count Verification**: Warns if systems have different query counts
- **Graceful Degradation**: Continues operation despite partial failures
### 4. Comprehensive Reporting
- **System-Specific Metrics**: Individual performance analysis
- **Comparative Statistics**: Direct system-to-system comparison
- **Target Compliance**: Automatic benchmark checking
- **Detailed Metadata**: Full traceability of evaluation parameters
## Troubleshooting
### Common Issues
#### 1. Missing Input Files
```
❌ No medical outputs files found for rag system
Please run evaluators first:
python latency_evaluator.py single_test_query.txt
```
**Solution**: Run the prerequisite evaluators to generate medical outputs.
#### 2. HF_TOKEN Not Set
```
❌ HF_TOKEN is missing from environment variables
```
**Solution**: Set your Hugging Face token in environment or `.env` file.
#### 3. Query Count Mismatch
```
⚠️ Warning: Systems have different query counts: {'rag': 3, 'direct': 1}
```
**Solution**: Ensure both systems processed the same input file.
#### 4. LLM API Timeout
```
❌ Multi-system evaluation failed: timeout
```
**Solution**: Check internet connection and Hugging Face API status.
### Debug Tips
1. **Check File Existence**: Verify medical output files in `results/` directory
2. **Validate JSON Format**: Ensure input files are properly formatted
3. **Monitor API Usage**: Check Hugging Face account limits
4. **Review Logs**: Examine detailed logging output for specific errors
## Future Extensions
### Phase 2: Generic Multi-System Framework
```bash
# Configuration-driven system comparison
python metric5_6_llm_judge_evaluator.py --config comparison_config.json
```
### Phase 3: Unlimited System Support
```bash
# Dynamic system registration
python metric5_6_llm_judge_evaluator.py med42,claude,gpt4,palm,llama2
```
### Integration with Chart Generators
```bash
# Generate comparison visualizations
python metric5_6_llm_judge_chart_generator.py rag,direct
```
## Best Practices
1. **Consistent Test Data**: Use the same query file for all systems
2. **Sequential Execution**: Complete data collection before evaluation
3. **Batch Processing**: Use multi-system mode for scientific comparison
4. **Result Verification**: Review detailed statistics files for accuracy
5. **Performance Monitoring**: Track evaluation latency and API costs
## Scientific Validity
The multi-system comparison approach provides superior scientific validity compared to separate evaluations:
- **Eliminates Judge Variability**: Same judge evaluates all systems
- **Reduces Temporal Effects**: All evaluations in single time window
- **Ensures Consistent Standards**: Identical evaluation criteria applied
- **Enables Direct Comparison**: Side-by-side system assessment
- **Maximizes Efficiency**: Single API call vs multiple separate calls
This design makes the evaluation results more reliable for research publications and system optimization decisions.