# Model Trace - Hugging Face Space Explanation

## Overview

This repository hosts a **Hugging Face Space** that creates a dynamic leaderboard for evaluating language models. The space provides a web interface where users can submit models for evaluation and view results in a ranked leaderboard format.

## How It Works

### Architecture

The system consists of several key components:

1. **Frontend Interface** (`app.py`): A Gradio web application with three main tabs:
   - **πŸ… LLM Benchmark**: Displays the main leaderboard
   - **πŸ“ About**: Shows information about the evaluation process
   - **πŸš€ Submit here!**: Allows users to submit models for evaluation

2. **Data Storage**: Uses Hugging Face datasets to store:
   - **Evaluation Requests**: Models waiting to be evaluated
   - **Evaluation Results**: Completed evaluation results

3. **Evaluation Queue System**: Models go through different states (see the example record after this list):
   - **PENDING**: Submitted but not yet evaluated
   - **RUNNING**: Currently being evaluated
   - **FINISHED**: Evaluation completed
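
For orientation, each queue entry is a small JSON record whose `status` field moves through these states. The exact schema depends on the template, so the record below is illustrative; only `model`, `revision`, `precision`, and `status` are implied by this document, the rest is assumed:

```python
# Illustrative queue entry (schema assumed; check the requests dataset for the real fields).
example_request = {
    "model": "org/model-name",
    "revision": "main",
    "precision": "float16",
    "status": "PENDING",  # becomes "RUNNING", then "FINISHED"
}
```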

### Data Flow

1. **Model Submission**: Users submit models through the web interface
2. **Validation**: The system checks that the model exists on the Hugging Face Hub and has the required metadata (a minimal check is sketched after this list)
3. **Queue Management**: Valid models are added to the evaluation queue
4. **Evaluation**: External evaluation system processes the models (not included in this repo)
5. **Results Display**: Completed evaluations appear in the leaderboard
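
The validation step can be as small as an existence check against the Hub. A minimal sketch (the real Space performs richer checks on config, tokenizer, and license metadata):

```python
from huggingface_hub import HfApi

def is_model_on_hub(model_name: str, revision: str = "main") -> bool:
    """Return True if the model repo exists on the Hub at the given revision."""
    try:
        HfApi().model_info(model_name, revision=revision)
        return True
    except Exception:
        # Missing repo, gated/private model, or network error all count as "not usable" here.
        return False
```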

### Configuration

The main configuration files are:

- **`src/envs.py`**: Repository settings and API tokens (sketched below)
- **`src/about.py`**: Task definitions and leaderboard metadata
- **`src/display/utils.py`**: Column definitions and display settings
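
Of these, `src/envs.py` is what the snippets later in this document import from. A sketch of the kind of settings it holds (names follow the standard demo-leaderboard template and may differ in your copy):

```python
import os
from huggingface_hub import HfApi

OWNER = "your-org-name"                  # Hub org that owns the Space and the two datasets
TOKEN = os.environ.get("HF_TOKEN")       # write token, injected as a Space secret
QUEUE_REPO = f"{OWNER}/requests"         # dataset holding evaluation requests
RESULTS_REPO = f"{OWNER}/results"        # dataset holding evaluation results
EVAL_RESULTS_PATH = "./eval-results"     # local clone/cache of the results dataset
API = HfApi(token=TOKEN)
```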

## Current Evaluation Tasks

The system is currently configured to evaluate models on:
- **ANLI** (Adversarial NLI) - accuracy metric
- **LogiQA** - normalized accuracy metric

## Adding Dynamic Perplexity Testing

To add perplexity evaluation as a dynamic test, you'll need to make several modifications:

### 1. Update Task Configuration

First, modify `src/about.py` to add perplexity as a new task:

```python
class Tasks(Enum):
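    # Task(benchmark, metric, col_name): results-file key, metric key, leaderboard column label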
    # Existing tasks
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")
    # Add perplexity task
    task2 = Task("perplexity", "perplexity", "Perplexity")
```

### 2. Create Perplexity Evaluation Script

Create a new file `src/evaluation/perplexity_eval.py`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def evaluate_perplexity(model_name, revision="main", test_text=None):
    """
    Evaluate perplexity on a fixed piece of text.
    
    Args:
        model_name: Hugging Face model identifier
        revision: Model revision/commit hash
        test_text: Text to evaluate perplexity on (default if None)
    
    Returns:
        float: Perplexity score (lower is better)
    """
    
    # Default test text if none provided
    if test_text is None:
        test_text = """The quick brown fox jumps over the lazy dog. This is a standard test sentence that contains all the letters of the English alphabet. It is commonly used for testing fonts and keyboards."""
    
    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(
        model_name, 
        revision=revision,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
    
    # Tokenize the text
    inputs = tokenizer(test_text, return_tensors="pt")
    
    # Move to same device as model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    # Calculate loss
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
    
    # Calculate perplexity
    perplexity = torch.exp(loss).item()
    
    return perplexity

def create_perplexity_result(model_name, revision, precision, perplexity_score):
    """
    Create a result file in the expected format.
    """
    return {
        "config": {
            "model_dtype": f"torch.{precision}",
            "model_name": model_name,
            "model_sha": revision,
        },
        "results": {
            "perplexity": {
                "perplexity": perplexity_score,
            }
        }
    }
```
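
For a quick local sanity check before wiring this into the Space, the function can be called directly, e.g. with the smallest model from the list at the end of this document:

```python
# Downloads the model on first run; gpt2 keeps this small.
score = evaluate_perplexity("openai-community/gpt2")
print(f"Perplexity on the default text: {score:.2f}")
```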

### 3. Add Dynamic Evaluation Endpoint

Create a new file `src/evaluation/dynamic_eval.py`:

```python
import json
import os
from datetime import datetime
from src.evaluation.perplexity_eval import evaluate_perplexity, create_perplexity_result
from src.envs import EVAL_RESULTS_PATH, API, RESULTS_REPO

def run_dynamic_perplexity_eval(model_name, revision="main", precision="float16"):
    """
    Run perplexity evaluation and save results.
    """
    try:
        # Run evaluation
        perplexity_score = evaluate_perplexity(model_name, revision)
        
        # Create result structure
        result = create_perplexity_result(model_name, revision, precision, perplexity_score)
        
        # Save result file
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        result_filename = f"results_{model_name.replace('/', '_')}_{timestamp}.json"
        
        # Create directory structure
        org, model = model_name.split("/") if "/" in model_name else ("", model_name)
        result_dir = os.path.join(EVAL_RESULTS_PATH, org) if org else EVAL_RESULTS_PATH
        os.makedirs(result_dir, exist_ok=True)
        
        result_path = os.path.join(result_dir, result_filename)
        
        with open(result_path, "w") as f:
            json.dump(result, f, indent=2)
        
        # Upload to Hugging Face dataset
        API.upload_file(
            path_or_fileobj=result_path,
            path_in_repo=os.path.relpath(result_path, EVAL_RESULTS_PATH),  # path relative to the local results root
            repo_id=RESULTS_REPO,
            repo_type="dataset",
            commit_message=f"Add perplexity results for {model_name}",
        )
        
        return True, perplexity_score
        
    except Exception as e:
        return False, str(e)
```

### 4. Add Dynamic Testing Interface

Modify `app.py` to add a new tab for dynamic testing:

```python
# Add this import
from src.evaluation.dynamic_eval import run_dynamic_perplexity_eval

# Add this function
def run_perplexity_test(model_name, revision, precision):
    """Run perplexity evaluation on demand."""
    if not model_name:
        return "Please enter a model name."
    
    success, result = run_dynamic_perplexity_eval(model_name, revision, precision)
    
    if success:
        return f"βœ… Perplexity evaluation completed!\nPerplexity: {result:.4f}\n\nResults have been saved and will appear in the leaderboard shortly."
    else:
        return f"❌ Evaluation failed: {result}"

# Add this to the demo interface (inside the existing gr.Tabs context within gr.Blocks)
with gr.TabItem("🧪 Dynamic Testing", elem_id="dynamic-testing-tab", id=4):
    gr.Markdown("## Run Perplexity Evaluation")
    
    with gr.Row():
        with gr.Column():
            dynamic_model_name = gr.Textbox(label="Model name", placeholder="org/model-name")
            dynamic_revision = gr.Textbox(label="Revision", placeholder="main", value="main")
            dynamic_precision = gr.Dropdown(
                choices=["float16", "bfloat16"],
                label="Precision",
                value="float16"
            )
        
        with gr.Column():
            dynamic_test_button = gr.Button("🚀 Run Perplexity Test", variant="primary")
            dynamic_result = gr.Markdown()
    
    dynamic_test_button.click(
        run_perplexity_test,
        [dynamic_model_name, dynamic_revision, dynamic_precision],
        dynamic_result
    )
```

### 5. Update Requirements

Add any additional dependencies to `requirements.txt`:

```txt
# Add if not already present
torch
transformers
accelerate
```

### 6. Configure Environment

Update `src/envs.py` to point to your repositories:

```python
OWNER = "your-org-name"  # Change this
```

You'll need to create two Hugging Face datasets (a creation sketch follows the list):
- `your-org-name/requests` - for evaluation requests
- `your-org-name/results` - for evaluation results
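
Both datasets can be created once from a script or notebook; a minimal sketch using `huggingface_hub` (repo names as above, adjust privacy as needed):

```python
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # placeholder: a token with write access
for name in ("requests", "results"):
    api.create_repo(repo_id=f"your-org-name/{name}", repo_type="dataset", exist_ok=True)
```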

## How to Use the Dynamic Testing

1. **Deploy the Space**: Push your changes to a Hugging Face Space
2. **Set Environment Variables**: Add `HF_TOKEN` with write permissions
3. **Test Models**: Use the "Dynamic Testing" tab to evaluate models on demand
4. **View Results**: Results will appear in the main leaderboard

## Key Features of Dynamic Testing

- **On-Demand Evaluation**: Test models immediately without queue
- **Fixed Text**: Uses consistent test text for fair comparison
- **Automatic Ranking**: Lower perplexity indicates a better fit; note that the stock leaderboard treats metrics as higher-is-better, so the perplexity column may need an inverted sort or a display tweak
- **Real-time Results**: See results immediately after evaluation
- **Integration**: Results automatically appear in the main leaderboard

## Customization Options

You can customize the perplexity evaluation by:

1. **Changing Test Text**: Modify the default text in `perplexity_eval.py`
2. **Adding Multiple Texts**: Evaluate on multiple texts and average the results (see the sketch after this list)
3. **Different Metrics**: Add other metrics like BLEU, ROUGE, etc.
4. **Model Loading Options**: Customize model loading parameters
5. **Batch Processing**: Process multiple models in sequence
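
For option 2, a small wrapper around the existing `evaluate_perplexity` is enough (a sketch; note it reloads the model for every text, so cache the model and tokenizer if the text list grows):

```python
from src.evaluation.perplexity_eval import evaluate_perplexity

def evaluate_perplexity_multi(model_name, texts, revision="main"):
    """Average perplexity over several texts for a more stable score."""
    scores = [evaluate_perplexity(model_name, revision=revision, test_text=t) for t in texts]
    return sum(scores) / len(scores)
```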

## Security Considerations

- Models must be public on Hugging Face Hub
- Evaluation runs in the Space's environment
- Results are publicly visible
- Consider rate limiting for dynamic testing (a minimal approach is sketched below)
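
For the last point, a very light-weight cooldown in `app.py` is one option (a sketch; Gradio's queueing and concurrency settings are an alternative):

```python
import time

_LAST_RUN = 0.0
MIN_INTERVAL_S = 60  # assumed policy: at most one dynamic evaluation per minute

def rate_limited_perplexity_test(model_name, revision, precision):
    """Wrap run_perplexity_test so the button cannot be spammed."""
    global _LAST_RUN
    now = time.monotonic()
    if now - _LAST_RUN < MIN_INTERVAL_S:
        return "Please wait a minute before launching another evaluation."
    _LAST_RUN = now
    return run_perplexity_test(model_name, revision, precision)
```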

This setup provides a complete dynamic testing system that integrates seamlessly with the existing leaderboard infrastructure. 

## Models to Test

- `openai-community/gpt2`
- `EleutherAI/gpt-neo-1.3B`
- `openai-community/gpt2-large`
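
To exercise the pipeline on these models without the UI, a small driver script could look like the sketch below (run it inside the Space's environment so `src.envs` resolves):

```python
from src.evaluation.dynamic_eval import run_dynamic_perplexity_eval

MODELS_TO_TEST = [
    "openai-community/gpt2",
    "EleutherAI/gpt-neo-1.3B",
    "openai-community/gpt2-large",
]

for model in MODELS_TO_TEST:
    ok, result = run_dynamic_perplexity_eval(model)
    if ok:
        print(f"{model}: perplexity={result:.4f}")
    else:
        print(f"{model}: failed ({result})")
```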