Model Trace - Hugging Face Space Explanation
Overview
This repository hosts a Hugging Face Space that creates a dynamic leaderboard for evaluating language models. The space provides a web interface where users can submit models for evaluation and view results in a ranked leaderboard format.
How It Works
Architecture
The system consists of several key components:
Frontend Interface (app.py): A Gradio web application with three main tabs:
- 🏅 LLM Benchmark: Displays the main leaderboard
- 📝 About: Shows information about the evaluation process
- 🚀 Submit here!: Allows users to submit models for evaluation
Data Storage: Uses Hugging Face datasets to store:
- Evaluation Requests: Models waiting to be evaluated
- Evaluation Results: Completed evaluation results
Evaluation Queue System: Models go through different states (an example request record is sketched after this list):
- PENDING: Submitted but not yet evaluated
- RUNNING: Currently being evaluated
- FINISHED: Evaluation completed
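For context, each entry in the requests dataset is a small JSON file whose status field drives this queue. The exact schema depends on this Space's submission code; a plausible record, following the common demo-leaderboard layout (field names here are assumptions), looks roughly like this:

```json
{
  "model": "org/model-name",
  "revision": "main",
  "precision": "float16",
  "status": "PENDING",
  "submitted_time": "2024-06-01T12:00:00Z"
}
```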
Data Flow
- Model Submission: Users submit models through the web interface
- Validation: The system checks that the model exists on the Hugging Face Hub and has the required metadata (a check of this kind is sketched after this list)
- Queue Management: Valid models are added to the evaluation queue
- Evaluation: External evaluation system processes the models (not included in this repo)
- Results Display: Completed evaluations appear in the leaderboard
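The validation step lives in the Space's submission code; as a rough illustration of the kind of check involved (not this repo's exact logic), a model can be verified against the Hub like this:

```python
from huggingface_hub import HfApi

def model_exists_on_hub(model_name: str, revision: str = "main") -> bool:
    """Return True if the model repo is reachable on the Hub at the given revision."""
    try:
        HfApi().model_info(model_name, revision=revision)
        return True
    except Exception:
        # Missing repo, private repo, or unknown revision
        return False
```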
Configuration
The main configuration files are:
- src/envs.py: Repository settings and API tokens (a rough sketch follows this list)
- src/about.py: Task definitions and leaderboard metadata
- src/display/utils.py: Column definitions and display settings
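For orientation, src/envs.py in this style of leaderboard Space typically looks roughly like the sketch below; the names follow the common demo-leaderboard template and are assumptions rather than a copy of this repo's file:

```python
import os
from huggingface_hub import HfApi

# Write token with access to the requests/results datasets
TOKEN = os.environ.get("HF_TOKEN")

OWNER = "your-org-name"              # organization that owns the Space and datasets
QUEUE_REPO = f"{OWNER}/requests"     # pending evaluation requests
RESULTS_REPO = f"{OWNER}/results"    # completed evaluation results

# Local working copies of the two datasets
EVAL_REQUESTS_PATH = "eval-queue"
EVAL_RESULTS_PATH = "eval-results"

API = HfApi(token=TOKEN)
```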
Current Evaluation Tasks
The system is currently configured to evaluate models on:
- ANLI (Adversarial NLI) - accuracy metric
- LogiQA - normalized accuracy metric
Adding Dynamic Perplexity Testing
To add perplexity evaluation as a dynamic test, you'll need to make several modifications:
1. Update Task Configuration
First, modify src/about.py to add perplexity as a new task:
```python
class Tasks(Enum):
    # Existing tasks
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")
    # Add perplexity task
    task2 = Task("perplexity", "perplexity", "Perplexity")
```
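Each Task bundles the benchmark key to look up in the result files, the metric name inside that entry, and the column header shown in the leaderboard. In the demo-leaderboard template this is a small dataclass; if your about.py defines it differently, adapt the addition above accordingly. A sketch of the assumed definition:

```python
from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str   # key under "results" in each result JSON
    metric: str      # metric name inside that entry
    col_name: str    # column header shown in the leaderboard
```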
2. Create Perplexity Evaluation Script
Create a new file src/evaluation/perplexity_eval.py:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np


def evaluate_perplexity(model_name, revision="main", test_text=None):
    """
    Evaluate perplexity on a fixed piece of text.

    Args:
        model_name: Hugging Face model identifier
        revision: Model revision/commit hash
        test_text: Text to evaluate perplexity on (default if None)

    Returns:
        float: Perplexity score (lower is better)
    """
    # Default test text if none provided
    if test_text is None:
        test_text = """The quick brown fox jumps over the lazy dog. This is a standard test sentence that contains all the letters of the English alphabet. It is commonly used for testing fonts and keyboards."""

    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        revision=revision,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)

    # Tokenize the text
    inputs = tokenizer(test_text, return_tensors="pt")

    # Move to same device as model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Calculate loss
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

    # Calculate perplexity
    perplexity = torch.exp(loss).item()
    return perplexity


def create_perplexity_result(model_name, revision, precision, perplexity_score):
    """
    Create a result file in the expected format.
    """
    return {
        "config": {
            "model_dtype": f"torch.{precision}",
            "model_name": model_name,
            "model_sha": revision,
        },
        "results": {
            "perplexity": {
                "perplexity": perplexity_score,
            }
        },
    }
```
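A quick local sanity check of evaluate_perplexity might look like this (the model name is only an example):

```python
if __name__ == "__main__":
    score = evaluate_perplexity("openai-community/gpt2")
    print(f"Perplexity: {score:.2f}")
```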
3. Add Dynamic Evaluation Endpoint
Create a new file src/evaluation/dynamic_eval.py:
```python
import json
import os
from datetime import datetime

from src.evaluation.perplexity_eval import evaluate_perplexity, create_perplexity_result
from src.envs import EVAL_RESULTS_PATH, API, RESULTS_REPO


def run_dynamic_perplexity_eval(model_name, revision="main", precision="float16"):
    """
    Run perplexity evaluation and save results.
    """
    try:
        # Run evaluation
        perplexity_score = evaluate_perplexity(model_name, revision)

        # Create result structure
        result = create_perplexity_result(model_name, revision, precision, perplexity_score)

        # Save result file
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        result_filename = f"results_{model_name.replace('/', '_')}_{timestamp}.json"

        # Create directory structure
        org, model = model_name.split("/") if "/" in model_name else ("", model_name)
        result_dir = os.path.join(EVAL_RESULTS_PATH, org) if org else EVAL_RESULTS_PATH
        os.makedirs(result_dir, exist_ok=True)

        result_path = os.path.join(result_dir, result_filename)
        with open(result_path, "w") as f:
            json.dump(result, f, indent=2)

        # Upload to Hugging Face dataset
        API.upload_file(
            path_or_fileobj=result_path,
            path_in_repo=result_path.split("eval-results/")[1],
            repo_id=RESULTS_REPO,
            repo_type="dataset",
            commit_message=f"Add perplexity results for {model_name}",
        )

        return True, perplexity_score
    except Exception as e:
        return False, str(e)
```
4. Add Dynamic Testing Interface
Modify app.py to add a new tab for dynamic testing:
```python
# Add this import
from src.evaluation.dynamic_eval import run_dynamic_perplexity_eval


# Add this function
def run_perplexity_test(model_name, revision, precision):
    """Run perplexity evaluation on demand."""
    if not model_name:
        return "Please enter a model name."

    success, result = run_dynamic_perplexity_eval(model_name, revision, precision)

    if success:
        return (
            "✅ Perplexity evaluation completed!\n"
            f"Perplexity: {result:.4f}\n\n"
            "Results have been saved and will appear in the leaderboard shortly."
        )
    else:
        return f"❌ Evaluation failed: {result}"


# Add this to the demo interface (inside the gr.Blocks)
with gr.TabItem("🧪 Dynamic Testing", elem_id="dynamic-testing-tab", id=4):
    gr.Markdown("## Run Perplexity Evaluation")

    with gr.Row():
        with gr.Column():
            dynamic_model_name = gr.Textbox(label="Model name", placeholder="org/model-name")
            dynamic_revision = gr.Textbox(label="Revision", placeholder="main", value="main")
            dynamic_precision = gr.Dropdown(
                choices=["float16", "bfloat16"],
                label="Precision",
                value="float16",
            )
        with gr.Column():
            dynamic_test_button = gr.Button("🚀 Run Perplexity Test", variant="primary")
            dynamic_result = gr.Markdown()

    dynamic_test_button.click(
        run_perplexity_test,
        [dynamic_model_name, dynamic_revision, dynamic_precision],
        dynamic_result,
    )
```
5. Update Requirements
Add any additional dependencies to requirements.txt:
```
# Add if not already present
torch
transformers
accelerate
```
6. Configure Environment
Update src/envs.py to point to your repositories:
```python
OWNER = "your-org-name"  # Change this
```
You'll need to create two Hugging Face datasets:
- your-org-name/requests - for evaluation requests
- your-org-name/results - for evaluation results
How to Use the Dynamic Testing
- Deploy the Space: Push your changes to a Hugging Face Space
- Set Environment Variables: Add HF_TOKEN with write permissions
- Test Models: Use the "Dynamic Testing" tab to evaluate models on demand
- View Results: Results will appear in the main leaderboard
Key Features of Dynamic Testing
- On-Demand Evaluation: Test models immediately, without going through the queue
- Fixed Text: Uses consistent test text for fair comparison
- Automatic Ranking: Lower perplexity scores rank higher
- Real-time Results: See results immediately after evaluation
- Integration: Results automatically appear in the main leaderboard
Customization Options
You can customize the perplexity evaluation by:
- Changing Test Text: Modify the default text in perplexity_eval.py
- Adding Multiple Texts: Evaluate on multiple texts and average the results (see the sketch after this list)
- Different Metrics: Add other metrics like BLEU, ROUGE, etc.
- Model Loading Options: Customize model loading parameters
- Batch Processing: Process multiple models in sequence
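As a sketch of the multiple-texts option, you could wrap evaluate_perplexity in a small helper that averages over several passages (note that this simple version reloads the model once per text; for many passages you would want to load it once and reuse it):

```python
from src.evaluation.perplexity_eval import evaluate_perplexity

def evaluate_mean_perplexity(model_name, texts, revision="main"):
    """Average perplexity over several test passages for a more stable score."""
    # Reloads the model for every passage; fine for a handful of texts.
    scores = [evaluate_perplexity(model_name, revision, test_text=text) for text in texts]
    return sum(scores) / len(scores)
```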
Security Considerations
- Models must be public on Hugging Face Hub
- Evaluation runs in the Space's environment
- Results are publicly visible
- Consider rate limiting for dynamic testing (a minimal sketch follows)
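A minimal way to rate-limit the dynamic tab is a global cooldown around the Gradio handler; this sketch assumes the run_perplexity_test function added in step 4 and a window you would tune to your hardware:

```python
import time

_last_run = 0.0
MIN_INTERVAL_SECONDS = 300  # allow at most one dynamic evaluation every 5 minutes

def run_perplexity_test_throttled(model_name, revision, precision):
    """Wrap run_perplexity_test with a simple global cooldown."""
    global _last_run
    now = time.time()
    if now - _last_run < MIN_INTERVAL_SECONDS:
        wait = int(MIN_INTERVAL_SECONDS - (now - _last_run))
        return f"Please wait about {wait} seconds before starting another evaluation."
    _last_run = now
    return run_perplexity_test(model_name, revision, precision)
```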
This setup provides a complete dynamic testing system that integrates seamlessly with the existing leaderboard infrastructure.
MODELS TO TEST:
- openai-community/gpt2
- EleutherAI/gpt-neo-1.3B
- openai-community/gpt2-large
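To push the listed models through the dynamic evaluation in one go, a small driver script could reuse run_dynamic_perplexity_eval from step 3 (a sketch, meant to be run inside the Space's environment):

```python
from src.evaluation.dynamic_eval import run_dynamic_perplexity_eval

MODELS_TO_TEST = [
    "openai-community/gpt2",
    "EleutherAI/gpt-neo-1.3B",
    "openai-community/gpt2-large",
]

for model in MODELS_TO_TEST:
    ok, result = run_dynamic_perplexity_eval(model)
    if ok:
        print(f"{model}: perplexity = {result:.4f}")
    else:
        print(f"{model}: evaluation failed ({result})")
```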