Model Trace - Hugging Face Space Explanation
Overview
This repository hosts a Hugging Face Space that creates a dynamic leaderboard for evaluating language models. The space provides a web interface where users can submit models for evaluation and view results in a ranked leaderboard format.
How It Works
Architecture
The system consists of several key components:
Frontend Interface (app.py): A Gradio web application with three main tabs:
- 🏅 LLM Benchmark: Displays the main leaderboard
- 📝 About: Shows information about the evaluation process
- 🚀 Submit here!: Allows users to submit models for evaluation
Data Storage: Uses Hugging Face datasets to store:
- Evaluation Requests: Models waiting to be evaluated
- Evaluation Results: Completed evaluation results
Evaluation Queue System: Models go through different states (an example request record is sketched after this list):
- PENDING: Submitted but not yet evaluated
- RUNNING: Currently being evaluated
- FINISHED: Evaluation completed
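For context, each entry in the requests dataset is a small JSON file whose status field drives this queue. The exact schema depends on this Space's submission code; a plausible record, following the common demo-leaderboard layout (field names here are assumptions), looks roughly like this:

```json
{
  "model": "org/model-name",
  "revision": "main",
  "precision": "float16",
  "status": "PENDING",
  "submitted_time": "2024-06-01T12:00:00Z"
}
```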
Data Flow
- Model Submission: Users submit models through the web interface
- Validation: The system checks that the model exists on the Hugging Face Hub and has the required metadata (a check of this kind is sketched after this list)
- Queue Management: Valid models are added to the evaluation queue
- Evaluation: External evaluation system processes the models (not included in this repo)
- Results Display: Completed evaluations appear in the leaderboard
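The validation step lives in the Space's submission code; as a rough illustration of the kind of check involved (not this repo's exact logic), a model can be verified against the Hub like this:

```python
from huggingface_hub import HfApi

def model_exists_on_hub(model_name: str, revision: str = "main") -> bool:
    """Return True if the model repo is reachable on the Hub at the given revision."""
    try:
        HfApi().model_info(model_name, revision=revision)
        return True
    except Exception:
        # Missing repo, private repo, or unknown revision
        return False
```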
Configuration
The main configuration files are:
- src/envs.py: Repository settings and API tokens (a rough sketch follows this list)
- src/about.py: Task definitions and leaderboard metadata
- src/display/utils.py: Column definitions and display settings
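For orientation, src/envs.py in this style of leaderboard Space typically looks roughly like the sketch below; the names follow the common demo-leaderboard template and are assumptions rather than a copy of this repo's file:

```python
import os
from huggingface_hub import HfApi

# Write token with access to the requests/results datasets
TOKEN = os.environ.get("HF_TOKEN")

OWNER = "your-org-name"              # organization that owns the Space and datasets
QUEUE_REPO = f"{OWNER}/requests"     # pending evaluation requests
RESULTS_REPO = f"{OWNER}/results"    # completed evaluation results

# Local working copies of the two datasets
EVAL_REQUESTS_PATH = "eval-queue"
EVAL_RESULTS_PATH = "eval-results"

API = HfApi(token=TOKEN)
```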
Current Evaluation Tasks
The system is currently configured to evaluate models on:
- ANLI (Adversarial NLI) - accuracy metric
- LogiQA - normalized accuracy metric
Adding Dynamic Perplexity Testing
To add perplexity evaluation as a dynamic test, you'll need to make several modifications:
1. Update Task Configuration
First, modify src/about.py to add perplexity as a new task:
```python
class Tasks(Enum):
    # Existing tasks
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")
    # Add perplexity task
    task2 = Task("perplexity", "perplexity", "Perplexity")
```
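Each Task bundles the benchmark key to look up in the result files, the metric name inside that entry, and the column header shown in the leaderboard. In the demo-leaderboard template this is a small dataclass; if your about.py defines it differently, adapt the addition above accordingly. A sketch of the assumed definition:

```python
from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str   # key under "results" in each result JSON
    metric: str      # metric name inside that entry
    col_name: str    # column header shown in the leaderboard
```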
2. Create Perplexity Evaluation Script
Create a new file src/evaluation/perplexity_eval.py:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np


def evaluate_perplexity(model_name, revision="main", test_text=None):
    """
    Evaluate perplexity on a fixed piece of text.

    Args:
        model_name: Hugging Face model identifier
        revision: Model revision/commit hash
        test_text: Text to evaluate perplexity on (default if None)

    Returns:
        float: Perplexity score (lower is better)
    """
    # Default test text if none provided
    if test_text is None:
        test_text = """The quick brown fox jumps over the lazy dog. This is a standard test sentence that contains all the letters of the English alphabet. It is commonly used for testing fonts and keyboards."""

    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        revision=revision,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)

    # Tokenize the text
    inputs = tokenizer(test_text, return_tensors="pt")

    # Move to same device as model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Calculate loss
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

    # Calculate perplexity
    perplexity = torch.exp(loss).item()
    return perplexity


def create_perplexity_result(model_name, revision, precision, perplexity_score):
    """
    Create a result file in the expected format.
    """
    return {
        "config": {
            "model_dtype": f"torch.{precision}",
            "model_name": model_name,
            "model_sha": revision,
        },
        "results": {
            "perplexity": {
                "perplexity": perplexity_score,
            }
        },
    }
```
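A quick local sanity check of evaluate_perplexity might look like this (the model name is only an example):

```python
if __name__ == "__main__":
    score = evaluate_perplexity("openai-community/gpt2")
    print(f"Perplexity: {score:.2f}")
```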
3. Add Dynamic Evaluation Endpoint
Create a new file src/evaluation/dynamic_eval.py:
```python
import json
import os
from datetime import datetime

from src.evaluation.perplexity_eval import evaluate_perplexity, create_perplexity_result
from src.envs import EVAL_RESULTS_PATH, API, RESULTS_REPO


def run_dynamic_perplexity_eval(model_name, revision="main", precision="float16"):
    """
    Run perplexity evaluation and save results.
    """
    try:
        # Run evaluation
        perplexity_score = evaluate_perplexity(model_name, revision)

        # Create result structure
        result = create_perplexity_result(model_name, revision, precision, perplexity_score)

        # Save result file
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        result_filename = f"results_{model_name.replace('/', '_')}_{timestamp}.json"

        # Create directory structure
        org, model = model_name.split("/") if "/" in model_name else ("", model_name)
        result_dir = os.path.join(EVAL_RESULTS_PATH, org) if org else EVAL_RESULTS_PATH
        os.makedirs(result_dir, exist_ok=True)

        result_path = os.path.join(result_dir, result_filename)
        with open(result_path, "w") as f:
            json.dump(result, f, indent=2)

        # Upload to Hugging Face dataset
        API.upload_file(
            path_or_fileobj=result_path,
            path_in_repo=result_path.split("eval-results/")[1],
            repo_id=RESULTS_REPO,
            repo_type="dataset",
            commit_message=f"Add perplexity results for {model_name}",
        )

        return True, perplexity_score
    except Exception as e:
        return False, str(e)
```
4. Add Dynamic Testing Interface
Modify app.py to add a new tab for dynamic testing:
```python
# Add this import
from src.evaluation.dynamic_eval import run_dynamic_perplexity_eval


# Add this function
def run_perplexity_test(model_name, revision, precision):
    """Run perplexity evaluation on demand."""
    if not model_name:
        return "Please enter a model name."

    success, result = run_dynamic_perplexity_eval(model_name, revision, precision)

    if success:
        return (
            "✅ Perplexity evaluation completed!\n"
            f"Perplexity: {result:.4f}\n\n"
            "Results have been saved and will appear in the leaderboard shortly."
        )
    else:
        return f"❌ Evaluation failed: {result}"


# Add this to the demo interface (inside the gr.Blocks)
with gr.TabItem("🧪 Dynamic Testing", elem_id="dynamic-testing-tab", id=4):
    gr.Markdown("## Run Perplexity Evaluation")

    with gr.Row():
        with gr.Column():
            dynamic_model_name = gr.Textbox(label="Model name", placeholder="org/model-name")
            dynamic_revision = gr.Textbox(label="Revision", placeholder="main", value="main")
            dynamic_precision = gr.Dropdown(
                choices=["float16", "bfloat16"],
                label="Precision",
                value="float16",
            )
        with gr.Column():
            dynamic_test_button = gr.Button("🚀 Run Perplexity Test", variant="primary")
            dynamic_result = gr.Markdown()

    dynamic_test_button.click(
        run_perplexity_test,
        [dynamic_model_name, dynamic_revision, dynamic_precision],
        dynamic_result,
    )
```
5. Update Requirements
Add any additional dependencies to requirements.txt:
```
# Add if not already present
torch
transformers
accelerate
```
6. Configure Environment
Update src/envs.py to point to your repositories:
```python
OWNER = "your-org-name"  # Change this
```
You'll need to create two Hugging Face datasets:
- your-org-name/requests - for evaluation requests
- your-org-name/results - for evaluation results
How to Use the Dynamic Testing
- Deploy the Space: Push your changes to a Hugging Face Space
- Set Environment Variables: Add HF_TOKEN with write permissions
- Test Models: Use the "Dynamic Testing" tab to evaluate models on demand
- View Results: Results will appear in the main leaderboard
Key Features of Dynamic Testing
- On-Demand Evaluation: Test models immediately, without going through the queue
- Fixed Text: Uses consistent test text for fair comparison
- Automatic Ranking: Lower perplexity scores rank higher
- Real-time Results: See results immediately after evaluation
- Integration: Results automatically appear in the main leaderboard
Customization Options
You can customize the perplexity evaluation by:
- Changing Test Text: Modify the default text in perplexity_eval.py
- Adding Multiple Texts: Evaluate on multiple texts and average the results (see the sketch after this list)
- Different Metrics: Add other metrics like BLEU, ROUGE, etc.
- Model Loading Options: Customize model loading parameters
- Batch Processing: Process multiple models in sequence
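As a sketch of the multiple-texts option, you could wrap evaluate_perplexity in a small helper that averages over several passages (note that this simple version reloads the model once per text; for many passages you would want to load it once and reuse it):

```python
from src.evaluation.perplexity_eval import evaluate_perplexity

def evaluate_mean_perplexity(model_name, texts, revision="main"):
    """Average perplexity over several test passages for a more stable score."""
    # Reloads the model for every passage; fine for a handful of texts.
    scores = [evaluate_perplexity(model_name, revision, test_text=text) for text in texts]
    return sum(scores) / len(scores)
```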
Security Considerations
- Models must be public on Hugging Face Hub
- Evaluation runs in the Space's environment
- Results are publicly visible
- Consider rate limiting for dynamic testing (a minimal sketch follows)
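A minimal way to rate-limit the dynamic tab is a global cooldown around the Gradio handler; this sketch assumes the run_perplexity_test function added in step 4 and a window you would tune to your hardware:

```python
import time

_last_run = 0.0
MIN_INTERVAL_SECONDS = 300  # allow at most one dynamic evaluation every 5 minutes

def run_perplexity_test_throttled(model_name, revision, precision):
    """Wrap run_perplexity_test with a simple global cooldown."""
    global _last_run
    now = time.time()
    if now - _last_run < MIN_INTERVAL_SECONDS:
        wait = int(MIN_INTERVAL_SECONDS - (now - _last_run))
        return f"Please wait about {wait} seconds before starting another evaluation."
    _last_run = now
    return run_perplexity_test(model_name, revision, precision)
```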
This setup provides a complete dynamic testing system that integrates seamlessly with the existing leaderboard infrastructure.
MODELS TO TEST:
- openai-community/gpt2
- EleutherAI/gpt-neo-1.3B
- openai-community/gpt2-large
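To push the listed models through the dynamic evaluation in one go, a small driver script could reuse run_dynamic_perplexity_eval from step 3 (a sketch, meant to be run inside the Space's environment):

```python
from src.evaluation.dynamic_eval import run_dynamic_perplexity_eval

MODELS_TO_TEST = [
    "openai-community/gpt2",
    "EleutherAI/gpt-neo-1.3B",
    "openai-community/gpt2-large",
]

for model in MODELS_TO_TEST:
    ok, result = run_dynamic_perplexity_eval(model)
    if ok:
        print(f"{model}: perplexity = {result:.4f}")
    else:
        print(f"{model}: evaluation failed ({result})")
```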