Spaces:

newmindai
/

Mezura

Running

File size: 14,880 Bytes

from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Tasks for different evaluation categories
class Tasks(Enum):
    # Hybrid-Benchmark Tasks (model_evaluation_mlflow_result.json metrikleri)
    accuracy = Task("OpenAI-Scores", "accuracy", "Accuracy")
    relevance = Task("OpenAI-Scores", "relevance", "Relevance")
    coherence = Task("OpenAI-Scores", "coherence", "Coherence")
    
    # BLEU Score
    bleu_mean = Task("BLEU", "mean", "BLEU Mean")
    
    # ROUGE Scores
    rouge1_mean = Task("ROUGE", "rouge1_mean", "ROUGE-1")
    rouge2_mean = Task("ROUGE", "rouge2_mean", "ROUGE-2")
    rougeL_mean = Task("ROUGE", "rougeL_mean", "ROUGE-L")
    
    # BERT Score
    bert_precision = Task("BERT-Score", "precision_mean", "BERT Precision")
    bert_recall = Task("BERT-Score", "recall_mean", "BERT Recall")
    bert_f1 = Task("BERT-Score", "f1_mean", "BERT F1")
    
    # Cosine Similarity
    cosine_turkish = Task("Cosine-Similarity", "turkish_mean", "Turkish Similarity")
    cosine_multilingual = Task("Cosine-Similarity", "multilingual_mean", "Multilingual Similarity")
    
    # Evaluation Metrics
    total_samples = Task("Evaluation-Metrics", "total_samples", "Total Samples")
    avg_input_length = Task("Evaluation-Metrics", "avg_input_length", "Avg Input Length")
    avg_prediction_length = Task("Evaluation-Metrics", "avg_prediction_length", "Avg Prediction Length")
    avg_reference_length = Task("Evaluation-Metrics", "avg_reference_length", "Avg Reference Length")

# Leaderboard title
TITLE = """<h1 align="center" id="space-title">Newmind AI LLM Evaluation Leaderboard</h1>"""

# Introduction text explaining the leaderboard
INTRODUCTION_TEXT = """
Evaluate your model's performance in the following categories:

1. ⚔️ **Auto Arena** - Tournament-style evaluation where models are directly compared and ranked using an ELO rating system.

2. 👥 **Human Arena** - Comparative evaluation based on human preferences, assessed by a reviewer group.

3. 📚 **Retrieval** - Evaluation focused on information retrieval and text generation quality for real-world applications.

4. ⚡ **Light Eval** - Fast and efficient model evaluation framework for quick testing.

5. 🔄 **EvalMix** - Multi-dimensional evaluation including lexical accuracy and semantic coherence.

6. 🐍 **Snake Bench** - Specialized evaluation measuring step-by-step problem solving and complex reasoning abilities.

7. 🧩 **Structured Outputs** - Coming soon!

Evaluate your model in any or all of these categories to discover its capabilities and areas of excellence.

For any questions, please contact us at info@newmind.ai
"""

# Detailed explanation of benchmarks and reproduction steps
LLM_BENCHMARKS_TEXT = """
<h2 align="center">Evaluation Categories</h2>

### 1. ⚔️ Arena-Hard-Auto: Competitive Benchmarking at Scale
Arena-Hard-Auto is a cutting-edge automatic evaluation framework tailored for instruction-tuned Large Language Models (LLMs). Leveraging a tournament-style evaluation methodology, it pits models against each other in head-to-head matchups, with performance rankings determined via the Elo rating system—a method proven to align closely with human judgment, as evidenced in Chatbot Arena benchmarks.
This evaluation suite is grounded in real-world use cases, benchmarking models across 11 diverse legal tasks, encompassing an extensive Turkish legal question-answer pairs.

**Key Evaluation Pillars**
- Automated Judging
Relies on specialized judge models to assess and determine the winner in each model-versus-model comparison.
Includes dynamic system prompt adaptation to ensure context-aware evaluation based on the specific domain of the query.
- Win Probability Estimation
Computes the probability of victory for each model in a matchup using a logistic regression model, offering a probabilistic understanding of comparative strength.
- Skill Rating (Elo Score)
Utilizes the Elo rating system to provide a robust measurement of each model’s skill level relative to its competitors, ensuring a competitive and evolving leaderboard.
- Win Rate
Measures a model’s dominance by calculating the proportion of head-to-head victories, serving as a direct indicator of real-world performance across the benchmark suite.
Arena-Hard-Auto offers a fast, scalable, and reliable method to benchmark LLMs, combining quantitative rigor with realistic interaction scenarios—making it an indispensable tool for understanding model capabilities in high-stakes legal and instructional domains.

To reproduce the results, you can use this repository: https://github.com/lmarena/arena-hard-auto

### 2. 🔄 **EvalMix**
EvalMix is a comprehensive evaluation pipeline designed to assess the performance of language models across multiple dimensions. This hybrid evaluation tool automatically analyzes model outputs, computes various semantic, LLM-based, and lexical metrics, and visualizes the results. EvalMix offers the following features:

### Comprehensive Evaluation Metrics

* **LLM-as-a-Judge**: Uses large language models—primarily GPT variants—to evaluate the accuracy, coherence, and relevance of generated responses.
* **Lexical Metrics**: Calculates traditional NLP metrics such as BLEU, ROUGE-1, ROUGE-2, and ROUGE-L, along with modern metrics like BERTScore (precision, recall, F1) and cosine similarity.
* **Comparative Analysis**: Enables performance comparison between multiple models on the same dataset.
* **Cosine Similarity (Turkish)**: Assesses performance in the Turkish language using Turkish-specific embedding models.
* **Cosine Similarity (Multilingual)**: Measures multilingual performance using language-agnostic embeddings.

### Generation Configuration for Evaluation

The following configuration parameters are used for model generation during evaluation:

```json
{
  "num_samples": 1100,
  "random_seed": 42,
  "temperature": 0.0,
  "max_completion_tokens": 1024
}
```

### 3. ⚡ Light-Eval
LightEval is a fast and modular framework designed to evaluate Large Language Models (LLMs) across a diverse range of tasks. It provides a comprehensive performance analysis by benchmarking models on academic, logical, scientific, and mathematical reasoning challenges.

**Evaluation Tasks and Objectives**
LightEval assesses model capabilities using the following six core tasks:
- MMLU (5-shot): Evaluates general knowledge and reasoning skills across academic and professional disciplines.(Proffesional-law task only)
- TruthfulQA (0-shot): Measures the model's ability to generate accurate and truthful responses.
- Winogrande (5-shot): Tests commonsense reasoning and logical inference abilities.
- Hellaswag (10-shot): Assesses the coherence and logical consistency of model predictions based on contextual cues.
- GSM8k (5-shot): Evaluates step-by-step mathematical reasoning and problem-solving capabilities.
- ARC (25-shot): Tests scientific reasoning and the ability to solve science-based problems.

**Overall Score Calculation**
The overall performance score of a model is computed using the average of the six evaluation tasks:
LightEval Overall Score = (MMLU_proffesional_law + TruthfulQA + Winogrande + Hellaswag + GSM8k + ARC) / 6

To reproduce the results, you can use this repository: https://github.com/huggingface/lighteval

### 4. 🐍 Snake-Eval
An evaluation framework where models play the classic Snake game, competing to collect apples while avoiding collisions.
Starting from random positions, models must guide their snakes using step-by-step reasoning, with performance measured as an Elo rating. 
This tests problem-solving ability, spatial awareness, and logical thinking in a challenging environment.

**Sample Prompt:**
```
You are controlling a snake in a multi-apple Snake game. The board size is 10x10. 
Normal X,Y coordinates are used. Coordinates range from (0,0) at bottom left to (9,9) at top right.

Apples at: (9, 6), (0, 2), (5, 9), (1, 7), (9, 7)

Your snake ID: 1 which is currently positioned at (5, 1)

Enemy snakes positions:
* Snake #2 is at position (7, 1) with body at []

Board state:
9 . . . . . A . . . .
8 . . . . . . . . . .
7 . A . . . . . . . A
6 . . . . . . . . . A
5 . . . . . . . . . .
4 . . . . . . . . . .
3 . . . . . . . . . .
2 A . . . . . . . . .
1 . . . . . 1 . 2 . .
0 . . . . . . . . . .
  0 1 2 3 4 5 6 7 8 9

--Your last move information:--
Direction: LEFT
Rationale: I'm noticing that (0,2) is the closest apple from our head at (6,1). 
Moving LEFT starts guiding us toward this apple while avoiding the enemy snake.
Strategy: Continue left and then maneuver upward to reach the apple at (0,2).
--End of your last move information.--

Rules:<
1) If you move onto an apple, you grow and gain 1 point.
2) If you hit a wall, another snake, or yourself, you die.
3) The goal is to have the most points by the end.

Decreasing x coordinate: LEFT, increasing x coordinate: RIGHT
Decreasing y coordinate: DOWN, increasing y coordinate: UP

Provide your reasoning and end with: UP, DOWN, LEFT, or RIGHT.
```

To reproduce the results, you can use this repository: https://github.com/gkamradt/SnakeBench
 
### 5. 📚 Retrieval
An evaluation system designed to assess Retrieval-Augmented Generation (RAG) capabilities. It measures how well models can:
- Retrieve relevant information from a knowledge base
- Generate accurate and contextually appropriate responses
- Maintain coherence between retrieved information and generated text
- Handle real-world information retrieval scenarios

**Retrieval Metrics**
   - **RAG Success Rate**: Percentage of successful retrievals
   - **Maximum Correct References**: Upper limit for correct retrievals per query
   - **Hallucinated References**: Number of irrelevant documents retrieved
   - **Missed References**: Number of relevant documents not retrieved

**LLM Judge Evaluation Metrics**
- **Legal Reasoning**: Assesses the model's ability to understand and apply legal concepts
- **Factual Legal Accuracy**: Measures accuracy of legal facts and references
- **Clarity & Precision**: Evaluates clarity and precision of responses
- **Factual Reliability**: Checks for biases and factual accuracy
- **Fluency**: Assesses language fluency and coherence
- **Relevance**: Measures response relevance to the query
- **Content Safety**: Evaluates content safety and appropriateness
    
**Judge Model**: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1

**RAG Score Calculation**
The RAG Score is a comprehensive metric that combines multiple performance indicators using dynamic normalization across all models. The formula weights different aspects of retrieval performance:

**Formula Components:**
- **RAG Success Rate** (0.9 weight): Direct percentage of successful retrievals (higher is better)
- **Normalized False Positives** (0.9 weight): Hallucinated references, min-max normalized (lower is better)
- **Normalized Max Correct References** (0.1 weight): Maximum correct retrievals, min-max normalized (higher is better)  
- **Normalized Missed References** (0.1 weight): Relevant documents not retrieved, min-max normalized (lower is better)

**Final Score Formula:**
```
RAG Score = (0.9 × RAG_success_rate + 0.9 × norm_false_positives + 
             0.1 × norm_max_correct + 0.1 × norm_missed_refs) ÷ 2.0
```

### 6. 👥 Human Arena
Human Arena is a community-driven evaluation platform where language models are compared through human preferences and voting. This evaluation method captures real-world user preferences and provides insights into model performance from a human perspective.

**Evaluation Methodology**
- **Head-to-Head Comparisons**: Models are presented with the same prompts and their responses are compared by human evaluators
- **ELO Rating System**: Similar to chess rankings, models gain or lose rating points based on wins, losses, and ties against other models
- **Community Voting**: Real users vote on which model responses they prefer, ensuring diverse evaluation perspectives
- **Blind Evaluation**: Evaluators see responses without knowing which model generated them, reducing bias

**Key Metrics**
- **ELO Rating**: Overall skill level based on tournament-style matchups (higher is better)
- **Win Rate**: Percentage of head-to-head victories against other models
- **Wins/Losses/Ties**: Direct comparison statistics showing model performance
- **Total Games**: Number of evaluation rounds completed
- **Votes**: Community engagement and evaluation volume
- **Provider & Technical Details**: Infrastructure and model configuration information

**Evaluation Criteria**
Human evaluators consider multiple factors when comparing model responses:
- **Response Quality**: Accuracy, completeness, and relevance of answers
- **Communication Style**: Clarity, coherence, and appropriateness of language
- **Helpfulness**: How well the response addresses the user's needs
- **Safety & Ethics**: Adherence to safety guidelines and ethical considerations
- **Creativity & Originality**: For tasks requiring creative or innovative thinking

Human Arena provides a complementary perspective to automated benchmarks, capturing nuanced human preferences that traditional metrics might miss. This evaluation is particularly valuable for understanding how models perform in real-world conversational scenarios.

"""

EVALUATION_QUEUE_TEXT = """
<h2 align="center">Model Evaluation Steps</h2>

### Evaluation Process:

1. **Select Evaluation Type**
   - Choose the category you want to evaluate from the dropdown menu above
   - Each category measures different capabilities (Arena, Hybrid-Benchmark, LM-Harness, etc.)

2. **Enter Parameters**
   - Input required parameters such as model name, dataset, number of samples
   - Adjust generation parameters like temperature and maximum tokens
   - Optionally enable MLflow or Weights & Biases integration

3. **Start Evaluation**
   - Click the "Submit Eval" button to start the evaluation
   - Your submission will be added to the "Pending Evaluation Queue"

4. **Monitor Evaluation**
   - When the evaluation starts, it will appear in the "Running Evaluation Queue"
   - Once completed, it will move to the "Finished Evaluations" section

5. **View Results**
   - After the evaluation is complete, results will be displayed in the "LLM Benchmark" tab
   - You can compare your model's performance with other models

### Important Limitations: 
- The model repository must be a maximum of 750 MB in size
- For trained adapters, the maximum LoRA rank must be 32
"""

CITATION_BUTTON_LABEL = "Cite these results"

CITATION_BUTTON_TEXT = r"""
@misc{newmind-mezura,
  author = {Newmind AI},
  title = {Newmind AI LLM Evaluation Leaderboard},
  year = {2025},
  publisher = {Newmind AI},
  howpublished = "\url{https://huggingface.co/spaces/newmindai/Mezura}"
}
"""