from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


class Tasks(Enum):
    accuracy = Task("OpenAI-Scores", "accuracy", "Accuracy")
    relevance = Task("OpenAI-Scores", "relevance", "Relevance")
    coherence = Task("OpenAI-Scores", "coherence", "Coherence")

    bleu_mean = Task("BLEU", "mean", "BLEU Mean")

    rouge1_mean = Task("ROUGE", "rouge1_mean", "ROUGE-1")
    rouge2_mean = Task("ROUGE", "rouge2_mean", "ROUGE-2")
    rougeL_mean = Task("ROUGE", "rougeL_mean", "ROUGE-L")

    bert_precision = Task("BERT-Score", "precision_mean", "BERT Precision")
    bert_recall = Task("BERT-Score", "recall_mean", "BERT Recall")
    bert_f1 = Task("BERT-Score", "f1_mean", "BERT F1")

    cosine_turkish = Task("Cosine-Similarity", "turkish_mean", "Turkish Similarity")
    cosine_multilingual = Task("Cosine-Similarity", "multilingual_mean", "Multilingual Similarity")

    total_samples = Task("Evaluation-Metrics", "total_samples", "Total Samples")
    avg_input_length = Task("Evaluation-Metrics", "avg_input_length", "Avg Input Length")
    avg_prediction_length = Task("Evaluation-Metrics", "avg_prediction_length", "Avg Prediction Length")
    avg_reference_length = Task("Evaluation-Metrics", "avg_reference_length", "Avg Reference Length")

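# Illustrative usage (a sketch, not part of this module's public interface):
# a leaderboard UI could derive its column headers and metric keys directly
# from the enum above, e.g.
#
#     display_columns = [task.value.col_name for task in Tasks]
#     metric_keys = [(task.value.benchmark, task.value.metric) for task in Tasks]
#
# The names display_columns and metric_keys are hypothetical examples.
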
TITLE = """<h1 align="center" id="space-title">Newmind AI LLM Evaluation Leaderboard</h1>"""

INTRODUCTION_TEXT = """
Evaluate your model's performance in the following categories:

1. ⚔️ **Auto Arena** - Tournament-style evaluation where models are directly compared and ranked using an ELO rating system.

2. 👥 **Human Arena** - Comparative evaluation based on human preferences, assessed by a reviewer group.

3. 📚 **Retrieval** - Evaluation focused on information retrieval and text generation quality for real-world applications.

4. ⚡ **Light Eval** - Fast and efficient model evaluation framework for quick testing.

5. 🔄 **EvalMix** - Multi-dimensional evaluation including lexical accuracy and semantic coherence.

6. 🐍 **Snake Bench** - Specialized evaluation measuring step-by-step problem solving and complex reasoning abilities.

7. 🧩 **Structured Outputs** - Coming soon!

Evaluate your model in any or all of these categories to discover its capabilities and areas of excellence.

For any questions, please contact us at info@newmind.ai
"""

LLM_BENCHMARKS_TEXT = """
<h2 align="center">Evaluation Categories</h2>

### 1. ⚔️ Arena-Hard-Auto: Competitive Benchmarking at Scale
Arena-Hard-Auto is a cutting-edge automatic evaluation framework tailored for instruction-tuned Large Language Models (LLMs). Leveraging a tournament-style evaluation methodology, it pits models against each other in head-to-head matchups, with performance rankings determined via the Elo rating system, a method shown to align closely with human judgment in Chatbot Arena benchmarks.
This evaluation suite is grounded in real-world use cases, benchmarking models across 11 diverse legal tasks built on an extensive set of Turkish legal question-answer pairs.

**Key Evaluation Pillars**
- Automated Judging
  Relies on specialized judge models to assess and determine the winner in each model-versus-model comparison.
  Includes dynamic system prompt adaptation to ensure context-aware evaluation based on the specific domain of the query.
- Win Probability Estimation
  Computes the probability of victory for each model in a matchup using a logistic regression model, offering a probabilistic understanding of comparative strength (see the illustrative sketch below).
- Skill Rating (Elo Score)
  Utilizes the Elo rating system to provide a robust measurement of each model's skill level relative to its competitors, ensuring a competitive and evolving leaderboard.
- Win Rate
  Measures a model's dominance by calculating the proportion of head-to-head victories, serving as a direct indicator of real-world performance across the benchmark suite.

Arena-Hard-Auto offers a fast, scalable, and reliable method to benchmark LLMs, combining quantitative rigor with realistic interaction scenarios, making it an indispensable tool for understanding model capabilities in high-stakes legal and instructional domains.

To reproduce the results, you can use this repository: https://github.com/lmarena/arena-hard-auto

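To make the Elo and win-probability mechanics concrete, here is a minimal, illustrative sketch. This is not the Arena-Hard-Auto implementation; the 400-point logistic scale and the K-factor of 32 are standard chess-style conventions assumed here for illustration only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    # Logistic win probability of model A over model B on the usual 400-point Elo scale.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    # score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Example: a 1600-rated model beats a 1500-rated model.
print(expected_score(1600, 1500))   # ~0.64 win probability
print(update_elo(1600, 1500, 1.0))  # winner gains ~11.5 points, loser loses the same
```
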
### 2. 🔄 **EvalMix**
EvalMix is a comprehensive evaluation pipeline designed to assess the performance of language models across multiple dimensions. This hybrid evaluation tool automatically analyzes model outputs, computes various semantic, LLM-based, and lexical metrics, and visualizes the results. EvalMix offers the following features:

### Comprehensive Evaluation Metrics

* **LLM-as-a-Judge**: Uses large language models (primarily GPT variants) to evaluate the accuracy, coherence, and relevance of generated responses.
* **Lexical Metrics**: Calculates traditional NLP metrics such as BLEU, ROUGE-1, ROUGE-2, and ROUGE-L, along with modern metrics like BERTScore (precision, recall, F1) and cosine similarity (a minimal computation sketch follows this list).
* **Comparative Analysis**: Enables performance comparison between multiple models on the same dataset.
* **Cosine Similarity (Turkish)**: Assesses performance in the Turkish language using Turkish-specific embedding models.
* **Cosine Similarity (Multilingual)**: Measures multilingual performance using language-agnostic embeddings.

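The sketch below shows one way these lexical metrics could be computed for a single prediction-reference pair using common open-source libraries (sacrebleu, rouge-score, bert-score, sentence-transformers). It is illustrative only; EvalMix's actual implementation, aggregation, and embedding model choices may differ, and the model name used here is an assumption.

```python
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util

prediction = "Mahkeme davaya gelecek hafta bakacak."
reference = "Mahkeme davayı önümüzdeki hafta görecek."

# BLEU (corpus-level, 0-100 scale)
bleu = sacrebleu.corpus_bleu([prediction], [[reference]]).score

# ROUGE-1 / ROUGE-2 / ROUGE-L F-measures (stemmer disabled because it is English-only)
rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
rouge_scores = {name: s.fmeasure for name, s in rouge.score(reference, prediction).items()}

# BERTScore precision / recall / F1 with the default multilingual model for Turkish
p, r, f1 = bert_score([prediction], [reference], lang="tr")

# Cosine similarity with a multilingual sentence embedding model (model name is an assumption)
embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
emb = embedder.encode([prediction, reference], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

print(bleu, rouge_scores, p.item(), r.item(), f1.item(), cosine)
```
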
### Generation Configuration for Evaluation

The following configuration parameters are used for model generation during evaluation:

```json
{
    "num_samples": 1100,
    "random_seed": 42,
    "temperature": 0.0,
    "max_completion_tokens": 1024
}
```

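As a rough illustration of how these parameters might drive generation, the sketch below maps them onto an OpenAI-compatible chat completion call. The client setup, placeholder model name, and inline dataset are assumptions for illustration, not the evaluation pipeline's actual code.

```python
import random
from openai import OpenAI  # works with any OpenAI-compatible endpoint

config = {"num_samples": 1100, "random_seed": 42, "temperature": 0.0, "max_completion_tokens": 1024}

# Hypothetical evaluation prompts; the real pipeline loads its own dataset.
dataset = ["Soru 1 ...", "Soru 2 ...", "Soru 3 ..."]

random.seed(config["random_seed"])  # reproducible item selection
samples = random.sample(dataset, min(config["num_samples"], len(dataset)))

client = OpenAI()  # assumes an API key / base_url is configured in the environment
for prompt in samples:
    response = client.chat.completions.create(
        model="your-model-name",  # placeholder identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=config["temperature"],
        max_completion_tokens=config["max_completion_tokens"],
    )
    print(response.choices[0].message.content)
```
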
### 3. ⚡ Light-Eval
LightEval is a fast and modular framework designed to evaluate Large Language Models (LLMs) across a diverse range of tasks. It provides a comprehensive performance analysis by benchmarking models on academic, logical, scientific, and mathematical reasoning challenges.

**Evaluation Tasks and Objectives**
LightEval assesses model capabilities using the following six core tasks:
- MMLU (5-shot): Evaluates general knowledge and reasoning skills across academic and professional disciplines (professional_law task only).
- TruthfulQA (0-shot): Measures the model's ability to generate accurate and truthful responses.
- Winogrande (5-shot): Tests commonsense reasoning and logical inference abilities.
- HellaSwag (10-shot): Assesses the coherence and logical consistency of model predictions based on contextual cues.
- GSM8K (5-shot): Evaluates step-by-step mathematical reasoning and problem-solving capabilities.
- ARC (25-shot): Tests scientific reasoning and the ability to solve science-based problems.

**Overall Score Calculation**
The overall performance score of a model is computed as the average of the six evaluation tasks (see the sketch below):

LightEval Overall Score = (MMLU_professional_law + TruthfulQA + Winogrande + HellaSwag + GSM8K + ARC) / 6

To reproduce the results, you can use this repository: https://github.com/huggingface/lighteval

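A minimal sketch of the averaging step, using made-up per-task scores purely for illustration:

```python
# Hypothetical per-task accuracies (fractions between 0 and 1), for illustration only.
task_scores = {
    "mmlu_professional_law": 0.61,
    "truthfulqa": 0.48,
    "winogrande": 0.74,
    "hellaswag": 0.80,
    "gsm8k": 0.55,
    "arc": 0.70,
}

lighteval_overall = sum(task_scores.values()) / len(task_scores)
print(round(lighteval_overall, 4))  # 0.6467
```
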
### 4. 🐍 Snake-Eval
An evaluation framework where models play the classic Snake game, competing to collect apples while avoiding collisions. Starting from random positions, models must guide their snakes using step-by-step reasoning, with performance measured as an Elo rating. This tests problem-solving ability, spatial awareness, and logical thinking in a challenging environment.

**Sample Prompt:**
```
You are controlling a snake in a multi-apple Snake game. The board size is 10x10.
Normal X,Y coordinates are used. Coordinates range from (0,0) at bottom left to (9,9) at top right.

Apples at: (9, 6), (0, 2), (5, 9), (1, 7), (9, 7)

Your snake ID: 1 which is currently positioned at (5, 1)

Enemy snakes positions:
* Snake #2 is at position (7, 1) with body at []

Board state:
9 . . . . . A . . . .
8 . . . . . . . . . .
7 . A . . . . . . . A
6 . . . . . . . . . A
5 . . . . . . . . . .
4 . . . . . . . . . .
3 . . . . . . . . . .
2 A . . . . . . . . .
1 . . . . . 1 . 2 . .
0 . . . . . . . . . .
  0 1 2 3 4 5 6 7 8 9

--Your last move information:--
Direction: LEFT
Rationale: I'm noticing that (0,2) is the closest apple from our head at (6,1).
Moving LEFT starts guiding us toward this apple while avoiding the enemy snake.
Strategy: Continue left and then maneuver upward to reach the apple at (0,2).
--End of your last move information.--

Rules:
1) If you move onto an apple, you grow and gain 1 point.
2) If you hit a wall, another snake, or yourself, you die.
3) The goal is to have the most points by the end.

Decreasing x coordinate: LEFT, increasing x coordinate: RIGHT
Decreasing y coordinate: DOWN, increasing y coordinate: UP

Provide your reasoning and end with: UP, DOWN, LEFT, or RIGHT.
```

To reproduce the results, you can use this repository: https://github.com/gkamradt/SnakeBench

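Since the prompt asks the model to end its reply with one of UP, DOWN, LEFT, or RIGHT, a harness has to pull that final decision out of the free-form reasoning. Below is a minimal sketch of one way to do this; it is not SnakeBench's actual parser.

```python
VALID_MOVES = ("UP", "DOWN", "LEFT", "RIGHT")


def extract_move(reply: str) -> str | None:
    # Scan tokens from the end and return the last direction word mentioned.
    tokens = reply.upper().replace(",", " ").replace(".", " ").split()
    for token in reversed(tokens):
        if token in VALID_MOVES:
            return token
    return None  # no valid move found; a harness might treat this as a forfeit


reply = "The apple at (0,2) is still to my left, so I will continue moving LEFT."
print(extract_move(reply))  # LEFT
```
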
### 5. 📚 Retrieval
An evaluation system designed to assess Retrieval-Augmented Generation (RAG) capabilities. It measures how well models can:
- Retrieve relevant information from a knowledge base
- Generate accurate and contextually appropriate responses
- Maintain coherence between retrieved information and generated text
- Handle real-world information retrieval scenarios

**Retrieval Metrics**
- **RAG Success Rate**: Percentage of successful retrievals
- **Maximum Correct References**: Upper limit for correct retrievals per query
- **Hallucinated References**: Number of irrelevant documents retrieved
- **Missed References**: Number of relevant documents not retrieved (see the sketch after this list for how these counts can be derived)

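The sketch below derives the hallucinated and missed reference counts for a single query from set arithmetic over document identifiers. It illustrates the definitions above and is not the leaderboard's exact implementation; the document IDs are hypothetical.

```python
# Hypothetical document IDs for one query.
retrieved = {"doc_12", "doc_34", "doc_99"}   # what the RAG system returned
relevant = {"doc_12", "doc_56"}              # gold references for the query

hallucinated_refs = len(retrieved - relevant)  # returned but not relevant -> 2
missed_refs = len(relevant - retrieved)        # relevant but not returned -> 1
correct_refs = len(retrieved & relevant)       # correctly retrieved       -> 1

print(hallucinated_refs, missed_refs, correct_refs)
```
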
**LLM Judge Evaluation Metrics**
- **Legal Reasoning**: Assesses the model's ability to understand and apply legal concepts
- **Factual Legal Accuracy**: Measures accuracy of legal facts and references
- **Clarity & Precision**: Evaluates clarity and precision of responses
- **Factual Reliability**: Checks for biases and factual accuracy
- **Fluency**: Assesses language fluency and coherence
- **Relevance**: Measures response relevance to the query
- **Content Safety**: Evaluates content safety and appropriateness

**Judge Model**: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1

**RAG Score Calculation**
The RAG Score is a comprehensive metric that combines multiple performance indicators using dynamic normalization across all models. The formula weights different aspects of retrieval performance:

**Formula Components:**
- **RAG Success Rate** (0.9 weight): Direct percentage of successful retrievals (higher is better)
- **Normalized False Positives** (0.9 weight): Hallucinated references, min-max normalized (lower is better)
- **Normalized Max Correct References** (0.1 weight): Maximum correct retrievals, min-max normalized (higher is better)
- **Normalized Missed References** (0.1 weight): Relevant documents not retrieved, min-max normalized (lower is better)

**Final Score Formula:**
```
RAG Score = (0.9 × RAG_success_rate + 0.9 × norm_false_positives +
             0.1 × norm_max_correct + 0.1 × norm_missed_refs) / 2.0
```

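Below is an illustrative sketch of how such a score could be computed across a set of models. It assumes the success rate is expressed as a fraction in [0, 1] and that the min-max normalization for the lower-is-better components (false positives, missed references) is inverted so that a higher normalized value always means better performance; the leaderboard's actual code may differ, and all numbers are made up.

```python
def min_max(values, invert=False):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]  # all models identical on this component
    normed = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - n for n in normed] if invert else normed


# Hypothetical per-model raw statistics (illustration only).
models = ["model-a", "model-b", "model-c"]
success_rate = [0.82, 0.75, 0.90]      # assumed fractions in [0, 1]
false_positives = [3, 7, 1]            # hallucinated references (lower is better)
max_correct = [5, 4, 6]                # maximum correct references (higher is better)
missed_refs = [2, 5, 1]                # missed references (lower is better)

norm_fp = min_max(false_positives, invert=True)
norm_mc = min_max(max_correct)
norm_mr = min_max(missed_refs, invert=True)

for i, name in enumerate(models):
    rag_score = (0.9 * success_rate[i] + 0.9 * norm_fp[i] +
                 0.1 * norm_mc[i] + 0.1 * norm_mr[i]) / 2.0
    print(name, round(rag_score, 3))
```
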
### 6. 👥 Human Arena
Human Arena is a community-driven evaluation platform where language models are compared through human preferences and voting. This evaluation method captures real-world user preferences and provides insights into model performance from a human perspective.

**Evaluation Methodology**
- **Head-to-Head Comparisons**: Models are presented with the same prompts and their responses are compared by human evaluators
- **ELO Rating System**: Similar to chess rankings, models gain or lose rating points based on wins, losses, and ties against other models
- **Community Voting**: Real users vote on which model responses they prefer, ensuring diverse evaluation perspectives
- **Blind Evaluation**: Evaluators see responses without knowing which model generated them, reducing bias

**Key Metrics**
- **ELO Rating**: Overall skill level based on tournament-style matchups (higher is better)
- **Win Rate**: Percentage of head-to-head victories against other models
- **Wins/Losses/Ties**: Direct comparison statistics showing model performance
- **Total Games**: Number of evaluation rounds completed
- **Votes**: Community engagement and evaluation volume
- **Provider & Technical Details**: Infrastructure and model configuration information

**Evaluation Criteria**
Human evaluators consider multiple factors when comparing model responses:
- **Response Quality**: Accuracy, completeness, and relevance of answers
- **Communication Style**: Clarity, coherence, and appropriateness of language
- **Helpfulness**: How well the response addresses the user's needs
- **Safety & Ethics**: Adherence to safety guidelines and ethical considerations
- **Creativity & Originality**: For tasks requiring creative or innovative thinking

Human Arena provides a complementary perspective to automated benchmarks, capturing nuanced human preferences that traditional metrics might miss. This evaluation is particularly valuable for understanding how models perform in real-world conversational scenarios.
"""

EVALUATION_QUEUE_TEXT = """
<h2 align="center">Model Evaluation Steps</h2>

### Evaluation Process:

1. **Select Evaluation Type**
   - Choose the category you want to evaluate from the dropdown menu above
   - Each category measures different capabilities (Arena, Hybrid-Benchmark, LM-Harness, etc.)

2. **Enter Parameters**
   - Input required parameters such as model name, dataset, and number of samples
   - Adjust generation parameters like temperature and maximum tokens
   - Optionally enable MLflow or Weights & Biases integration

3. **Start Evaluation**
   - Click the "Submit Eval" button to start the evaluation
   - Your submission will be added to the "Pending Evaluation Queue"

4. **Monitor Evaluation**
   - When the evaluation starts, it will appear in the "Running Evaluation Queue"
   - Once completed, it will move to the "Finished Evaluations" section

5. **View Results**
   - After the evaluation is complete, results will be displayed in the "LLM Benchmark" tab
   - You can compare your model's performance with other models

### Important Limitations:
- The model repository must be at most 750 MB in size
- For trained adapters, the LoRA rank must be at most 32
"""

CITATION_BUTTON_LABEL = "Cite these results"

CITATION_BUTTON_TEXT = r"""
@misc{newmind-mezura,
    author       = {Newmind AI},
    title        = {Newmind AI LLM Evaluation Leaderboard},
    year         = {2025},
    publisher    = {Newmind AI},
    howpublished = "\url{https://huggingface.co/spaces/newmindai/Mezura}"
}
"""