File size: 14,880 Bytes
3232d64 1326dcc 3232d64 8c404fc 3232d64 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 |
from dataclasses import dataclass
from enum import Enum
@dataclass
class Task:
benchmark: str
metric: str
col_name: str
# Tasks for different evaluation categories
class Tasks(Enum):
# Hybrid-Benchmark Tasks (model_evaluation_mlflow_result.json metrikleri)
accuracy = Task("OpenAI-Scores", "accuracy", "Accuracy")
relevance = Task("OpenAI-Scores", "relevance", "Relevance")
coherence = Task("OpenAI-Scores", "coherence", "Coherence")
# BLEU Score
bleu_mean = Task("BLEU", "mean", "BLEU Mean")
# ROUGE Scores
rouge1_mean = Task("ROUGE", "rouge1_mean", "ROUGE-1")
rouge2_mean = Task("ROUGE", "rouge2_mean", "ROUGE-2")
rougeL_mean = Task("ROUGE", "rougeL_mean", "ROUGE-L")
# BERT Score
bert_precision = Task("BERT-Score", "precision_mean", "BERT Precision")
bert_recall = Task("BERT-Score", "recall_mean", "BERT Recall")
bert_f1 = Task("BERT-Score", "f1_mean", "BERT F1")
# Cosine Similarity
cosine_turkish = Task("Cosine-Similarity", "turkish_mean", "Turkish Similarity")
cosine_multilingual = Task("Cosine-Similarity", "multilingual_mean", "Multilingual Similarity")
# Evaluation Metrics
total_samples = Task("Evaluation-Metrics", "total_samples", "Total Samples")
avg_input_length = Task("Evaluation-Metrics", "avg_input_length", "Avg Input Length")
avg_prediction_length = Task("Evaluation-Metrics", "avg_prediction_length", "Avg Prediction Length")
avg_reference_length = Task("Evaluation-Metrics", "avg_reference_length", "Avg Reference Length")
# Leaderboard title
TITLE = """<h1 align="center" id="space-title">Newmind AI LLM Evaluation Leaderboard</h1>"""
# Introduction text explaining the leaderboard
INTRODUCTION_TEXT = """
Evaluate your model's performance in the following categories:
1. ⚔️ **Auto Arena** - Tournament-style evaluation where models are directly compared and ranked using an ELO rating system.
2. 👥 **Human Arena** - Comparative evaluation based on human preferences, assessed by a reviewer group.
3. 📚 **Retrieval** - Evaluation focused on information retrieval and text generation quality for real-world applications.
4. ⚡ **Light Eval** - Fast and efficient model evaluation framework for quick testing.
5. 🔄 **EvalMix** - Multi-dimensional evaluation including lexical accuracy and semantic coherence.
6. 🐍 **Snake Bench** - Specialized evaluation measuring step-by-step problem solving and complex reasoning abilities.
7. 🧩 **Structured Outputs** - Coming soon!
Evaluate your model in any or all of these categories to discover its capabilities and areas of excellence.
For any questions, please contact us at info@newmind.ai
"""
# Detailed explanation of benchmarks and reproduction steps
LLM_BENCHMARKS_TEXT = """
<h2 align="center">Evaluation Categories</h2>
### 1. ⚔️ Arena-Hard-Auto: Competitive Benchmarking at Scale
Arena-Hard-Auto is a cutting-edge automatic evaluation framework tailored for instruction-tuned Large Language Models (LLMs). Leveraging a tournament-style evaluation methodology, it pits models against each other in head-to-head matchups, with performance rankings determined via the Elo rating system—a method proven to align closely with human judgment, as evidenced in Chatbot Arena benchmarks.
This evaluation suite is grounded in real-world use cases, benchmarking models across 11 diverse legal tasks, encompassing an extensive Turkish legal question-answer pairs.
**Key Evaluation Pillars**
- Automated Judging
Relies on specialized judge models to assess and determine the winner in each model-versus-model comparison.
Includes dynamic system prompt adaptation to ensure context-aware evaluation based on the specific domain of the query.
- Win Probability Estimation
Computes the probability of victory for each model in a matchup using a logistic regression model, offering a probabilistic understanding of comparative strength.
- Skill Rating (Elo Score)
Utilizes the Elo rating system to provide a robust measurement of each model’s skill level relative to its competitors, ensuring a competitive and evolving leaderboard.
- Win Rate
Measures a model’s dominance by calculating the proportion of head-to-head victories, serving as a direct indicator of real-world performance across the benchmark suite.
Arena-Hard-Auto offers a fast, scalable, and reliable method to benchmark LLMs, combining quantitative rigor with realistic interaction scenarios—making it an indispensable tool for understanding model capabilities in high-stakes legal and instructional domains.
To reproduce the results, you can use this repository: https://github.com/lmarena/arena-hard-auto
### 2. 🔄 **EvalMix**
EvalMix is a comprehensive evaluation pipeline designed to assess the performance of language models across multiple dimensions. This hybrid evaluation tool automatically analyzes model outputs, computes various semantic, LLM-based, and lexical metrics, and visualizes the results. EvalMix offers the following features:
### Comprehensive Evaluation Metrics
* **LLM-as-a-Judge**: Uses large language models—primarily GPT variants—to evaluate the accuracy, coherence, and relevance of generated responses.
* **Lexical Metrics**: Calculates traditional NLP metrics such as BLEU, ROUGE-1, ROUGE-2, and ROUGE-L, along with modern metrics like BERTScore (precision, recall, F1) and cosine similarity.
* **Comparative Analysis**: Enables performance comparison between multiple models on the same dataset.
* **Cosine Similarity (Turkish)**: Assesses performance in the Turkish language using Turkish-specific embedding models.
* **Cosine Similarity (Multilingual)**: Measures multilingual performance using language-agnostic embeddings.
### Generation Configuration for Evaluation
The following configuration parameters are used for model generation during evaluation:
```json
{
"num_samples": 1100,
"random_seed": 42,
"temperature": 0.0,
"max_completion_tokens": 1024
}
```
### 3. ⚡ Light-Eval
LightEval is a fast and modular framework designed to evaluate Large Language Models (LLMs) across a diverse range of tasks. It provides a comprehensive performance analysis by benchmarking models on academic, logical, scientific, and mathematical reasoning challenges.
**Evaluation Tasks and Objectives**
LightEval assesses model capabilities using the following six core tasks:
- MMLU (5-shot): Evaluates general knowledge and reasoning skills across academic and professional disciplines.(Proffesional-law task only)
- TruthfulQA (0-shot): Measures the model's ability to generate accurate and truthful responses.
- Winogrande (5-shot): Tests commonsense reasoning and logical inference abilities.
- Hellaswag (10-shot): Assesses the coherence and logical consistency of model predictions based on contextual cues.
- GSM8k (5-shot): Evaluates step-by-step mathematical reasoning and problem-solving capabilities.
- ARC (25-shot): Tests scientific reasoning and the ability to solve science-based problems.
**Overall Score Calculation**
The overall performance score of a model is computed using the average of the six evaluation tasks:
LightEval Overall Score = (MMLU_proffesional_law + TruthfulQA + Winogrande + Hellaswag + GSM8k + ARC) / 6
To reproduce the results, you can use this repository: https://github.com/huggingface/lighteval
### 4. 🐍 Snake-Eval
An evaluation framework where models play the classic Snake game, competing to collect apples while avoiding collisions.
Starting from random positions, models must guide their snakes using step-by-step reasoning, with performance measured as an Elo rating.
This tests problem-solving ability, spatial awareness, and logical thinking in a challenging environment.
**Sample Prompt:**
```
You are controlling a snake in a multi-apple Snake game. The board size is 10x10.
Normal X,Y coordinates are used. Coordinates range from (0,0) at bottom left to (9,9) at top right.
Apples at: (9, 6), (0, 2), (5, 9), (1, 7), (9, 7)
Your snake ID: 1 which is currently positioned at (5, 1)
Enemy snakes positions:
* Snake #2 is at position (7, 1) with body at []
Board state:
9 . . . . . A . . . .
8 . . . . . . . . . .
7 . A . . . . . . . A
6 . . . . . . . . . A
5 . . . . . . . . . .
4 . . . . . . . . . .
3 . . . . . . . . . .
2 A . . . . . . . . .
1 . . . . . 1 . 2 . .
0 . . . . . . . . . .
0 1 2 3 4 5 6 7 8 9
--Your last move information:--
Direction: LEFT
Rationale: I'm noticing that (0,2) is the closest apple from our head at (6,1).
Moving LEFT starts guiding us toward this apple while avoiding the enemy snake.
Strategy: Continue left and then maneuver upward to reach the apple at (0,2).
--End of your last move information.--
Rules:<
1) If you move onto an apple, you grow and gain 1 point.
2) If you hit a wall, another snake, or yourself, you die.
3) The goal is to have the most points by the end.
Decreasing x coordinate: LEFT, increasing x coordinate: RIGHT
Decreasing y coordinate: DOWN, increasing y coordinate: UP
Provide your reasoning and end with: UP, DOWN, LEFT, or RIGHT.
```
To reproduce the results, you can use this repository: https://github.com/gkamradt/SnakeBench
### 5. 📚 Retrieval
An evaluation system designed to assess Retrieval-Augmented Generation (RAG) capabilities. It measures how well models can:
- Retrieve relevant information from a knowledge base
- Generate accurate and contextually appropriate responses
- Maintain coherence between retrieved information and generated text
- Handle real-world information retrieval scenarios
**Retrieval Metrics**
- **RAG Success Rate**: Percentage of successful retrievals
- **Maximum Correct References**: Upper limit for correct retrievals per query
- **Hallucinated References**: Number of irrelevant documents retrieved
- **Missed References**: Number of relevant documents not retrieved
**LLM Judge Evaluation Metrics**
- **Legal Reasoning**: Assesses the model's ability to understand and apply legal concepts
- **Factual Legal Accuracy**: Measures accuracy of legal facts and references
- **Clarity & Precision**: Evaluates clarity and precision of responses
- **Factual Reliability**: Checks for biases and factual accuracy
- **Fluency**: Assesses language fluency and coherence
- **Relevance**: Measures response relevance to the query
- **Content Safety**: Evaluates content safety and appropriateness
**Judge Model**: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
**RAG Score Calculation**
The RAG Score is a comprehensive metric that combines multiple performance indicators using dynamic normalization across all models. The formula weights different aspects of retrieval performance:
**Formula Components:**
- **RAG Success Rate** (0.9 weight): Direct percentage of successful retrievals (higher is better)
- **Normalized False Positives** (0.9 weight): Hallucinated references, min-max normalized (lower is better)
- **Normalized Max Correct References** (0.1 weight): Maximum correct retrievals, min-max normalized (higher is better)
- **Normalized Missed References** (0.1 weight): Relevant documents not retrieved, min-max normalized (lower is better)
**Final Score Formula:**
```
RAG Score = (0.9 × RAG_success_rate + 0.9 × norm_false_positives +
0.1 × norm_max_correct + 0.1 × norm_missed_refs) ÷ 2.0
```
### 6. 👥 Human Arena
Human Arena is a community-driven evaluation platform where language models are compared through human preferences and voting. This evaluation method captures real-world user preferences and provides insights into model performance from a human perspective.
**Evaluation Methodology**
- **Head-to-Head Comparisons**: Models are presented with the same prompts and their responses are compared by human evaluators
- **ELO Rating System**: Similar to chess rankings, models gain or lose rating points based on wins, losses, and ties against other models
- **Community Voting**: Real users vote on which model responses they prefer, ensuring diverse evaluation perspectives
- **Blind Evaluation**: Evaluators see responses without knowing which model generated them, reducing bias
**Key Metrics**
- **ELO Rating**: Overall skill level based on tournament-style matchups (higher is better)
- **Win Rate**: Percentage of head-to-head victories against other models
- **Wins/Losses/Ties**: Direct comparison statistics showing model performance
- **Total Games**: Number of evaluation rounds completed
- **Votes**: Community engagement and evaluation volume
- **Provider & Technical Details**: Infrastructure and model configuration information
**Evaluation Criteria**
Human evaluators consider multiple factors when comparing model responses:
- **Response Quality**: Accuracy, completeness, and relevance of answers
- **Communication Style**: Clarity, coherence, and appropriateness of language
- **Helpfulness**: How well the response addresses the user's needs
- **Safety & Ethics**: Adherence to safety guidelines and ethical considerations
- **Creativity & Originality**: For tasks requiring creative or innovative thinking
Human Arena provides a complementary perspective to automated benchmarks, capturing nuanced human preferences that traditional metrics might miss. This evaluation is particularly valuable for understanding how models perform in real-world conversational scenarios.
"""
EVALUATION_QUEUE_TEXT = """
<h2 align="center">Model Evaluation Steps</h2>
### Evaluation Process:
1. **Select Evaluation Type**
- Choose the category you want to evaluate from the dropdown menu above
- Each category measures different capabilities (Arena, Hybrid-Benchmark, LM-Harness, etc.)
2. **Enter Parameters**
- Input required parameters such as model name, dataset, number of samples
- Adjust generation parameters like temperature and maximum tokens
- Optionally enable MLflow or Weights & Biases integration
3. **Start Evaluation**
- Click the "Submit Eval" button to start the evaluation
- Your submission will be added to the "Pending Evaluation Queue"
4. **Monitor Evaluation**
- When the evaluation starts, it will appear in the "Running Evaluation Queue"
- Once completed, it will move to the "Finished Evaluations" section
5. **View Results**
- After the evaluation is complete, results will be displayed in the "LLM Benchmark" tab
- You can compare your model's performance with other models
### Important Limitations:
- The model repository must be a maximum of 750 MB in size
- For trained adapters, the maximum LoRA rank must be 32
"""
CITATION_BUTTON_LABEL = "Cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{newmind-mezura,
author = {Newmind AI},
title = {Newmind AI LLM Evaluation Leaderboard},
year = {2025},
publisher = {Newmind AI},
howpublished = "\url{https://huggingface.co/spaces/newmindai/Mezura}"
}
"""
|