from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


class Tasks(Enum):
    accuracy = Task("OpenAI-Scores", "accuracy", "Accuracy")
    relevance = Task("OpenAI-Scores", "relevance", "Relevance")
    coherence = Task("OpenAI-Scores", "coherence", "Coherence")

    bleu_mean = Task("BLEU", "mean", "BLEU Mean")

    rouge1_mean = Task("ROUGE", "rouge1_mean", "ROUGE-1")
    rouge2_mean = Task("ROUGE", "rouge2_mean", "ROUGE-2")
    rougeL_mean = Task("ROUGE", "rougeL_mean", "ROUGE-L")

    bert_precision = Task("BERT-Score", "precision_mean", "BERT Precision")
    bert_recall = Task("BERT-Score", "recall_mean", "BERT Recall")
    bert_f1 = Task("BERT-Score", "f1_mean", "BERT F1")

    cosine_turkish = Task("Cosine-Similarity", "turkish_mean", "Turkish Similarity")
    cosine_multilingual = Task("Cosine-Similarity", "multilingual_mean", "Multilingual Similarity")

    total_samples = Task("Evaluation-Metrics", "total_samples", "Total Samples")
    avg_input_length = Task("Evaluation-Metrics", "avg_input_length", "Avg Input Length")
    avg_prediction_length = Task("Evaluation-Metrics", "avg_prediction_length", "Avg Prediction Length")
    avg_reference_length = Task("Evaluation-Metrics", "avg_reference_length", "Avg Reference Length")

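# Illustrative usage (a sketch, not part of this module's public interface):
# a leaderboard UI could derive its column headers and metric keys directly
# from the enum above, e.g.
#
#     display_columns = [task.value.col_name for task in Tasks]
#     metric_keys = [(task.value.benchmark, task.value.metric) for task in Tasks]
#
# The names display_columns and metric_keys are hypothetical examples.
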
TITLE = """<h1 align="center" id="space-title">Newmind AI LLM Evaluation Leaderboard</h1>"""

INTRODUCTION_TEXT = """
Evaluate your model's performance in the following categories:

1. ⚔️ **Auto Arena** - Tournament-style evaluation where models are directly compared and ranked using an ELO rating system.

2. 👥 **Human Arena** - Comparative evaluation based on human preferences, assessed by a reviewer group.

3. 📚 **Retrieval** - Evaluation focused on information retrieval and text generation quality for real-world applications.

4. ⚡ **Light Eval** - Fast and efficient model evaluation framework for quick testing.

5. 🔄 **EvalMix** - Multi-dimensional evaluation including lexical accuracy and semantic coherence.

6. 🐍 **Snake Bench** - Specialized evaluation measuring step-by-step problem solving and complex reasoning abilities.

7. 🧩 **Structured Outputs** - Coming soon!

Evaluate your model in any or all of these categories to discover its capabilities and areas of excellence.

For any questions, please contact us at info@newmind.ai
"""

LLM_BENCHMARKS_TEXT = """
<h2 align="center">Evaluation Categories</h2>

### 1. ⚔️ Arena-Hard-Auto: Competitive Benchmarking at Scale
Arena-Hard-Auto is a cutting-edge automatic evaluation framework tailored for instruction-tuned Large Language Models (LLMs). Leveraging a tournament-style evaluation methodology, it pits models against each other in head-to-head matchups, with performance rankings determined via the Elo rating system, a method shown to align closely with human judgment in Chatbot Arena benchmarks.
This evaluation suite is grounded in real-world use cases, benchmarking models across 11 diverse legal tasks built on an extensive set of Turkish legal question-answer pairs.

**Key Evaluation Pillars**
- Automated Judging
  Relies on specialized judge models to assess and determine the winner in each model-versus-model comparison.
  Includes dynamic system prompt adaptation to ensure context-aware evaluation based on the specific domain of the query.
- Win Probability Estimation
  Computes the probability of victory for each model in a matchup using a logistic regression model, offering a probabilistic understanding of comparative strength (see the illustrative sketch below).
- Skill Rating (Elo Score)
  Utilizes the Elo rating system to provide a robust measurement of each model's skill level relative to its competitors, ensuring a competitive and evolving leaderboard.
- Win Rate
  Measures a model's dominance by calculating the proportion of head-to-head victories, serving as a direct indicator of real-world performance across the benchmark suite.

Arena-Hard-Auto offers a fast, scalable, and reliable method to benchmark LLMs, combining quantitative rigor with realistic interaction scenarios, making it an indispensable tool for understanding model capabilities in high-stakes legal and instructional domains.

To reproduce the results, you can use this repository: https://github.com/lmarena/arena-hard-auto

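To make the Elo and win-probability mechanics concrete, here is a minimal, illustrative sketch. This is not the Arena-Hard-Auto implementation; the 400-point logistic scale and the K-factor of 32 are standard chess-style conventions assumed here for illustration only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    # Logistic win probability of model A over model B on the usual 400-point Elo scale.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    # score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Example: a 1600-rated model beats a 1500-rated model.
print(expected_score(1600, 1500))   # ~0.64 win probability
print(update_elo(1600, 1500, 1.0))  # winner gains ~11.5 points, loser loses the same
```
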
### 2. 🔄 **EvalMix**
EvalMix is a comprehensive evaluation pipeline designed to assess the performance of language models across multiple dimensions. This hybrid evaluation tool automatically analyzes model outputs, computes various semantic, LLM-based, and lexical metrics, and visualizes the results. EvalMix offers the following features:

### Comprehensive Evaluation Metrics

* **LLM-as-a-Judge**: Uses large language models (primarily GPT variants) to evaluate the accuracy, coherence, and relevance of generated responses.
* **Lexical Metrics**: Calculates traditional NLP metrics such as BLEU, ROUGE-1, ROUGE-2, and ROUGE-L, along with modern metrics like BERTScore (precision, recall, F1) and cosine similarity (a minimal computation sketch follows this list).
* **Comparative Analysis**: Enables performance comparison between multiple models on the same dataset.
* **Cosine Similarity (Turkish)**: Assesses performance in the Turkish language using Turkish-specific embedding models.
* **Cosine Similarity (Multilingual)**: Measures multilingual performance using language-agnostic embeddings.

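The sketch below shows one way these lexical metrics could be computed for a single prediction-reference pair using common open-source libraries (sacrebleu, rouge-score, bert-score, sentence-transformers). It is illustrative only; EvalMix's actual implementation, aggregation, and embedding model choices may differ, and the model name used here is an assumption.

```python
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util

prediction = "Mahkeme davaya gelecek hafta bakacak."
reference = "Mahkeme davayı önümüzdeki hafta görecek."

# BLEU (corpus-level, 0-100 scale)
bleu = sacrebleu.corpus_bleu([prediction], [[reference]]).score

# ROUGE-1 / ROUGE-2 / ROUGE-L F-measures (stemmer disabled because it is English-only)
rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
rouge_scores = {name: s.fmeasure for name, s in rouge.score(reference, prediction).items()}

# BERTScore precision / recall / F1 with the default multilingual model for Turkish
p, r, f1 = bert_score([prediction], [reference], lang="tr")

# Cosine similarity with a multilingual sentence embedding model (model name is an assumption)
embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
emb = embedder.encode([prediction, reference], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

print(bleu, rouge_scores, p.item(), r.item(), f1.item(), cosine)
```
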
### Generation Configuration for Evaluation

The following configuration parameters are used for model generation during evaluation:

```json
{
    "num_samples": 1100,
    "random_seed": 42,
    "temperature": 0.0,
    "max_completion_tokens": 1024
}
```

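As a rough illustration of how these parameters might drive generation, the sketch below maps them onto an OpenAI-compatible chat completion call. The client setup, placeholder model name, and inline dataset are assumptions for illustration, not the evaluation pipeline's actual code.

```python
import random
from openai import OpenAI  # works with any OpenAI-compatible endpoint

config = {"num_samples": 1100, "random_seed": 42, "temperature": 0.0, "max_completion_tokens": 1024}

# Hypothetical evaluation prompts; the real pipeline loads its own dataset.
dataset = ["Soru 1 ...", "Soru 2 ...", "Soru 3 ..."]

random.seed(config["random_seed"])  # reproducible item selection
samples = random.sample(dataset, min(config["num_samples"], len(dataset)))

client = OpenAI()  # assumes an API key / base_url is configured in the environment
for prompt in samples:
    response = client.chat.completions.create(
        model="your-model-name",  # placeholder identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=config["temperature"],
        max_completion_tokens=config["max_completion_tokens"],
    )
    print(response.choices[0].message.content)
```
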
### 3. ⚡ Light-Eval
LightEval is a fast and modular framework designed to evaluate Large Language Models (LLMs) across a diverse range of tasks. It provides a comprehensive performance analysis by benchmarking models on academic, logical, scientific, and mathematical reasoning challenges.

**Evaluation Tasks and Objectives**
LightEval assesses model capabilities using the following six core tasks:
- MMLU (5-shot): Evaluates general knowledge and reasoning skills across academic and professional disciplines (professional_law task only).
- TruthfulQA (0-shot): Measures the model's ability to generate accurate and truthful responses.
- Winogrande (5-shot): Tests commonsense reasoning and logical inference abilities.
- HellaSwag (10-shot): Assesses the coherence and logical consistency of model predictions based on contextual cues.
- GSM8K (5-shot): Evaluates step-by-step mathematical reasoning and problem-solving capabilities.
- ARC (25-shot): Tests scientific reasoning and the ability to solve science-based problems.

**Overall Score Calculation**
The overall performance score of a model is computed as the average of the six evaluation tasks (see the sketch below):

LightEval Overall Score = (MMLU_professional_law + TruthfulQA + Winogrande + HellaSwag + GSM8K + ARC) / 6

To reproduce the results, you can use this repository: https://github.com/huggingface/lighteval

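A minimal sketch of the averaging step, using made-up per-task scores purely for illustration:

```python
# Hypothetical per-task accuracies (fractions between 0 and 1), for illustration only.
task_scores = {
    "mmlu_professional_law": 0.61,
    "truthfulqa": 0.48,
    "winogrande": 0.74,
    "hellaswag": 0.80,
    "gsm8k": 0.55,
    "arc": 0.70,
}

lighteval_overall = sum(task_scores.values()) / len(task_scores)
print(round(lighteval_overall, 4))  # 0.6467
```
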
### 4. 🐍 Snake-Eval
An evaluation framework where models play the classic Snake game, competing to collect apples while avoiding collisions. Starting from random positions, models must guide their snakes using step-by-step reasoning, with performance measured as an Elo rating. This tests problem-solving ability, spatial awareness, and logical thinking in a challenging environment.

**Sample Prompt:**
```
You are controlling a snake in a multi-apple Snake game. The board size is 10x10.
Normal X,Y coordinates are used. Coordinates range from (0,0) at bottom left to (9,9) at top right.

Apples at: (9, 6), (0, 2), (5, 9), (1, 7), (9, 7)

Your snake ID: 1 which is currently positioned at (5, 1)

Enemy snakes positions:
* Snake #2 is at position (7, 1) with body at []

Board state:
9 . . . . . A . . . .
8 . . . . . . . . . .
7 . A . . . . . . . A
6 . . . . . . . . . A
5 . . . . . . . . . .
4 . . . . . . . . . .
3 . . . . . . . . . .
2 A . . . . . . . . .
1 . . . . . 1 . 2 . .
0 . . . . . . . . . .
  0 1 2 3 4 5 6 7 8 9

--Your last move information:--
Direction: LEFT
Rationale: I'm noticing that (0,2) is the closest apple from our head at (6,1).
Moving LEFT starts guiding us toward this apple while avoiding the enemy snake.
Strategy: Continue left and then maneuver upward to reach the apple at (0,2).
--End of your last move information.--

Rules:
1) If you move onto an apple, you grow and gain 1 point.
2) If you hit a wall, another snake, or yourself, you die.
3) The goal is to have the most points by the end.

Decreasing x coordinate: LEFT, increasing x coordinate: RIGHT
Decreasing y coordinate: DOWN, increasing y coordinate: UP

Provide your reasoning and end with: UP, DOWN, LEFT, or RIGHT.
```

To reproduce the results, you can use this repository: https://github.com/gkamradt/SnakeBench

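Since the prompt asks the model to end its reply with one of UP, DOWN, LEFT, or RIGHT, a harness has to pull that final decision out of the free-form reasoning. Below is a minimal sketch of one way to do this; it is not SnakeBench's actual parser.

```python
VALID_MOVES = ("UP", "DOWN", "LEFT", "RIGHT")


def extract_move(reply: str) -> str | None:
    # Scan tokens from the end and return the last direction word mentioned.
    tokens = reply.upper().replace(",", " ").replace(".", " ").split()
    for token in reversed(tokens):
        if token in VALID_MOVES:
            return token
    return None  # no valid move found; a harness might treat this as a forfeit


reply = "The apple at (0,2) is still to my left, so I will continue moving LEFT."
print(extract_move(reply))  # LEFT
```
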
### 5. 📚 Retrieval
An evaluation system designed to assess Retrieval-Augmented Generation (RAG) capabilities. It measures how well models can:
- Retrieve relevant information from a knowledge base
- Generate accurate and contextually appropriate responses
- Maintain coherence between retrieved information and generated text
- Handle real-world information retrieval scenarios

**Retrieval Metrics**
- **RAG Success Rate**: Percentage of successful retrievals
- **Maximum Correct References**: Upper limit for correct retrievals per query
- **Hallucinated References**: Number of irrelevant documents retrieved
- **Missed References**: Number of relevant documents not retrieved (see the sketch after this list for how these counts can be derived)

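The sketch below derives the hallucinated and missed reference counts for a single query from set arithmetic over document identifiers. It illustrates the definitions above and is not the leaderboard's exact implementation; the document IDs are hypothetical.

```python
# Hypothetical document IDs for one query.
retrieved = {"doc_12", "doc_34", "doc_99"}   # what the RAG system returned
relevant = {"doc_12", "doc_56"}              # gold references for the query

hallucinated_refs = len(retrieved - relevant)  # returned but not relevant -> 2
missed_refs = len(relevant - retrieved)        # relevant but not returned -> 1
correct_refs = len(retrieved & relevant)       # correctly retrieved       -> 1

print(hallucinated_refs, missed_refs, correct_refs)
```
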
**LLM Judge Evaluation Metrics**
- **Legal Reasoning**: Assesses the model's ability to understand and apply legal concepts
- **Factual Legal Accuracy**: Measures accuracy of legal facts and references
- **Clarity & Precision**: Evaluates clarity and precision of responses
- **Factual Reliability**: Checks for biases and factual accuracy
- **Fluency**: Assesses language fluency and coherence
- **Relevance**: Measures response relevance to the query
- **Content Safety**: Evaluates content safety and appropriateness

**Judge Model**: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1

**RAG Score Calculation**
The RAG Score is a comprehensive metric that combines multiple performance indicators using dynamic normalization across all models. The formula weights different aspects of retrieval performance:

**Formula Components:**
- **RAG Success Rate** (0.9 weight): Direct percentage of successful retrievals (higher is better)
- **Normalized False Positives** (0.9 weight): Hallucinated references, min-max normalized (lower is better)
- **Normalized Max Correct References** (0.1 weight): Maximum correct retrievals, min-max normalized (higher is better)
- **Normalized Missed References** (0.1 weight): Relevant documents not retrieved, min-max normalized (lower is better)

**Final Score Formula:**
```
RAG Score = (0.9 × RAG_success_rate + 0.9 × norm_false_positives +
             0.1 × norm_max_correct + 0.1 × norm_missed_refs) / 2.0
```

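Below is an illustrative sketch of how such a score could be computed across a set of models. It assumes the success rate is expressed as a fraction in [0, 1] and that the min-max normalization for the lower-is-better components (false positives, missed references) is inverted so that a higher normalized value always means better performance; the leaderboard's actual code may differ, and all numbers are made up.

```python
def min_max(values, invert=False):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]  # all models identical on this component
    normed = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - n for n in normed] if invert else normed


# Hypothetical per-model raw statistics (illustration only).
models = ["model-a", "model-b", "model-c"]
success_rate = [0.82, 0.75, 0.90]      # assumed fractions in [0, 1]
false_positives = [3, 7, 1]            # hallucinated references (lower is better)
max_correct = [5, 4, 6]                # maximum correct references (higher is better)
missed_refs = [2, 5, 1]                # missed references (lower is better)

norm_fp = min_max(false_positives, invert=True)
norm_mc = min_max(max_correct)
norm_mr = min_max(missed_refs, invert=True)

for i, name in enumerate(models):
    rag_score = (0.9 * success_rate[i] + 0.9 * norm_fp[i] +
                 0.1 * norm_mc[i] + 0.1 * norm_mr[i]) / 2.0
    print(name, round(rag_score, 3))
```
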
### 6. 👥 Human Arena
Human Arena is a community-driven evaluation platform where language models are compared through human preferences and voting. This evaluation method captures real-world user preferences and provides insights into model performance from a human perspective.

**Evaluation Methodology**
- **Head-to-Head Comparisons**: Models are presented with the same prompts and their responses are compared by human evaluators
- **ELO Rating System**: Similar to chess rankings, models gain or lose rating points based on wins, losses, and ties against other models
- **Community Voting**: Real users vote on which model responses they prefer, ensuring diverse evaluation perspectives
- **Blind Evaluation**: Evaluators see responses without knowing which model generated them, reducing bias

**Key Metrics**
- **ELO Rating**: Overall skill level based on tournament-style matchups (higher is better)
- **Win Rate**: Percentage of head-to-head victories against other models
- **Wins/Losses/Ties**: Direct comparison statistics showing model performance
- **Total Games**: Number of evaluation rounds completed
- **Votes**: Community engagement and evaluation volume
- **Provider & Technical Details**: Infrastructure and model configuration information

**Evaluation Criteria**
Human evaluators consider multiple factors when comparing model responses:
- **Response Quality**: Accuracy, completeness, and relevance of answers
- **Communication Style**: Clarity, coherence, and appropriateness of language
- **Helpfulness**: How well the response addresses the user's needs
- **Safety & Ethics**: Adherence to safety guidelines and ethical considerations
- **Creativity & Originality**: For tasks requiring creative or innovative thinking

Human Arena provides a complementary perspective to automated benchmarks, capturing nuanced human preferences that traditional metrics might miss. This evaluation is particularly valuable for understanding how models perform in real-world conversational scenarios.
"""

EVALUATION_QUEUE_TEXT = """
<h2 align="center">Model Evaluation Steps</h2>

### Evaluation Process:

1. **Select Evaluation Type**
   - Choose the category you want to evaluate from the dropdown menu above
   - Each category measures different capabilities (Arena, Hybrid-Benchmark, LM-Harness, etc.)

2. **Enter Parameters**
   - Input required parameters such as model name, dataset, and number of samples
   - Adjust generation parameters like temperature and maximum tokens
   - Optionally enable MLflow or Weights & Biases integration

3. **Start Evaluation**
   - Click the "Submit Eval" button to start the evaluation
   - Your submission will be added to the "Pending Evaluation Queue"

4. **Monitor Evaluation**
   - When the evaluation starts, it will appear in the "Running Evaluation Queue"
   - Once completed, it will move to the "Finished Evaluations" section

5. **View Results**
   - After the evaluation is complete, results will be displayed in the "LLM Benchmark" tab
   - You can compare your model's performance with other models

### Important Limitations:
- The model repository must be at most 750 MB in size
- For trained adapters, the LoRA rank must be at most 32
"""

CITATION_BUTTON_LABEL = "Cite these results"

CITATION_BUTTON_TEXT = r"""
@misc{newmind-mezura,
    author       = {Newmind AI},
    title        = {Newmind AI LLM Evaluation Leaderboard},
    year         = {2025},
    publisher    = {Newmind AI},
    howpublished = "\url{https://huggingface.co/spaces/newmindai/Mezura}"
}
"""