OpenEvals
AI & ML interests
LLM evaluation
Hi! Welcome to the org page of the Evaluation team at Hugging Face. We want to support the community in building and sharing quality evaluations, for reproducible and fair model comparisons, to cut through release hype and better understand actual model capabilities.
We're behind the:
- lighteval LLM evaluation suite, fast and filled with the SOTA benchmarks you might want (a minimal usage sketch follows this list)
- evaluation guidebook, your reference for LLM evals
- leaderboards on the hub initiative, to encourage people to build more leaderboards in the open for more reproducible evaluation. You'll find documentation here to build your own, and you can look for the best leaderboard for your use case here!
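If you want to give lighteval a spin, the sketch below launches an evaluation from Python by shelling out to its CLI. The backend name, model argument string, and task string are assumptions based on recent lighteval releases, and the exact syntax varies between versions, so check `lighteval --help` and the lighteval documentation for your install.

```python
# Minimal sketch: run a lighteval evaluation by invoking its CLI from Python.
# The backend, model argument string, and task string below are assumptions
# based on recent lighteval releases; adjust them to your installed version.
import subprocess

subprocess.run(
    [
        "lighteval",
        "accelerate",                         # evaluation backend (assumed)
        "model_name=openai-community/gpt2",   # model arguments (assumed format)
        "leaderboard|truthfulqa:mc|0|0",      # suite|task|few-shot|truncation (assumed format)
    ],
    check=True,  # raise if the evaluation command fails
)
```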
Our archived projects:
- Open LLM Leaderboard (over 11K models evaluated since 2023)
We're not behind the evaluate metrics guide, but if you want to understand metrics better, we really recommend checking it out!
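As a taste of what computing a metric looks like in practice, here is a small sketch using the evaluate library; the choice of the accuracy metric and the toy predictions/references are ours, purely for illustration.

```python
# Small sketch: loading and computing a metric with the evaluate library.
# The "accuracy" metric and the toy labels below are illustrative only.
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(
    predictions=[0, 1, 1, 0],  # model outputs (toy data)
    references=[0, 1, 0, 0],   # gold labels (toy data)
)
print(result)  # {'accuracy': 0.75}
```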
Papers
- GAIA: a benchmark for General AI Assistants (Paper • 2311.12983 • Published • 241)
- Zephyr: Direct Distillation of LM Alignment (Paper • 2310.16944 • Published • 122)
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model (Paper • 2502.02737 • Published • 249)
- Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation (Paper • 2412.03304 • Published • 21)
Pinned spaces
- Find a leaderboard (118 likes): Explore and discover all leaderboards from the HF community
- YourBench (42 likes): Generate custom evaluations from your data easily!
- Example Leaderboard Template (16 likes): Duplicate this leaderboard to initialize your own!
- Run your LLM evaluations on the hub: Generate a command to run model evaluations
Spaces (7)
- Benchmark Finder: A space to view and inspect all the tasks in lighteval
- Find a leaderboard: Explore and discover all leaderboards from the HF community
- Aa Omniscience: Display and inspect log files
- InferenceProviderTestingBackend: Launch and monitor model evaluation jobs
- Evals: Run your LLM evaluations on the hub (generate a command to run model evaluations)