from datetime import datetime

import pytz

ABOUT_TEXT_V2 = """
The RewardBench 2 evaluation dataset is the new version of RewardBench that is based on unseen human data and designed to be substantially more difficult! RewardBench 2 evaluates capabilities of reward models over the following categories:
1. **Factuality** (*NEW!*): Tests the ability of RMs to detect hallucinations and other basic errors in completions.
2. **Precise Instruction Following** (*NEW!*): Tests the ability of RMs to judge whether text follows precise instructions, such as "Answer without the letter u".
3. **Math**: Tests RMs' abilities at math, on open-ended human prompts ranging from middle school physics and geometry to college-level chemistry, calculus, combinatorics, and more.
4. **Safety**: Tests RMs' abilities to correctly comply with or refuse prompts related to harmful use cases as well as general compliance behaviors.
5. **Focus**: Tests RMs' ability to detect high-quality, on-topic answers to general user queries.
6. **Ties** (*NEW!*): This new type of subset tests the robustness of RMs in domains with many possible similar answers. For example, the question "Name a color of the rainbow" has seven possible correct answers and infinitely many incorrect ones.
The RewardBench 2 leaderboard averages over these six subsets.
For the first five categories, scoring counts a success when the score of the prompt-chosen pair is greater than the scores of all *three* prompt-rejected pairs.
The "Ties" score is a weighted combination of accuracy (measured as *all* valid correct answers being scored higher than *all* incorrect answers) and whether the reward margin between correct and incorrect answers exceeds the margin between the highest- and lowest-scored correct responses. This metric rewards not only correctness, but also a model's ability to prioritize correct answers over incorrect ones more strongly than it distinguishes between equally valid correct responses.
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/reward-bench/main-fig-hor.png" alt="RewardBench 2 Flow" width="800" style="margin-left:auto; margin-right:auto; display:block"/>
## Dataset Construction Summary
| Domain | Count | Prompt Source | Method of Generating Completions | Completion Filtering |
|--------|-------|---------------|----------------------------------|---------------------|
| Factuality | 475 | Human | Both | Multi-LM-as-a-judge |
| Precise IF | 160 | Human | Natural | Verifier functions |
| Math | 183 | Human | Natural | Majority voting |
| Safety | 450 | CoCoNot | Both | LM-as-a-judge & rubrics |
| Focus | 495 | Human | System Prompt Variation | N/A |
| Ties | 102 | Manual | System Prompt Variation | Manual verification |
## Dataset Details
Each sample in the dataset has the following fields.
Note, the dataset is single-turn:
* `prompt` (`str`): the instruction given in the various test sets.
* `chosen` (`list[str]`): the chosen response(s) (1 chosen response for all subsets but Ties)
* `rejected` (`list[str]`): the rejected responses (3 rejected responses for all subsets but Ties)
* `num_correct` (`int`): the number of chosen responses
* `num_rejected` (`int`): the number of rejected responses
* `total_completions` (`int`): the total number of responses
* `models` (`list[str]`): a list of the models that generated the chosen and rejected responses, respectively
* `subset` (`str`): the subset the datapoint is part of.
* `id` (`int`): an incrementing id for every prompt in the benchmark.
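Put together, a single row has roughly the following shape (the values below are illustrative placeholders, not a real datapoint):
```
# Illustrative datapoint shape (placeholder values only, not real data).
example = {
    "prompt": "...",
    "chosen": ["..."],                  # 1 chosen response (more for Ties)
    "rejected": ["...", "...", "..."],  # 3 rejected responses (varies for Ties)
    "num_correct": 1,
    "num_rejected": 3,
    "total_completions": 4,
    "models": ["model-a", "model-b", "model-c", "model-d"],
    "subset": "Factuality",
    "id": 0,
}
```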
To select a specific subset, use HuggingFace Datasets' `.filter` functionality:
```
from datasets import load_dataset

dataset = load_dataset("allenai/reward-bench-2")
dataset = dataset.filter(lambda ex: ex["subset"] == "Factuality")
```
## Models Used
We generated completions from the following models:
- [Mistral 7B Instruct v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) (Apache 2.0)
- [Tulu 3 8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) (Llama 3.1 Community License Agreement)
- [Tulu 3 70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) (Llama 3.1 Community License Agreement)
- [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (Llama 3.1 Community License Agreement)
- [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (Llama 3.1 Community License Agreement)
- [Llama 3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) (Llama 3.2 Community License Agreement)
- [Llama 2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) (Llama 2 Community License Agreement)
- [Tulu 2 70B](https://huggingface.co/allenai/tulu-2-dpo-70b) (Ai2 ImpACT Low Risk License)
- [Qwen2.5 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) (Qwen License Agreement)
- [Qwen2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) (Apache 2.0)
- [Qwen2.5 14B Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (Apache 2.0)
- [Qwen2.5 0.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) (Apache 2.0)
- [Qwen2.5 Math 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct) (Qwen License Agreement)
- [Qwen2.5 Math 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct) (Apache 2.0)
- [Deepseek Math 7B RL](https://huggingface.co/deepseek-ai/deepseek-math-7b-rl) (This model is licensed under the Deepseek License. Any use of the outputs from this model must be in accordance with the use restrictions in the [Deepseek License](https://github.com/deepseek-ai/DeepSeek-Math/blob/main/LICENSE-MODEL).)
- [OLMoE 1B 7B 0924 Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct) (Apache 2.0)
- [Dolphin 2.0 Mistral 7b](https://huggingface.co/cognitivecomputations/dolphin-2.0-mistral-7b) (Apache 2.0)
- [Zephyr 7b Beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) (MIT License)
- GPT-4o (Outputs produced by GPT-4o are subject to OpenAI's [terms of use](https://openai.com/policies/row-terms-of-use/))
- Claude 3.5 Sonnet (Outputs produced by Claude are subject to Anthropic's [terms of service](https://www.anthropic.com/legal/consumer-terms) and [usage policy](https://www.anthropic.com/legal/aup))
## License
This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use). This dataset includes output data generated from third party models that are subject to separate terms governing their use.
## Trained Reward Models
We also trained and released several reward models. Check out the [RewardBench 2 Collection](https://huggingface.co/collections/allenai/reward-bench-2-683d2612a4b3e38a3e53bb51) to use them!
"""
ABOUT_TEXT_V1 = """ | |
We compute the win percentage for a reward model on hand curated chosen-rejected pairs for each prompt. | |
A win is when the score for the chosen response is higher than the score for the rejected response. | |
Note: Models with (*) after the model name are independently submitted model scores which have not been verified by the RewardBench team. | |
## Overview
We average over 4 core sections (per-prompt weighting):
1. **Chat**: Includes the easy chat subsets (alpacaeval-easy, alpacaeval-length, alpacaeval-hard, mt-bench-easy, mt-bench-medium)
2. **Chat Hard**: Includes the hard chat subsets (mt-bench-hard, llmbar-natural, llmbar-adver-neighbor, llmbar-adver-GPTInst, llmbar-adver-GPTOut, llmbar-adver-manual)
3. **Safety**: Includes the safety subsets (refusals-dangerous, refusals-offensive, xstest-should-refuse, xstest-should-respond, do not answer)
4. **Reasoning**: Includes the code and math subsets (math-prm, hep-cpp, hep-go, hep-java, hep-js, hep-python, hep-rust)
For Reasoning, we increase the weight of the PRM-Math subset so code and math abilities are weighed equally in the final number, rather than increasing the relevance of code.
We add a final column, **Prior Sets**, which includes the test sets [anthropic_helpful](https://huggingface.co/datasets/Anthropic/hh-rlhf), [anthropic_hhh](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment), [shp](https://huggingface.co/datasets/stanfordnlp/SHP), and [summarize](https://huggingface.co/datasets/openai/summarize_from_feedback).
Prior Sets is weighted 0.5x in the final score to avoid gamification by training on the available training sets of Anthropic HH, SHP, and Summarize.
Once each section's weighted average is computed, the final RewardBench score is the average across the 5 section scores, with Prior Sets counted at 0.5x.
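A minimal sketch of one natural reading of this aggregation (not the official scoring script; section accuracies are assumed to already be fractions in [0, 1]):
```
# Final-score aggregation sketch: average the five section scores with
# Prior Sets down-weighted to 0.5x, as described above.
def rewardbench_v1_score(chat, chat_hard, safety, reasoning, prior_sets):
    scores = [chat, chat_hard, safety, reasoning, prior_sets]
    weights = [1.0, 1.0, 1.0, 1.0, 0.5]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```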
We include multiple types of reward models in this evaluation:
1. **Sequence Classifiers** (Seq. Classifier): A model, normally trained with HuggingFace AutoModelForSequenceClassification, that takes in a prompt and a response and outputs a score.
2. **Custom Classifiers**: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
3. **DPO**: Models trained with Direct Preference Optimization (DPO), with modifiers such as `-ref-free` or `-norm` changing how scores are computed. *Note*: This also includes other models trained with implicit rewards, such as those trained with [KTO](https://arxiv.org/abs/2402.01306).
4. **Random**: Random choice baseline.
5. **Generative**: Prompting fine-tuned models to choose between two answers, similar to MT Bench and AlpacaEval.
All models are evaluated in fp16 except for Starling-7B, which is evaluated in fp32.
*Note*: The reference models for DPO models (and other implicit rewards) can be found in two ways.
* Click on a specific model in results and you'll see a key `ref_model`, e.g. [Qwen](https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/eval-set/Qwen/Qwen1.5-72B-Chat.json).
* All the reference models are listed in the [evaluation configs](https://github.com/allenai/reward-bench/blob/main/scripts/configs/eval_configs.yaml).
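For reference, DPO-style models are scored with their implicit reward: a scaled difference between the log-probability of the response under the policy and under its reference model. A minimal sketch under that assumption (not the evaluation code itself; a `-ref-free` variant would drop the reference term):
```
# Implicit reward sketch for DPO-style models. The inputs are summed token
# log-probabilities of the response given the prompt, under the policy and
# under the reference model; beta only rescales scores and does not change
# rankings of chosen vs. rejected responses.
def dpo_implicit_reward(policy_logprob, ref_logprob, beta=1.0):
    return beta * (policy_logprob - ref_logprob)
```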
### Subset Details
The total number of prompts is 2985, filtered down from 5123.
| Subset | Num. Samples (Pre-filtering, post-filtering) | Description |
| :---------- | :-----: | :---------: |
| alpacaeval-easy | 805, 100 | Great model vs poor model |
| alpacaeval-length | 805, 95 | Good model vs low model, equal length |
| alpacaeval-hard | 805, 95 | Great model vs baseline model |
| mt-bench-easy | 28, 28 | MT Bench 10s vs 1s |
| mt-bench-medium | 45, 40 | MT Bench 9s vs 2-5s |
| mt-bench-hard | 45, 37 | MT Bench 7-8 vs 5-6 |
| refusals-dangerous | 505, 100 | Dangerous response vs no response |
| refusals-offensive | 704, 100 | Offensive response vs no response |
| llmbar-natural | 100 | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
| llmbar-adver-neighbor | 134 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
| llmbar-adver-GPTInst | 92 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4 generated off-topic prompt response |
| llmbar-adver-GPTOut | 47 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 responses |
| llmbar-adver-manual | 46 | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
| xstest-should-refuse | 450, 154 | False response dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
| xstest-should-respond | 450, 250 | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
| do not answer | 939, 136 | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer) |
| math-prm | 447 | Human references vs. model error from OpenAI's Let's Verify Step by Step |
| hep-cpp | 164 | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124)) |
| hep-go | 164 | Go code |
| hep-java | 164 | Java code |
| hep-js | 164 | Javascript code |
| hep-python | 164 | Python code |
| hep-rust | 164 | Rust code |
The length statistics below (mean, with std. dev. in parentheses) include the prompt.
| subset | length bias | chosen_chars | rejected_chars | chosen_tokens | rejected_tokens | chosen_unique_tokens | rejected_unique_tokens |
|-----------------------|-------------|----------------|------------------|-----------------|-------------------|------------------------|--------------------------|
| alpacaeval-easy | True | 2283 (1138) | 646 (482) | 591 (303) | 167 (139) | 253 (117) | 83 (46) |
| alpacaeval-hard | True | 1590 (769) | 526 (430) | 412 (199) | 137 (117) | 173 (67) | 71 (48) |
| alpacaeval-length | Neutral | 2001 (1137) | 2127 (1787) | 511 (283) | 597 (530) | 192 (85) | 189 (99) |
| donotanswer | False | 755 (722) | 1389 (695) | 170 (161) | 320 (164) | 104 (82) | 157 (73) |
| hep-cpp | Neutral | 709 (341) | 705 (342) | 261 (125) | 259 (125) | 100 (29) | 99 (29) |
| hep-go | Neutral | 738 (361) | 734 (361) | 266 (118) | 265 (118) | 100 (29) | 99 (29) |
| hep-java | Neutral | 821 (393) | 814 (390) | 263 (123) | 261 (122) | 102 (30) | 102 (30) |
| hep-js | Neutral | 677 (341) | 673 (339) | 251 (129) | 250 (128) | 93 (29) | 93 (29) |
| hep-python | Neutral | 618 (301) | 616 (300) | 212 (98) | 211 (98) | 86 (26) | 85 (26) |
| hep-rust | Neutral | 666 (391) | 660 (391) | 221 (132) | 219 (132) | 95 (29) | 95 (29) |
| llmbar-adver-GPTInst | False | 735 (578) | 1623 (1055) | 170 (135) | 377 (245) | 93 (59) | 179 (106) |
| llmbar-adver-GPTOut | Neutral | 378 (339) | 359 (319) | 96 (81) | 101 (94) | 60 (45) | 55 (41) |
| llmbar-adver-manual | False | 666 (584) | 1139 (866) | 160 (134) | 264 (194) | 92 (63) | 140 (90) |
| llmbar-adver-neighbor | False | 287 (297) | 712 (749) | 70 (76) | 173 (175) | 43 (31) | 91 (70) |
| llmbar-natural | Neutral | 553 (644) | 530 (597) | 139 (162) | 130 (140) | 75 (71) | 70 (62) |
| mt-bench-easy | False | 1563 (720) | 2129 (1520) | 377 (159) | 551 (415) | 166 (55) | 116 (62) |
| mt-bench-hard | False | 1225 (499) | 1471 (1016) | 284 (116) | 349 (234) | 131 (45) | 136 (58) |
| mt-bench-med | Neutral | 1558 (729) | 1733 (1312) | 377 (170) | 410 (311) | 162 (58) | 145 (88) |
| refusals-dangerous | False | 597 (81) | 1828 (547) | 131 (20) | 459 (136) | 90 (12) | 211 (50) |
| refusals-offensive | False | 365 (116) | 1092 (1146) | 82 (25) | 299 (278) | 64 (15) | 134 (101) |
| xstest-should-refuse | False | 584 (419) | 904 (493) | 129 (89) | 217 (115) | 81 (47) | 116 (53) |
| xstest-should-respond | True | 771 (420) | 466 (427) | 189 (105) | 107 (94) | 104 (48) | 67 (48) |
For more details, see the [dataset](https://huggingface.co/datasets/allenai/reward-bench).
"""
# Get Pacific time zone (handles PST/PDT automatically)
pacific_tz = pytz.timezone("America/Los_Angeles")
current_time = datetime.now(pacific_tz).strftime("%H:%M %Z, %d %b %Y")

TOP_TEXT = """# RewardBench: Evaluating Reward Models"""

CAPTION_V2 = f"""The *new version* of RewardBench that is based on unseen human data and designed to be substantially more difficult!
[Code](https://github.com/allenai/reward-bench) | [Eval. Dataset v2](https://huggingface.co/datasets/allenai/reward-bench-2) | [Results v2](https://huggingface.co/datasets/allenai/reward-bench-2-results) | [Paper](https://arxiv.org/abs/2506.01937) | Total models: {{}} | Last restart (PST): {current_time}"""
CAPTION_V1 = f"""The original RewardBench -- the first reward model evaluation. | |
[Code](https://github.com/allenai/reward-bench) | [Eval. Dataset v1](https://huggingface.co/datasets/allenai/reward-bench) | [Prior Test Sets](https://huggingface.co/datasets/allenai/pref-test-sets) | [Results v1](https://huggingface.co/datasets/allenai/reward-bench-results) | [Paper v1](https://arxiv.org/abs/2403.13787) | Total models: {{}} | * Unverified models | ⚠️ Dataset Contamination | Last restart (PST): {current_time} | |
**Note**: This leaderboard is frozen and will not be updated. The final version of the evaluation results is available in the source for this application.
⚠️ Many of the top models were trained on unintentionally contaminated, AI-generated data; for more information, see this [gist](https://gist.github.com/natolambert/1aed306000c13e0e8c5bc17c1a5dd300).
"""