saumyamalik's picture
update paper link
4a49ee3
from datetime import datetime
import pytz
ABOUT_TEXT_V2 = """
The RewardBench 2 evaluation dataset is the new version of RewardBench that is based on unseen human data and designed to be substantially more difficult! RewardBench 2 evaluates capabilities of reward models over the following categories:
1. **Factuality** (*NEW!*): Tests the ability of RMs to detect hallucinations and other basic errors in completions.
2. **Precise Instruction Following** (*NEW!*): Tests the ability of RMs to judge whether text follows precise instructions, such as "Answer without the letter u".
3. **Math**: Tests RMs' abilities at math, on open-ended human prompts ranging from middle school physics and geometry to college-level chemistry, calculus, combinatorics, and more.
4. **Safety**: Tests RMs' abilities to correctly comply with or refuse prompts related to harmful use cases as well as general compliance behaviors.
5. **Focus**: Tests RMs' ability to detect high-quality, on-topic answers to general user queries.
6. **Ties** (*NEW*!): This new type of subset tests the robustness of RMs in domains with many possible similar answers. For example, the question "Name a color of the rainbow" has seven possible correct answers and infinitely many incorrect ones.
The RewardBench 2 leaderboard averages over these six subsets.
For the first five categories, the scoring for RewardBench 2 evaluates success as whether the score of a prompt-chosen pair is greater than the score of *three* prompt-rejected pairs.
The "Ties" score is a weighted score of accuracy (as measured by *all* valid correct answers being scored higher than *all* incorrect answers) and whether the reward margin between correct and incorrect answers exceeds that of the highest and lowest-scored correct responses. This metric rewards not only correctness, but also a model's ability to prioritize correct answers over incorrect ones more strongly than it distinguishes between equally valid correct responses.
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/reward-bench/main-fig-hor.png" alt="RewardBench 2 Flow" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
## Dataset Construction Summary
| Domain | Count | Prompt Source | Method of generating completions | Completion Filtering |
|--------|-------|---------------|----------------------------------|---------------------|
| Factuality | 475 | Human | Both | Multi-LM-as-a-judge |
| Precise IF | 160 | Human | Natural | Verifier functions |
| Math | 183 | Human | Natural | Majority voting |
| Safety | 450 | CoCoNot | Both | LM-as-a-judge & rubrics |
| Focus | 495 | Human | System Prompt Variation | N/A |
| Ties | 102 | Manual | System Prompt Variation | Manual verification |
## Dataset Details
Each sample in the dataset has the following items.
Note, the dataset is single-turn:
* `prompt` (`str`): the instruction given in the various test sets.
* `chosen` (`list[str]`): the chosen response(s) (1 chosen response for all subsets but ties)
* `rejected` (`list[str]`): the rejected responses (3 chosen responses for all subsets but ties)
* `num_correct` (`int`): the number of chosen responses
* `num_rejected` (`int`): the number of rejected responses
* `total_completions` (`int`): the total number of responses
* `models` (`list[str]`): a list of models that the chosen and rejected responses are generated from, respectively
* `subset` (`str`): the subset the datapoint is part of.
* `id` (`int`): an incremented id for every prompt in the benchmark.
To select a specific subset use HuggingFace Datasets `.filter` functionality.
```
dataset = dataset.filter(lambda ex: ex["subset"] == "Factuality")
```
## Models Used
We generated completions from the following models:
- [Mistral 7B Instruct v0.3](https://huggingface.co/mistralai/Mistral-7B-v0.3) (Apache 2.0)
- [Tulu 3 8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) (Llama 3.1 Community License Agreement)
- [Tulu 3 70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) (Llama 3.1 Community License Agreement)
- [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (Llama 3.1 Community License Agreement)
- [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (Llama 3.1 Community License Agreement)
- [Llama 3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) (Llama 3.2 Community License Agreement)
- [Llama 2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) (Llama 2 Community License Agreement)
- [Tulu 2 70B](https://huggingface.co/allenai/tulu-2-dpo-70b) (Ai2 ImpACT Low Risk License)
- [Qwen2.5 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) (Qwen License Agreement)
- [Qwen2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) (Apache 2.0)
- [Qwen2.5 14B Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (Apache 2.0)
- [Qwen2.5 0.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) (Apache 2.0)
- [Qwen2.5 Math 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct) (Qwen License Agreement)
- [Qwen2.5 Math 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct) (Apache 2.0)
- [Deepseek Math 7B RL](https://huggingface.co/deepseek-ai/deepseek-math-7b-rl) (This model is licensed under the Deepseek License. Any use of the outputs from this model must be in accordance with the use restrictions in the [Deepseek License](https://github.com/deepseek-ai/DeepSeek-Math/blob/main/LICENSE-MODEL).)
- [OLMoE 1B 7B 0924 Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924) (Apache 2.0)
- [Dolphin 2.0 Mistral 7b](https://huggingface.co/cognitivecomputations/dolphin-2.0-mistral-7b) (Apache 2.0)
- [Zephyr 7b Beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) (MIT License)
- GPT-4o (Outputs produced by GPT-4 are subject to OpenAI's [terms of use](https://openai.com/policies/row-terms-of-use/))
- Claude 3.5 Sonnet (Outputs produced by Claude are subject to Anthropic [terms of service](https://www.anthropic.com/legal/consumer-terms) and [usage policy](https://www.anthropic.com/legal/aup))
## License
This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use). This dataset includes output data generated from third party models that are subject to separate terms governing their use.
## Trained Reward Models
We also trained and released several reward models— check out the [RewardBench 2 Collection](https://huggingface.co/collections/allenai/reward-bench-2-683d2612a4b3e38a3e53bb51) to use them!
"""
ABOUT_TEXT_V1 = """
We compute the win percentage for a reward model on hand curated chosen-rejected pairs for each prompt.
A win is when the score for the chosen response is higher than the score for the rejected response.
Note: Models with (*) after the model name are independently submitted model scores which have not been verified by the RewardBench team.
## Overview
We average over 4 core sections (per prompt weighting):
1. **Chat**: Includes the easy chat subsets (alpacaeval-easy, alpacaeval-length, alpacaeval-hard, mt-bench-easy, mt-bench-medium)
2. **Chat Hard**: Includes the hard chat subsets (mt-bench-hard, llmbar-natural, llmbar-adver-neighbor, llmbar-adver-GPTInst, llmbar-adver-GPTOut, llmbar-adver-manual)
3. **Safety**: Includes the safety subsets (refusals-dangerous, refusals-offensive, xstest-should-refuse, xstest-should-respond, do not answer)
4. **Reasoning**: Includes the code and math subsets (math-prm, hep-cpp, hep-go, hep-java, hep-js, hep-python, hep-rust)
For Reasoning, we increase the weight of the PRM-Math subset so code and math abilities are weighed equally in the final number, rather than increasing the relevance of code.
We add a final column, **Prior Sets** -- includes the test sets ([anthropic_helpful](https://huggingface.co/datasets/Anthropic/hh-rlhf), [anthropic_hhh](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment), [shp](https://huggingface.co/datasets/stanfordnlp/SHP), [summarize](https://huggingface.co/datasets/openai/summarize_from_feedback))
Prior sets is weighted 0.5x in the final score to avoid gamification by training on the available training sets of Anthropic HH, SHP, and Summarize.
Once all subsets weighted averages are achieved, the final RewardBench score is the average across the 5 subset scores.
We include multiple types of reward models in this evaluation:
1. **Sequence Classifiers** (Seq. Classifier): A model, normally trained with HuggingFace AutoModelForSequenceClassification, that takes in a prompt and a response and outputs a score.
2. **Custom Classifiers**: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
3. **DPO**: Models trained with Direct Preference Optimization (DPO), with modifiers such as `-ref-free` or `-norm` changing how scores are computed. *Note*: This also includes other models trained with implicit rewards, such as those trained with [KTO](https://arxiv.org/abs/2402.01306).
4. **Random**: Random choice baseline.
4. **Generative**: Prompting fine-tuned models to choose between two answers, similar to MT Bench and AlpacaEval.
All models are evaluated in fp16 expect for Starling-7B, which is evaluated in fp32.
*Note*: The reference models for DPO models (and other implicit rewards) can be found in two ways.
* Click on a specific model in results and you'll see a key `ref_model`, e.g. [Qwen](https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/eval-set/Qwen/Qwen1.5-72B-Chat.json).
* All the reference models are listed in the [evaluation configs](https://github.com/allenai/reward-bench/blob/main/scripts/configs/eval_configs.yaml).
### Subset Details
Total number of the prompts is: 2985, filtered from 5123.
| Subset | Num. Samples (Pre-filtering, post-filtering) | Description |
| :---------- | :-----: | :---------: |
| alpacaeval-easy | 805, 100 | Great model vs poor model |
| alpacaeval-length | 805, 95 | Good model vs low model, equal length |
| alpacaeval-hard | 805, 95 | Great model vs baseline model |
| mt-bench-easy | 28, 28 | MT Bench 10s vs 1s |
| mt-bench-medium | 45, 40 | MT Bench 9s vs 2-5s |
| mt-bench-hard | 45, 37 | MT Bench 7-8 vs 5-6 |
| refusals-dangerous | 505, 100 | Dangerous response vs no response |
| refusals-offensive | 704, 100 | Offensive response vs no response |
| llmbar-natural | 100 | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
| llmbar-adver-neighbor | 134 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
| llmbar-adver-GPTInst | 92 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4 generated off-topic prompt response |
| llmbar-adver-GPTOut | 47 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 responses |
| llmbar-adver-manual | 46 | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
| xstest-should-refuse | 450, 154 | False response dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
| xstest-should-respond | 450, 250 | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
| do not answer | 939, 136 | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer) |
| math-prm | 447 | Human references vs. model error from OpenAI's Let's Verify Step by Step |
| hep-cpp | 164 | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124)) |
| hep-go | 164 | Go code |
| hep-java | 164 | Java code |
| hep-js | 164 | Javascript code |
| hep-python | 164 | Python code |
| hep-rust | 164 | Rust code |
Lengths (mean, std. dev.) include the prompt
| subset | length bias | chosen_chars | rejected_chars | chosen_tokens | rejected_tokens | chosen_unique_tokens | rejected_unique_tokens |
|-----------------------|-------------|----------------|------------------|-----------------|-------------------|------------------------|--------------------------|
| alpacaeval-easy | True | 2283 (1138) | 646 (482) | 591 (303) | 167 (139) | 253 (117) | 83 (46) |
| alpacaeval-hard | True | 1590 (769) | 526 (430) | 412 (199) | 137 (117) | 173 (67) | 71 (48) |
| alpacaeval-length | Neutral | 2001 (1137) | 2127 (1787) | 511 (283) | 597 (530) | 192 (85) | 189 (99) |
| donotanswer | False | 755 (722) | 1389 (695) | 170 (161) | 320 (164) | 104 (82) | 157 (73) |
| hep-cpp | Neutral | 709 (341) | 705 (342) | 261 (125) | 259 (125) | 100 (29) | 99 (29) |
| hep-go | Neutral | 738 (361) | 734 (361) | 266 (118) | 265 (118) | 100 (29) | 99 (29) |
| hep-java | Neutral | 821 (393) | 814 (390) | 263 (123) | 261 (122) | 102 (30) | 102 (30) |
| hep-js | Neutral | 677 (341) | 673 (339) | 251 (129) | 250 (128) | 93 (29) | 93 (29) |
| hep-python | Neutral | 618 (301) | 616 (300) | 212 (98) | 211 (98) | 86 (26) | 85 (26) |
| hep-rust | Neutral | 666 (391) | 660 (391) | 221 (132) | 219 (132) | 95 (29) | 95 (29) |
| llmbar-adver-GPTInst | False | 735 (578) | 1623 (1055) | 170 (135) | 377 (245) | 93 (59) | 179 (106) |
| llmbar-adver-GPTOut | Neutral | 378 (339) | 359 (319) | 96 (81) | 101 (94) | 60 (45) | 55 (41) |
| llmbar-adver-manual | False | 666 (584) | 1139 (866) | 160 (134) | 264 (194) | 92 (63) | 140 (90) |
| llmbar-adver-neighbor | False | 287 (297) | 712 (749) | 70 (76) | 173 (175) | 43 (31) | 91 (70) |
| llmbar-natural | Neutral | 553 (644) | 530 (597) | 139 (162) | 130 (140) | 75 (71) | 70 (62) |
| mt-bench-easy | False | 1563 (720) | 2129 (1520) | 377 (159) | 551 (415) | 166 (55) | 116 (62) |
| mt-bench-hard | False | 1225 (499) | 1471 (1016) | 284 (116) | 349 (234) | 131 (45) | 136 (58) |
| mt-bench-med | Neutral | 1558 (729) | 1733 (1312) | 377 (170) | 410 (311) | 162 (58) | 145 (88) |
| refusals-dangerous | False | 597 (81) | 1828 (547) | 131 (20) | 459 (136) | 90 (12) | 211 (50) |
| refusals-offensive | False | 365 (116) | 1092 (1146) | 82 (25) | 299 (278) | 64 (15) | 134 (101) |
| xstest-should-refuse | False | 584 (419) | 904 (493) | 129 (89) | 217 (115) | 81 (47) | 116 (53) |
| xstest-should-respond | True | 771 (420) | 466 (427) | 189 (105) | 107 (94) | 104 (48) | 67 (48) |
For more details, see the [dataset](https://huggingface.co/datasets/allenai/reward-bench).
"""
# Get Pacific time zone (handles PST/PDT automatically)
pacific_tz = pytz.timezone("America/Los_Angeles")
current_time = datetime.now(pacific_tz).strftime("%H:%M %Z, %d %b %Y")
TOP_TEXT = """# RewardBench: Evaluating Reward Models"""
CAPTION_V2 = f"""The *new version* of RewardBench that is based on unseen human data and designed to be substantially more difficult!
[Code](https://github.com/allenai/reward-bench) | [Eval. Dataset v2](https://huggingface.co/datasets/allenai/reward-bench-2) | [Results v2](https://huggingface.co/datasets/allenai/reward-bench-2-results) | [Paper](https://arxiv.org/abs/2506.01937) | Total models: {{}} | Last restart (PST): {current_time}"""
CAPTION_V1 = f"""The original RewardBench -- the first reward model evaluation.
[Code](https://github.com/allenai/reward-bench) | [Eval. Dataset v1](https://huggingface.co/datasets/allenai/reward-bench) | [Prior Test Sets](https://huggingface.co/datasets/allenai/pref-test-sets) | [Results v1](https://huggingface.co/datasets/allenai/reward-bench-results) | [Paper v1](https://arxiv.org/abs/2403.13787) | Total models: {{}} | * Unverified models | ⚠️ Dataset Contamination | Last restart (PST): {current_time}
**Note**: This leaderboard is frozen and will not be updated. The final version of the evaluation results are available in the source for this application.
⚠️ Many of the top models were trained on unintentionally contaminated, AI-generated data, for more information, see this [gist](https://gist.github.com/natolambert/1aed306000c13e0e8c5bc17c1a5dd300).
"""