File size: 2,739 Bytes
9d5b280 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 |
# Task-name
### Paper
Title: `Measuring Massive Multitask Language Understanding`
Abstract: `https://arxiv.org/abs/2009.03300`
`The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.`
Homepage: `https://github.com/hendrycks/test`
Note: The `Flan` variants are derived from [here](https://github.com/jasonwei20/flan-2), and as described in Appendix D.1 of [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416).
### Citation
```
@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
```
### Groups, Tags, and Tasks
#### Groups
* `mmlu`: `Original multiple-choice MMLU benchmark`
* `mmlu_continuation`: `MMLU but with continuation prompts`
* `mmlu_generation`: `MMLU generation`
MMLU is the original benchmark as implemented by Hendrycks et al. with the choices in context and the answer letters (e.g `A`, `B`, `C`, `D`) in the continuation.
`mmlu_continuation` is a cloze-style variant without the choices in context and the full answer choice in the continuation.
`mmlu_generation` is a generation variant, similar to the original but the LLM is asked to generate the correct answer letter.
#### Subgroups
* `mmlu_stem'
* `mmlu_humanities'
* `mmlu_social_sciences'
* `mmlu_other'
Subgroup variants are prefixed with the subgroup name, e.g. `mmlu_stem_continuation`.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
# changelog
ver 1: PR #497
switch to original implementation
ver 2: PR #2116
add missing newline in description.
|