# MATH

ℹ️ This is the 4-shot variant!

## Paper

Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.

NOTE: The few-shot prompt and the extraction of generated answers follow [Minerva](https://arxiv.org/abs/2206.14858), and exact-match equivalence between the extracted answer and the reference answer is checked with the `sympy` library. This requires additional dependencies, which can be installed via the `lm-eval[math]` extra.
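
For context, the `sympy` check amounts to parsing the two final answers and testing whether they are symbolically equivalent. The snippet below is only a minimal sketch of that idea, not the harness's actual implementation; the function name `answers_equivalent` is illustrative, and the real code performs many additional LaTeX normalizations before comparing.

```python
# Minimal sketch of a sympy-based equivalence check (illustrative only; the
# harness's real implementation normalizes the LaTeX far more thoroughly).
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs sympy's LaTeX parser backend (antlr4 runtime)


def answers_equivalent(candidate: str, reference: str) -> bool:
    """Return True if two LaTeX answer strings are symbolically equal."""
    try:
        return simplify(parse_latex(candidate) - parse_latex(reference)) == 0
    except Exception:
        # If parsing fails, fall back to a plain string comparison.
        return candidate.strip() == reference.strip()


print(answers_equivalent(r"\frac{1}{2}", r"\frac{2}{4}"))  # True
```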

Homepage: https://github.com/hendrycks/math

## Citation

```
@article{hendrycksmath2021,
    title={Measuring Mathematical Problem Solving With the MATH Dataset},
    author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
    journal={NeurIPS},
    year={2021}
}

@misc{2206.14858,
    Author = {Aitor Lewkowycz and Anders Andreassen and David Dohan and Ethan Dyer and Henryk Michalewski and Vinay Ramasesh and Ambrose Slone and Cem Anil and Imanol Schlag and Theo Gutman-Solo and Yuhuai Wu and Behnam Neyshabur and Guy Gur-Ari and Vedant Misra},
    Title = {Solving Quantitative Reasoning Problems with Language Models},
    Year = {2022},
    Eprint = {arXiv:2206.14858},
}
```

### Groups and Tasks

#### Groups

- `minerva_math`

#### Tasks

- `minerva_math_algebra`
- `minerva_math_counting_and_prob`
- `minerva_math_geometry`
- `minerva_math_intermediate_algebra`
- `minerva_math_num_theory`
- `minerva_math_prealgebra`
- `minerva_math_precalc`
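
As a usage illustration, the group (or any individual task above) can be selected by name when invoking the harness. The sketch below uses the Python entry point `lm_eval.simple_evaluate`; the model identifier and batch size are placeholder assumptions, and keyword names can vary slightly between harness versions.

```python
# Sketch: running the 4-shot MATH group through the harness's Python API.
# The pretrained model and batch size below are placeholders, not recommendations.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                         # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",   # placeholder checkpoint
    tasks=["minerva_math"],                             # or any individual subtask above
    batch_size=8,
)

# Aggregate and per-subtask metrics are keyed by task name under "results".
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```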

### Checklist

The checklist is the following:

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
    * The original paper's implementation first fine-tunes the model on the training data. The paper also reports a few-shot evaluation for GPT-3; however, the few-shot context used here is sourced from [Lewkowycz et al.](https://arxiv.org/abs/2206.14858). The accuracy achieved with Llama-2 models is comparable to, though not identical to, that reported in the paper.

If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?

### Variant Wishlist

- [ ] zero-shot variant