# LingOly

### Paper

Title: `LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages`

Abstract: `https://arxiv.org/abs/2406.06196`

`In this paper, we present the LingOly benchmark, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs demonstrate the benchmark to be challenging, and models perform poorly on the higher difficulty problems. On harder problems, even the top model only achieved 38.7% accuracy, 24.7% improvement over the no-context baseline. Large closed models typically outperform open models, and in general, the higher resource the language, the better the scores. These results indicate, in absence of memorisation, true multi-step out-of-domain reasoning remains a challenge for current language models.`

Homepage: `https://github.com/am-bean/lingOly`
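
The paper scores models on both direct accuracy and the improvement over a no-context baseline, which discounts answers a model could produce from memorisation alone. The snippet below is a minimal sketch of that idea only, not the harness's actual scoring code; the function names and the string normalisation are illustrative assumptions.

```python
# Illustrative sketch of the paper's scoring idea, not the reference
# implementation: exact-match accuracy, plus the improvement over a
# no-context baseline to discount memorised answers.

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalised generation equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def no_context_delta(context_scores: list[float], nocontext_scores: list[float]) -> float:
    """Accuracy with the puzzle context minus accuracy without it.

    A model that answers correctly even when the puzzle context is removed
    has likely memorised the answer, so only the improvement is credited.
    """
    context_acc = sum(context_scores) / len(context_scores)
    nocontext_acc = sum(nocontext_scores) / len(nocontext_scores)
    return context_acc - nocontext_acc
```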

### Citation

```
@article{beanLINGOLYBenchmarkOlympiadLevel2024,
  title = {{LINGOLY}: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages},
  shorttitle = {{LINGOLY}},
  url = {http://arxiv.org/abs/2406.06196},
  author = {Bean, Andrew M. and Hellsten, Simi and Mayne, Harry and Magomere, Jabez and Chi, Ethan A. and Chi, Ryan and Hale, Scott A. and Kirk, Hannah Rose},
  month = jun,
  year = {2024},
  keywords = {Computer Science - Computation and Language}
}
```

### Tasks

* `lingoly`: runs both `lingoly_context` and `lingoly_nocontext` and computes the difference between their scores
* `lingoly_context`: exact match of generations to reference answers
* `lingoly_nocontext`: exact match of generations to reference answers, with the puzzle context removed from the prompt
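
Assuming the standard lm-evaluation-harness interfaces, these tasks run like any other task in the library. A sketch using the Python API, where the model backend and checkpoint are placeholders:

```python
import lm_eval

# Placeholder model and checkpoint; any harness-supported model should work.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-1.4b",
    tasks=["lingoly"],  # runs both the _context and _nocontext variants
)
print(results["results"])
```

The equivalent CLI invocation should follow the usual pattern, e.g. `lm_eval --model hf --model_args pretrained=<model> --tasks lingoly`.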

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?