|
# BasqueBench |
|
|
|
### Paper |
|
|
|
BasqueBench is a benchmark for evaluating language models on Basque tasks. That is, it evaluates the ability of a language model to understand and generate Basque text. BasqueBench offers a combination of pre-existing, open datasets and datasets developed exclusively for this benchmark. All the details of BasqueBench will be published in a paper soon.
|
|
|
The new evaluation datasets included in BasqueBench are: |
|
| Task | Category | Homepage |
|:-------------:|:-----:|:-----:|
| MGSM_eu | Math | https://huggingface.co/datasets/HiTZ/MGSM-eu |
| PIQA_eu | Question Answering | https://huggingface.co/datasets/HiTZ/PIQA-eu |
| WNLI_eu | Natural Language Inference | https://huggingface.co/datasets/HiTZ/wnli-eu |
| XCOPA_eu | Commonsense Reasoning | https://huggingface.co/datasets/HiTZ/XCOPA-eu |
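
All four new datasets are hosted on the Hugging Face Hub, so they can be inspected directly before running any evaluation. The snippet below is a minimal sketch using the `datasets` library; it assumes `HiTZ/MGSM-eu` exposes a single default configuration, and the split and column names are whatever the dataset card defines.

```python
# Minimal sketch (not part of the harness): inspect one of the new
# BasqueBench datasets with the Hugging Face `datasets` library.
from datasets import load_dataset

# Assumes HiTZ/MGSM-eu exposes a single default configuration.
ds = load_dataset("HiTZ/MGSM-eu")

# Print whatever splits and columns the dataset card defines.
for split_name, split in ds.items():
    print(split_name, split.num_rows, split.column_names)
```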
|
|
|
The datasets included in BasqueBench that have been made public in previous publications are:
|
|
|
| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_eu | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| EusExams | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusExams |
| EusProficiency | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusProficiency |
| EusReading | Reading Comprehension | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusReading |
| EusTrivia | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusTrivia |
| FLORES_eu | Translation | [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) | https://huggingface.co/datasets/facebook/flores |
| QNLIeu | Natural Language Inference | [BasqueGLUE: A Natural Language Understanding Benchmark for Basque](https://aclanthology.org/2022.lrec-1.172/) | https://huggingface.co/datasets/orai-nlp/basqueGLUE |
| XNLIeu | Natural Language Inference | [XNLIeu: a dataset for cross-lingual NLI in Basque](https://arxiv.org/abs/2404.06996) | https://huggingface.co/datasets/HiTZ/xnli-eu |
| XStoryCloze_eu | Commonsense Reasoning | [Few-shot Learning with Multilingual Generative Language Models](https://aclanthology.org/2022.emnlp-main.616/) | https://huggingface.co/datasets/juletxara/xstory_cloze |
|
|
|
|
|
### Citation |
|
Paper for BasqueBench coming soon. |
|
|
|
### Groups and Tasks |
|
|
|
#### Groups |
|
|
|
- `basque_bench`: All tasks included in BasqueBench. |
|
- `flores_eu`: All FLORES translation tasks from or to Basque. |
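
Both groups can be run through the LM Evaluation Harness in the usual way. The snippet below is a minimal sketch using the harness's Python API (assuming lm-eval >= 0.4); the model checkpoint is only a small placeholder, not a recommended baseline.

```python
# Minimal sketch: run the full BasqueBench group through the harness's
# Python API. The model id is a small placeholder, not a recommendation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["basque_bench"],  # or ["flores_eu"] for just the translation tasks
    batch_size=8,
)

# Per-task metrics are collected under the "results" key.
print(results["results"])
```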
|
|
|
#### Tasks |
|
|
|
The following tasks evaluate models on the BasqueBench datasets using various scoring methods.
|
- `belebele_eus_Latn` |
|
- `eus_exams_eu` |
|
- `eus_proficiency` |
|
- `eus_reading` |
|
- `eus_trivia` |
|
- `flores_eu` |
|
- `flores_eu-ca` |
|
- `flores_eu-de` |
|
- `flores_eu-en` |
|
- `flores_eu-es` |
|
- `flores_eu-fr` |
|
- `flores_eu-gl` |
|
- `flores_eu-it` |
|
- `flores_eu-pt` |
|
- `flores_ca-eu` |
|
- `flores_de-eu` |
|
- `flores_en-eu` |
|
- `flores_es-eu` |
|
- `flores_fr-eu` |
|
- `flores_gl-eu` |
|
- `flores_it-eu` |
|
- `flores_pt-eu` |
|
- `mgsm_direct_eu` |
|
- `mgsm_native_cot_eu` |
|
- `piqa_eu` |
|
- `qnlieu` |
|
- `wnli_eu` |
|
- `xcopa_eu` |
|
- `xnli_eu` |
|
- `xnli_eu_native` |
|
- `xstorycloze_eu` |
|
|
|
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are: |
|
- `belebele_eus_Latn`: Belebele Basque |
|
- `qnlieu`: From BasqueGLUE |
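
Individual tasks from the list above can also be evaluated on their own instead of through the `basque_bench` group. The sketch below assumes the same Python API as before; the chosen tasks, few-shot setting, and model id are illustrative only and do not correspond to any published setup.

```python
# Minimal sketch: evaluate a hand-picked subset of BasqueBench tasks.
# Task names come from the list above; everything else is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["xnli_eu", "flores_en-eu", "mgsm_native_cot_eu"],
    num_fewshot=5,  # illustrative; published setups may differ per task
    batch_size=4,
)

for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```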
|
|
|
|
|
### Checklist |
|
|
|
* [x] Is the task an existing benchmark in the literature? |
|
* [ ] Have you referenced the original paper that introduced the task? |
|
* [ ] If yes, does the original paper provide a reference implementation? |
|
* [ ] Yes, original implementation contributed by author of the benchmark |
|
|
|
If other tasks on this dataset are already supported: |
|
* [ ] Is the "Main" variant of this task clearly denoted? |
|
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates? |
|
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant? |
|
|