SpanishBench

Paper

SpanishBench is a benchmark for evaluating language models in Spanish tasks. This is, it evaluates the ability of a language model to understand and generate Spanish text. SpanishBench offers a combination of pre-existing, open datasets. All the details of SpanishBench will be published in a paper soon.

The new evaluation datasets included in SpanishBench are:

Task	Category	Homepage
COPA-es	Commonsense Reasoning	https://huggingface.co/datasets/BSC-LT/COPA-es
OpenBookQA_es	Question Answering	https://huggingface.co/datasets/BSC-LT/openbookqa-es

The datasets included in SpanishBench that have been made public in previous publications are:

Task	Category	Paper title	Homepage
Belebele_es	Reading Comprehension	The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants	https://huggingface.co/datasets/facebook/belebele
EsCoLA	Linguistic Acceptability	EsCoLA: Spanish Corpus of Linguistic Acceptability	https://huggingface.co/datasets/nbel/EsCoLA
FLORES_es	Translation	The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation	https://huggingface.co/datasets/facebook/flores
MGSM_es	Math	Language Models are Multilingual Chain-of-Thought Reasoners	https://huggingface.co/datasets/juletxara/mgsm
PAWS-X_es	Paraphrasing	PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification	https://huggingface.co/datasets/google-research-datasets/paws-x
WNLI-es	Natural Language Inference	No paper.	https://huggingface.co/datasets/PlanTL-GOB-ES/wnli-es
XL-Sum_es	Summarization	XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages	https://huggingface.co/datasets/csebuetnlp/xlsum
XNLI_es	Natural Language Inference	XNLI: Evaluating Cross-lingual Sentence Representations	https://huggingface.co/datasets/facebook/xnli
XQuAD_es	Question Answering	On the Cross-lingual Transferability of Monolingual Representations	https://huggingface.co/datasets/google/xquad
XStoryCloze_es	Commonsense Reasoning	Few-shot Learning with Multilingual Generative Language Models	https://huggingface.co/datasets/juletxara/xstory_cloze

Citation

Paper for SpanishBench coming soon.

Groups and Tasks

Groups

spanish_bench: All tasks included in SpanishBench.
flores_es: All FLORES translation tasks from or to Spanish.

Tasks

The following tasks evaluate tasks on SpanishBench dataset using various scoring methods.

belebele_spa_Latn
copa_es
escola
flores_es
flores_es-ca
flores_es-de
flores_es-en
flores_es-eu
flores_es-fr
flores_es-gl
flores_es-it
flores_es-pt
flores_ca-es
flores_de-es
flores_en-es
flores_eu-es
flores_fr-es
flores_gl-es
flores_it-es
flores_pt-es
mgsm_direct_es_spanish_bench (spanish_bench is due to an existing open issue in the original task)
openbookqa_es
paws_es_spanish_bench (spanish_bench is due to an existing open issue in the original task)
phrases_es
wnli_es
xlsum_es
xnli_es_spanish_bench (spanish_bench is due to an existing open issue in the original task)
xquad_es
xstorycloze_es

Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:

belebele_spa_Latn: Belebele Spanish
mgsm_direct_es: MGSM Spanish (fixed an existing open issue in the original task)
paws_es: PAWS-X Spanish (fixed an existing open issue in the original task)
xnli_es: XNLI Spanish (fixed an existing open issue in the original task)
xstorycloze_es: XStoryCloze Spanish

Checklist

Is the task an existing benchmark in the literature?
- Have you referenced the original paper that introduced the task?
- If yes, does the original paper provide a reference implementation?
  - Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:

Is the "Main" variant of this task clearly denoted?
Have you provided a short sentence in a README on what each new variant adds / evaluates?
Have you noted which, if any, published evaluation setups are matched by this variant?

BayesTensor
/

out