
SpanishBench

Paper

SpanishBench is a benchmark for evaluating language models on Spanish tasks. That is, it evaluates the ability of a language model to understand and generate Spanish text. SpanishBench offers a combination of pre-existing, open datasets. All the details of SpanishBench will be published in a paper soon.

The new evaluation datasets included in SpanishBench are:

Task | Category | Homepage
COPA-es | Commonsense Reasoning | https://huggingface.co/datasets/BSC-LT/COPA-es
OpenBookQA_es | Question Answering | https://huggingface.co/datasets/BSC-LT/openbookqa-es

The datasets included in SpanishBench that have been made public in previous publications are:

Citation

Paper for SpanishBench coming soon.

Groups and Tasks

Groups

  • spanish_bench: All tasks included in SpanishBench.
  • flores_es: All FLORES translation tasks from or to Spanish.
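
As a rough illustration of how one of these groups might be selected, the sketch below assembles a command line for the LM Evaluation Harness `lm_eval` entry point. The helper name `build_eval_command` and the model identifier `your-org/your-model` are hypothetical placeholders, and the flags shown should be checked against the installed harness version.

```python
import shlex


def build_eval_command(pretrained, tasks, batch_size=8):
    """Assemble an lm_eval argument list (hypothetical helper, not part of the harness)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        # Group and task names are passed as one comma-separated string.
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]


# "your-org/your-model" is a placeholder, not a real checkpoint.
command = build_eval_command("your-org/your-model", ["spanish_bench"])
print(shlex.join(command))
```

Passing `["flores_es"]` instead would restrict the run to the translation tasks, assuming the group names resolve as listed above.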

Tags

  • phrases_es: Two Phrases_va tasks for language adaptation between Spanish and Valencian.

Tasks

The following tasks evaluate models on the SpanishBench datasets using various scoring methods.

  • belebele_spa_Latn
  • copa_es
  • escola
  • flores_es
  • flores_es-ca
  • flores_es-de
  • flores_es-en
  • flores_es-eu
  • flores_es-fr
  • flores_es-gl
  • flores_es-it
  • flores_es-pt
  • flores_ca-es
  • flores_de-es
  • flores_en-es
  • flores_eu-es
  • flores_fr-es
  • flores_gl-es
  • flores_it-es
  • flores_pt-es
  • mgsm_direct_es_spanish_bench (the _spanish_bench suffix is used because of an existing open issue in the original task)
  • openbookqa_es
  • paws_es_spanish_bench (the _spanish_bench suffix is used because of an existing open issue in the original task)
  • phrases_es
  • wnli_es
  • xlsum_es
  • xnli_es_spanish_bench (the _spanish_bench suffix is used because of an existing open issue in the original task)
  • xquad_es
  • xstorycloze_es
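
The FLORES entries in the list above follow a regular `flores_<source>-<target>` pattern, with Spanish paired against eight other languages in both directions. The sketch below reproduces those names from the language codes; the function name `flores_task_names` is a hypothetical helper used only for illustration.

```python
# Language codes paired with Spanish in the task list above.
PARTNER_LANGS = ["ca", "de", "en", "eu", "fr", "gl", "it", "pt"]


def flores_task_names(pivot="es", partners=PARTNER_LANGS):
    """Return the directional FLORES task names for a pivot language."""
    outgoing = [f"flores_{pivot}-{lang}" for lang in partners]  # Spanish as source
    incoming = [f"flores_{lang}-{pivot}" for lang in partners]  # Spanish as target
    return outgoing + incoming


names = flores_task_names()
for name in names:
    print(name)
```

This yields the sixteen directional tasks; the `flores_es` entry in the list is the group that collects them.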

Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:

  • belebele_spa_Latn: Belebele Spanish
  • mgsm_direct_es: MGSM Spanish (fixed an existing open issue in the original task)
  • paws_es: PAWS-X Spanish (fixed an existing open issue in the original task)
  • xnli_es: XNLI Spanish (fixed an existing open issue in the original task)
  • xstorycloze_es: XStoryCloze Spanish
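
Three of the tasks above carry a `_spanish_bench` suffix to distinguish them from the original harness tasks. A minimal sketch of mapping a SpanishBench task name back to the original name, assuming the suffix is the only difference (the helper `original_task_name` is hypothetical):

```python
SUFFIX = "_spanish_bench"


def original_task_name(task):
    """Strip the SpanishBench suffix if present; other names pass through unchanged."""
    if task.endswith(SUFFIX):
        return task[: -len(SUFFIX)]
    return task


for task in ["mgsm_direct_es_spanish_bench", "paws_es_spanish_bench",
             "xnli_es_spanish_bench", "belebele_spa_Latn"]:
    print(task, "->", original_task_name(task))
```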

Checklist

  • Is the task an existing benchmark in the literature?
    • Have you referenced the original paper that introduced the task?
    • If yes, does the original paper provide a reference implementation?
      • Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:

  • Is the "Main" variant of this task clearly denoted?
  • Have you provided a short sentence in a README on what each new variant adds / evaluates?
  • Have you noted which, if any, published evaluation setups are matched by this variant?