|
# XNLI |
|
|
|
### Paper |
|
|
|
Title: `XNLI: Evaluating Cross-lingual Sentence Representations` |
|
|
|
Abstract: https://arxiv.org/abs/1809.05053 |
|
|
|
Based on the implementation of @yongzx (see https://github.com/EleutherAI/lm-evaluation-harness/pull/258) |
|
|
|
Prompt format (same as XGLM and mGPT): |
|
|
|
```
sentence1 + ", right? " + mask = (Yes|Also|No) + ", " + sentence2
```
|
|
|
The prediction is the full sequence with the highest likelihood.
|
|
|
Language-specific prompts are translated word by word with Google Translate
and may differ from those used by mGPT and XGLM (the authors do not provide their prompts).
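
A minimal sketch of this scoring scheme, for illustration only (it is not the harness's actual implementation). The `loglikelihood` argument is a hypothetical callable returning the model's log-probability of a full string, and the English connector/mask words follow the format above:

```python
# Illustrative sketch of the XGLM/mGPT-style prompt scoring described above.
PROMPTS = {
    # English connector and mask words; other languages use the
    # word-by-word Google-translated equivalents noted above.
    "en": {"connector": ", right? ", "labels": ["Yes", "Also", "No"]},
}

# Mask words map to NLI classes: Yes=entailment, Also=neutral, No=contradiction.
CLASSES = ["entailment", "neutral", "contradiction"]

def build_candidates(sentence1: str, sentence2: str, lang: str = "en") -> list[str]:
    """Build one full sequence per candidate mask word."""
    cfg = PROMPTS[lang]
    return [
        sentence1 + cfg["connector"] + label + ", " + sentence2
        for label in cfg["labels"]
    ]

def predict(sentence1: str, sentence2: str, loglikelihood, lang: str = "en") -> str:
    """Return the class whose full sequence the model scores highest."""
    scores = [loglikelihood(c) for c in build_candidates(sentence1, sentence2, lang)]
    return CLASSES[scores.index(max(scores))]
```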
|
|
|
Homepage: https://github.com/facebookresearch/XNLI |
|
|
|
|
|
### Citation |
|
|
|
""" |
|
@InProceedings{conneau2018xnli, |
|
author = "Conneau, Alexis |
|
and Rinott, Ruty |
|
and Lample, Guillaume |
|
and Williams, Adina |
|
and Bowman, Samuel R. |
|
and Schwenk, Holger |
|
and Stoyanov, Veselin", |
|
title = "XNLI: Evaluating Cross-lingual Sentence Representations", |
|
booktitle = "Proceedings of the 2018 Conference on Empirical Methods |
|
in Natural Language Processing", |
|
year = "2018", |
|
publisher = "Association for Computational Linguistics", |
|
location = "Brussels, Belgium", |
|
} |
|
""" |
|
|
|
### Groups and Tasks |
|
|
|
#### Groups |
|
|
|
* `xnli` |
|
|
|
#### Tasks |
|
|
|
* `xnli_ar`: Arabic |
|
* `xnli_bg`: Bulgarian |
|
* `xnli_de`: German |
|
* `xnli_el`: Greek |
|
* `xnli_en`: English |
|
* `xnli_es`: Spanish |
|
* `xnli_fr`: French |
|
* `xnli_hi`: Hindi |
|
* `xnli_ru`: Russian |
|
* `xnli_sw`: Swahili |
|
* `xnli_th`: Thai |
|
* `xnli_tr`: Turkish |
|
* `xnli_ur`: Urdu |
|
* `xnli_vi`: Vietnamese |
|
* `xnli_zh`: Chinese |
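
As a hedged usage sketch, assuming a recent version of the harness, the tasks above can be run through the Python entry point (the model name below is only an example):

```python
import lm_eval

# Evaluate two XNLI languages with a small Hugging Face model.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["xnli_en", "xnli_fr"],
)
print(results["results"])
```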
|
|
|
### Checklist |
|
|
|
For adding novel benchmarks/datasets to the library: |
|
* [ ] Is the task an existing benchmark in the literature? |
|
* [ ] Have you referenced the original paper that introduced the task? |
|
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test? |
|
|
|
|
|
If other tasks on this dataset are already supported: |
|
* [ ] Is the "Main" variant of this task clearly denoted? |
|
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates? |
|
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant? |
|
|