	---
	language:
	- en
	- cs
	license: cc-by-4.0
	metrics:
	- bleurt
	- bleu
	- bertscore
	---
	# AlignScoreCS

	AlignScoreCS is a multi-task, multilingual model that assesses the factual consistency of context-claim pairs across several Natural Language Understanding (NLU) tasks,
	including Summarization, Question Answering (QA), Semantic Textual Similarity (STS), Paraphrase, Fact Verification (FV), and Natural Language Inference (NLI).
	It is fine-tuned on a multi-task dataset of 7 million documents covering these NLU tasks in both Czech and English.
	Because the underlying encoder was pre-trained multilingually, the model may also be usable in other languages. The architecture supports regression,
	binary classification, and ternary classification; for evaluation, we recommend using the AlignScore function.

	This work is influenced by its English counterpart [AlignScore: Evaluating Factual Consistency with a Unified Alignment Function](https://arxiv.org/abs/2305.16739).
	However, we employed homogeneous batches instead of heterogeneous ones during training and utilized three distinct architectures sharing a single encoder.
	This setup allows for the independent use of each architecture with its classification head.
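
To illustrate what homogeneous batching means here, the following is a minimal sketch in which every batch contains samples from a single training task (regression, binary, or ternary). The sample layout (`task` and `id` keys) and the function name `homogeneous_batches` are hypothetical and are not taken from the AlignScoreCS training code.

```python
import random
from collections import defaultdict


def homogeneous_batches(samples, batch_size, seed=0):
    """Return (task, batch) pairs where each batch holds samples of one task only."""
    rng = random.Random(seed)

    # Group the samples by their training task.
    by_task = defaultdict(list)
    for sample in samples:
        by_task[sample["task"]].append(sample)

    # Cut each task's samples into batches; no batch mixes tasks.
    batches = []
    for task, items in by_task.items():
        rng.shuffle(items)
        for start in range(0, len(items), batch_size):
            batches.append((task, items[start:start + batch_size]))

    # Shuffle the batch order so tasks are interleaved across training steps.
    rng.shuffle(batches)
    return batches


# Hypothetical toy data: 5 regression, 4 binary, and 3 ternary samples.
samples = (
    [{"task": "regression", "id": i} for i in range(5)]
    + [{"task": "binary", "id": i} for i in range(4)]
    + [{"task": "ternary", "id": i} for i in range(3)]
)
batches = homogeneous_batches(samples, batch_size=2)
```

With homogeneous batches, each training step only needs the head that matches the batch's task, which fits a setup where one encoder is shared by several heads.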


	## Evaluation
	As in the AlignScore paper, we use the AlignScore function, which splits the context into chunks of roughly 350 tokens and the claim into sentences.
	Each context chunk is evaluated against each claim sentence, and the results are aggregated into a single consistency score.

	The AlignScoreCS model is built on three XLM-RoBERTa architectures sharing one encoder.
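
The chunk-and-aggregate procedure can be sketched as follows. This is a simplified illustration, not the code shipped with the model: it counts whitespace-separated words instead of model tokens, splits sentences with a naive regular expression, and uses a toy word-overlap scorer (`toy_pair_score`, a hypothetical stand-in) in place of the real model. The aggregation follows the AlignScore paper: take the maximum over context chunks for each claim sentence, then average over sentences.

```python
import re
from statistics import mean


def chunk_context(context, max_tokens=350):
    """Split the context into chunks of at most max_tokens whitespace tokens."""
    tokens = context.split()
    chunks = [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
    return chunks or [""]


def split_sentences(claim):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", claim.strip())
    return [p for p in parts if p]


def align_score(context, claim, pair_score, max_tokens=350):
    """Max over context chunks per claim sentence, then mean over sentences."""
    chunks = chunk_context(context, max_tokens)
    sentences = split_sentences(claim)
    if not sentences:
        return 0.0
    return mean(max(pair_score(c, s) for c in chunks) for s in sentences)


def toy_pair_score(chunk, sentence):
    """Stand-in scorer (assumption): fraction of sentence words found in the chunk."""
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    sentence_words = set(re.findall(r"\w+", sentence.lower()))
    if not sentence_words:
        return 0.0
    return len(sentence_words & chunk_words) / len(sentence_words)


context = "Prague is the capital of the Czech Republic. It lies on the Vltava river."
claim = "Prague is the capital. Prague lies on the Danube."
score = align_score(context, claim, toy_pair_score)
```

In practice, `pair_score` would be replaced by a call into the model for one chunk-sentence pair.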


	AlignScoreCS is a multi-task multilingual model for assessing factual consistency in various NLU tasks in Czech and English. We followed the original AlignScore paper (https://arxiv.org/abs/2305.16739).
	We trained the model from the xlm-roberta-large checkpoint ([xlm-roberta](https://huggingface.co/FacebookAI/xlm-roberta-large)), using a shared encoder with three linear layers for regression,
	binary classification, and ternary classification.
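
As a rough illustration of how the raw output of each of the three heads could be turned into a score between 0 and 1, consider the sketch below. The class indices for the "aligned" label (`ALIGNED_BINARY`, `ALIGNED_TERNARY`), the clamping of the regression output, and the function name `head_to_score` are assumptions made for this example; check the model's own code for the actual label order and post-processing.

```python
import math

# Assumed positions of the "aligned" class in each head's logits (hypothetical).
ALIGNED_BINARY = 1
ALIGNED_TERNARY = 0


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_to_score(head, output):
    """Map one head's raw output (a list of floats) to a score in [0, 1]."""
    if head == "regression":
        # Single value; clamp to [0, 1] for comparability (illustrative choice).
        return min(1.0, max(0.0, output[0]))
    if head == "binary":
        # Two logits; use the probability of the aligned class.
        return softmax(output)[ALIGNED_BINARY]
    if head == "ternary":
        # Three logits; use the probability of the aligned class.
        return softmax(output)[ALIGNED_TERNARY]
    raise ValueError(f"unknown head: {head}")


binary_score = head_to_score("binary", [0.0, 0.0])
ternary_score = head_to_score("ternary", [0.0, 0.0, 0.0])
regression_score = head_to_score("regression", [1.7])
```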


	# Usage
	```python
	# Assumes the attached Files_and_versions/AlignScore.py file has been copied locally for use with transformers.
	from AlignScoreCS import AlignScoreCS
	alignScoreCS = AlignScoreCS.from_pretrained("krotima1/AlignScoreCS")
	# Optionally move the model to a CUDA device to speed up inference.
	print(alignScoreCS.score(context="This is context", claim="This is claim"))
	```

	# Results



	# Training datasets
	The following table shows the datasets that were used to train the model. We translated the English datasets into Czech using SeamlessM4T.

	| NLP Task | Dataset | Training Task | Context (n words) | Claim (n words) | Sample Count (Cs) | Sample Count (En) |
	|-----------------------|-----------------|---------------|-------------------|-----------------|-------------------|-------------------|
	| NLI | SNLI | 3-way | 10 | 13 | 500k | 550k |
	| | MultiNLI | 3-way | 16 | 20 | 393k | 393k |
	| | Adversarial NLI | 3-way | 48 | 54 | 163k | 163k |
	| | DocNLI | 2-way | 97 | 285 | 200k | 942k |
	| Fact Verification | NLI-style FEVER | 3-way | 48 | 50 | 208k | 208k |
	| | Vitamin C | 3-way | 23 | 25 | 371k | 371k |
	| Paraphrase | QQP | 2-way | 9 | 11 | 162k | 364k |
	| | PAWS | 2-way | - | 18 | - | 707k |
	| | PAWS labeled | 2-way | 18 | - | 49k | - |
	| | PAWS unlabeled | 2-way | 18 | - | 487k | - |
	| STS | SICK | reg | - | 10 | - | 4k |
	| | STS Benchmark | reg | - | 10 | - | 6k |
	| | Free-N1 | reg | 18 | - | 20k | - |
	| QA | SQuAD v2 | 2-way | 105 | 119 | 130k | 130k |
	| | RACE | 2-way | 266 | 273 | 200k | 351k |
	| Information Retrieval | MS MARCO | 2-way | 49 | 56 | 200k | 5M |
	| Summarization | WikiHow | 2-way | 434 | 508 | 157k | 157k |
	| | SumAug | 2-way | - | - | - | - |
