	# Evaluation Guidelines

	## 1. Selected Metrics

	### 1.1 Correctness
	Combines elements of:
	- coverage: the portion of vital information in the ground truth answer, as identified by a powerful LLM, that is covered by the generated answer. This metric is heavily inspired by the work in [1].
	- relevance: the portion of the generated response that directly addresses the question, regardless of its factual correctness.

	Graded on a continuous scale with the following representative points:
	- 2: Correct and relevant (no irrelevant information)
	- 1: Correct but contains irrelevant information
	- 0: No answer provided (abstention)
	- -1: Incorrect answer
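
As a rough illustration of the coverage component, the sketch below computes the fraction of vital ground-truth facts that a generated answer covers. The function name `coverage_score`, the input shapes, and the plain-fraction formula are assumptions made for illustration. The guidelines do not specify how the LLM judge identifies vital information, nor how coverage and relevance are combined into the correctness grade.

```python
def coverage_score(vital_facts: list[str], covered_facts: set[str]) -> float:
    """Fraction of vital ground-truth facts that the answer covers.

    Hypothetical helper: in the real evaluation an LLM judge decides which
    facts are vital and which are covered; here both are given as inputs.
    """
    if not vital_facts:
        return 0.0  # assumption: with no vital facts there is nothing to cover
    hits = sum(1 for fact in vital_facts if fact in covered_facts)
    return hits / len(vital_facts)


vital = ["fact A", "fact B", "fact C", "fact D"]
covered = {"fact A", "fact C", "fact D"}
print(coverage_score(vital, covered))  # 0.75
```

Here three of the four vital facts are covered, so the illustrative coverage is 0.75.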

	### 1.2 Faithfulness
	Assesses whether the response is grounded in the retrieved passages. This metric reimplements the work discussed in [2].

	Graded on a continuous scale with the following representative points:
	- 1: Full support. All answer parts are grounded
	- 0: Partial support. Not all answer parts are grounded
	- -1: No support. None of the answer parts are grounded
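
One way to read this continuous scale is as a function of the share of answer parts that the retrieved passages support. The sketch below assumes a linear mapping from that share onto [-1, 1]. Both the linear form and the name `faithfulness_score` are illustrative assumptions, not the challenge's actual implementation, which follows [2].

```python
def faithfulness_score(supported_parts: int, total_parts: int) -> float:
    """Map the share of grounded answer parts onto the [-1, 1] scale.

    Assumed linear mapping: all parts grounded -> 1, none grounded -> -1.
    """
    if total_parts <= 0:
        raise ValueError("total_parts must be positive")
    if not 0 <= supported_parts <= total_parts:
        raise ValueError("supported_parts must be between 0 and total_parts")
    share = supported_parts / total_parts
    return 2.0 * share - 1.0


print(faithfulness_score(4, 4))  # 1.0  (full support)
print(faithfulness_score(2, 4))  # 0.0  (partial support)
print(faithfulness_score(0, 4))  # -1.0 (no support)
```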

	### 1.3 Aggregation of Metrics
	Both correctness and faithfulness will contribute to the final evaluation score.

	## 2. Manual and Automated Evaluation

	### 2.1 First Stage
	- Automated evaluation by a state-of-the-art LLM, using correctness and faithfulness metrics to rank the participant teams.

	### 2.2 Final Stage
	- Manual evaluation for the top-ranked submissions (e.g., top 10 teams) to determine winners.

	## 3. Other Notable Points
	- Answer length is unlimited, but only the first 300 words will be evaluated.
	- Participants will submit (see Answer file [json schema](Answer_File.json.schema) and [example](Answer_File_Example.json)):
	  - The question ID.
	  - The question.
	  - The answer.
	  - Supporting passages in decreasing order of importance, with their respective FineWeb doc-IDs.
	  - The full prompt used for generation.
	- Remarks:
	  - The number of supporting passages is unlimited, but only the first 10 will be considered by the Faithfulness metric.
	  - We accept partial submissions where not all questions are answered.
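
The two limits above can be checked locally before submitting. The sketch below shows which part of an answer and which passages would be considered under those limits. The helper name `evaluated_view`, the whitespace-based word split, and the `doc_id` / `passage` field names are assumptions for illustration. The authoritative field names are those in the Answer file JSON schema, and the evaluators' exact word-counting rule is not stated here.

```python
MAX_EVALUATED_WORDS = 300     # limit stated in the guidelines
MAX_EVALUATED_PASSAGES = 10   # limit stated in the guidelines


def evaluated_view(answer: str, passages: list[dict]) -> tuple[str, list[dict]]:
    """Return the portion of a submission that falls within the stated limits.

    Assumption: words are whitespace-separated tokens, and passages are
    already sorted in decreasing order of importance.
    """
    words = answer.split()
    truncated_answer = " ".join(words[:MAX_EVALUATED_WORDS])
    return truncated_answer, passages[:MAX_EVALUATED_PASSAGES]


# Hypothetical submission content: 350 words and 12 supporting passages.
answer = " ".join(f"w{i}" for i in range(350))
passages = [{"doc_id": f"doc-{i}", "passage": f"passage text {i}"} for i in range(12)]

text, kept = evaluated_view(answer, passages)
print(len(text.split()), len(kept))  # 300 10
```

Since anything past these limits is ignored, the most important content and the strongest supporting passages should come first.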


	These measures align the evaluation framework with the challenge's emphasis on retrieval-augmented systems.

	## References

	[1] The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models. Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin. TREC 2024 RAG Track

	[2] RAGAs: Automated Evaluation of Retrieval Augmented Generation. Shahul Es, Jithin James, Luis Espinosa Anke, Steven Schockaert. EACL 2024
