Evaluation Guidelines
1. Selected Metrics
1.1 Correctness
Combines elements of:
- coverage: the portion of vital information in the ground-truth answer (as identified by a strong LLM) that is covered by the generated answer. This metric is strongly inspired by the work in [1].
- relevance: the portion of the generated response that directly addresses the question, regardless of its factual correctness.
Graded on a continuous scale with the following representative points:
- 2: Correct and relevant (no irrelevant information)
- 1: Correct but contains irrelevant information
- 0: No answer provided (abstention)
- -1: Incorrect answer
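As an illustration, the sketch below shows how coverage and relevance judgments could be folded into this scale once an LLM judge has labelled (a) which ground-truth nuggets the answer covers and (b) which answer sentences address the question. The judging step, the function name, and the exact mapping are assumptions for illustration only; the official scoring implementation may differ.

```python
def correctness_score(nugget_covered: list[bool], sentence_relevant: list[bool]) -> float:
    """Map nugget-coverage and sentence-relevance judgments onto the -1..2 scale (illustrative)."""
    if not sentence_relevant:                 # empty answer -> abstention
        return 0.0
    coverage = sum(nugget_covered) / len(nugget_covered) if nugget_covered else 0.0
    relevance = sum(sentence_relevant) / len(sentence_relevant)
    if coverage == 0.0:                       # nothing from the ground truth is covered
        return -1.0
    # A fully covered, fully relevant answer scores 2; covered answers padded with
    # irrelevant text drift toward 1; partial coverage scales the score down.
    return coverage * (1.0 + relevance)

# Example: 3 of 4 vital nuggets covered, 4 of 5 answer sentences on-topic.
print(correctness_score([True, True, True, False], [True, True, True, True, False]))
```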
1.2 Faithfulness
Assesses whether the response is grounded in the retrieved passages. This metric reimplements the work discussed in [2].
Graded on a continuous scale with the following representative points:
- 1: Full support. All answer parts are grounded
- 0: Partial support. Not all answer parts are grounded
- -1: No support. No answer parts are grounded
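The sketch below illustrates a RAGAs-style mapping from per-statement support verdicts onto this scale. A trivial substring heuristic stands in for the LLM support judge, and all names are assumptions rather than the official implementation.

```python
def is_supported(statement: str, passages: list[str]) -> bool:
    """Placeholder support check; the actual judge is an LLM, not a substring match."""
    return any(statement.lower() in p.lower() for p in passages)

def faithfulness_score(answer_statements: list[str], passages: list[str]) -> int:
    # Only the first 10 supporting passages are considered (see the remarks in Section 3).
    verdicts = [is_supported(s, passages[:10]) for s in answer_statements]
    if verdicts and all(verdicts):
        return 1    # full support: every answer part is grounded
    if any(verdicts):
        return 0    # partial support: some answer parts are grounded
    return -1       # no support: no answer part is grounded

print(faithfulness_score(
    ["revenue grew 12% in 2023"],
    ["The filing notes that revenue grew 12% in 2023 on strong demand."],
))  # -> 1
```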
1.3 Aggregation of Metrics
Both correctness and faithfulness will contribute to the final evaluation score.
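The guidelines do not specify how the two metrics are weighted; the one-liner below is only a placeholder assuming an equal-weight combination.

```python
def final_score(correctness: float, faithfulness: float) -> float:
    # Equal weights are an assumption; the organizers' aggregation may differ.
    return correctness + faithfulness
```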
2. Manual and Automated Evaluation
2.1 First Stage:
- Automated evaluation by a state-of-the-art LLM, using the correctness and faithfulness metrics to rank the participating teams.
2.2 Final Stage:
- Manual evaluation for the top-ranked submissions (e.g., top 10 teams) to determine winners.
3. Other Notable Points
- Answer length is unlimited, but only the first 300 words will be evaluated.
- Participants will submit (see the answer-file JSON schema and example):
- The Question ID.
- The Question.
- The answer.
- Supporting passages in decreasing order of importance, with their respective FinWeb doc-IDs.
- The full prompt used for generation.
- Remarks:
- The number of supporting passages is unlimited, but only the first 10 will be considered by the Faithfulness metric.
- We accept partial submissions in which not all questions are answered.
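For orientation only, here is a hypothetical sketch of one answer-file entry, written as a Python dict serialized to JSON. All field names and values are assumptions; the authoritative format is the official answer-file JSON schema mentioned above.

```python
import json

# Hypothetical field names; the official JSON schema is authoritative.
entry = {
    "question_id": "Q0001",
    "question": "What drove FooCorp's 2023 revenue growth?",   # illustrative question text
    "answer": "Revenue grew mainly because ...",               # only the first 300 words are evaluated
    "passages": [                                              # decreasing importance; first 10 count for Faithfulness
        {"doc_id": "finweb-000123", "text": "..."},
        {"doc_id": "finweb-000456", "text": "..."},
    ],
    "prompt": "You are a financial assistant. Answer using only the passages ...",
}

# One such object per answered question; partial submissions are allowed.
print(json.dumps(entry, indent=2))
```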
These measures align the evaluation framework with the challenge's emphasis on retrieval-augmented systems.
References
[1] The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models. Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin. TREC 2024 RAG Track
[2] RAGAs: Automated Evaluation of Retrieval Augmented Generation. Shahul Es, Jithin James, Luis Espinosa Anke, Steven Schockaert. EACL 2024