
Evaluation Guidelines

1. Selected Metrics

1.1 Correctness

Combines elements of:

  • coverage: the portion of vital information in the ground-truth answer - as identified by a powerful LLM - that is covered by the generated answer. This metric is strongly inspired by the work in [1].
  • relevance: the portion of the generated response that directly addresses the question, regardless of its factual correctness.

Graded on a continuous scale with the following representative points:

  • 2: Correct and relevant (no irrelevant information)
  • 1: Correct but contains irrelevant information
  • 0: No answer provided (abstention)
  • -1: Incorrect answer
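
To make the scale concrete, the following Python sketch maps LLM coverage and relevance judgments onto it. The exact scoring formula is not published here, so the `correctness_score` helper and its weighting are assumptions for illustration only, not the official implementation.

```python
# Toy sketch only: the mapping below is an assumption made for intuition.
# The LLM judgments are passed in as plain Python values.

def correctness_score(answered: bool,
                      nuggets_covered: list[bool],
                      relevant_fraction: float) -> float:
    """Map LLM coverage/relevance judgments onto the -1..2 scale.

    answered          -- False if the system abstained from answering
    nuggets_covered   -- one boolean per vital nugget in the ground-truth
                         answer (assumed non-empty), True if the generated
                         answer covers that nugget
    relevant_fraction -- share of the generated answer (0.0-1.0) that
                         addresses the question, ignoring factual accuracy
    """
    if not answered:
        return 0.0                                  # abstention
    coverage = sum(nuggets_covered) / len(nuggets_covered)
    if coverage == 0.0:
        return -1.0                                 # incorrect answer
    # Between 1 (correct but padded with irrelevant text) and 2 (correct and
    # fully on-topic), discounted by how much vital information is missing.
    return coverage * (1.0 + relevant_fraction)


# Example: 3 of 4 vital nuggets covered, 80% of the answer on-topic -> 1.35.
print(correctness_score(True, [True, True, True, False], 0.8))
```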

1.2 Faithfulness

Assesses whether the response is grounded in the retrieved passages. This metric reimplements the work discussed in [2].

Graded on a continuous scale with the following representative points:

  • 1: Full support. All answer parts are grounded
  • 0: Partial support. Some, but not all, answer parts are grounded
  • -1: No support. No answer parts are grounded
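
For intuition, here is a conceptual Python sketch of a RAGAs-style grading in the spirit of [2]. The `is_supported` callable and the pre-extracted answer statements stand in for the LLM calls used by the actual reimplementation, whose details are not shown here.

```python
# Conceptual sketch, not the official reimplementation. `is_supported` is a
# hypothetical stand-in for the LLM call that checks whether one atomic claim
# from the answer is backed by the submitted passages.

from typing import Callable

def faithfulness_grade(answer_statements: list[str],
                       passages: list[str],
                       is_supported: Callable[[str, list[str]], bool]) -> int:
    """Map statement-level support onto the -1/0/1 scale."""
    context = passages[:10]                       # only the first 10 passages count
    supported = [is_supported(s, context) for s in answer_statements]
    if supported and all(supported):
        return 1       # full support: every answer part is grounded
    if any(supported):
        return 0       # partial support: some parts are not grounded
    return -1          # no support: nothing in the answer is grounded
```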

1.3 Aggregation of Metrics

Both correctness and faithfulness will contribute to the final evaluation score.
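
Purely as an illustration of how the two scales could be combined (the official aggregation and weights are not specified in these guidelines), the sketch below rescales each metric to [0, 1] and averages them with equal, assumed weights.

```python
# Illustrative only: the equal weighting below is an assumption, not the
# official aggregation used to rank teams.

def final_score(correctness: float, faithfulness: float) -> float:
    """Combine correctness (-1..2) and faithfulness (-1..1) with equal weight."""
    correctness_01 = (correctness + 1.0) / 3.0    # rescale -1..2 -> 0..1
    faithfulness_01 = (faithfulness + 1.0) / 2.0  # rescale -1..1 -> 0..1
    return 0.5 * correctness_01 + 0.5 * faithfulness_01
```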

2. Manual and Automated Evaluation

2.1 First Stage:

  • Automated evaluation by a state-of-the-art LLM, using correctness and faithfulness metrics to rank the participant teams.

2.2 Final Stage:

  • Manual evaluation for the top-ranked submissions (e.g., top 10 teams) to determine winners.

3. Other Notable Points

  • Answer length is unlimited, but only the first 300 words will be evaluated.
  • Participants will submit (see the Answer file JSON schema and example; an illustrative record is also sketched after this list):
    • The Question ID.
    • The Question.
    • The answer.
    • Supporting passages in decreasing order of importance, with their respective FinWeb doc-IDs.
    • The full prompt used for generation.
  • Remarks:
    • The number of supporting passages is unlimited, but only the first 10 will be considered by the Faithfulness metric.
    • Partial submissions, in which not all questions are answered, are accepted.

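For intuition, the following sketch writes one illustrative answer record in Python. The authoritative field names and types are those defined in the referenced Answer file JSON schema; the keys, the `answers.jsonl` filename, and the placeholder values below are assumptions.

```python
# Purely illustrative: field names are placeholders, not the official schema.

import json

record = {
    "question_id": 101,                                    # the Question ID
    "question": "What is retrieval-augmented generation?",
    "answer": "Retrieval-augmented generation (RAG) combines ...",
    "passages": [                                          # decreasing importance;
        {"doc_id": "<FinWeb-doc-id>", "text": "..."},      # only the first 10 count
        {"doc_id": "<FinWeb-doc-id>", "text": "..."},      # toward Faithfulness
    ],
    "final_prompt": "Answer the question using only the passages below: ...",
}

with open("answers.jsonl", "a", encoding="utf-8") as out:
    out.write(json.dumps(record) + "\n")
```
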
These measures align the evaluation framework with the challenge's emphasis on retrieval-augmented systems.

References

[1] The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models. Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin. TREC 2024 RAG Track

[2] RAGAs: Automated Evaluation of Retrieval Augmented Generation. Shahul Es, Jithin James, Luis Espinosa Anke, Steven Schockaert. EACL 2024