
Evaluation Guidelines

1. Selected Metrics

1.1 Correctness

Combines elements of:

  • coverage: the portion of vital information in the ground-truth answer - as identified by a powerful LLM - that is covered by the generated answer. This metric is strongly inspired by the work in [1].
  • relevance: the portion of the generated response that directly addresses the question, regardless of its factual correctness.

Graded on a continuous scale with the following representative points:

  • 2: Correct and relevant (no irrelevant information)
  • 1: Correct but contains irrelevant information
  • 0: No answer provided (abstention)
  • -1: Incorrect answer
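
To make the scale concrete, the following Python sketch maps LLM coverage and relevance judgments onto it. The exact scoring formula is not published here, so the `correctness_score` helper and its weighting are assumptions for illustration only, not the official implementation.

```python
# Toy sketch only: the mapping below is an assumption made for intuition.
# The LLM judgments are passed in as plain Python values.

def correctness_score(answered: bool,
                      nuggets_covered: list[bool],
                      relevant_fraction: float) -> float:
    """Map LLM coverage/relevance judgments onto the -1..2 scale.

    answered          -- False if the system abstained from answering
    nuggets_covered   -- one boolean per vital nugget in the ground-truth
                         answer (assumed non-empty), True if the generated
                         answer covers that nugget
    relevant_fraction -- share of the generated answer (0.0-1.0) that
                         addresses the question, ignoring factual accuracy
    """
    if not answered:
        return 0.0                                  # abstention
    coverage = sum(nuggets_covered) / len(nuggets_covered)
    if coverage == 0.0:
        return -1.0                                 # incorrect answer
    # Between 1 (correct but padded with irrelevant text) and 2 (correct and
    # fully on-topic), discounted by how much vital information is missing.
    return coverage * (1.0 + relevant_fraction)


# Example: 3 of 4 vital nuggets covered, 80% of the answer on-topic -> 1.35.
print(correctness_score(True, [True, True, True, False], 0.8))
```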

1.2 Faithfulness

Assesses whether the response is grounded in the retrieved passages. This metric reimplements the work discussed in [2].

Graded on a continuous scale with the following representative points:

  • 1: Full support. All answer parts are grounded
  • 0: Partial support. Some, but not all, answer parts are grounded
  • -1: No support. No answer parts are grounded
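
For intuition, here is a conceptual Python sketch of a RAGAs-style grading in the spirit of [2]. The `is_supported` callable and the pre-extracted answer statements stand in for the LLM calls used by the actual reimplementation, whose details are not shown here.

```python
# Conceptual sketch, not the official reimplementation. `is_supported` is a
# hypothetical stand-in for the LLM call that checks whether one atomic claim
# from the answer is backed by the submitted passages.

from typing import Callable

def faithfulness_grade(answer_statements: list[str],
                       passages: list[str],
                       is_supported: Callable[[str, list[str]], bool]) -> int:
    """Map statement-level support onto the -1/0/1 scale."""
    context = passages[:10]                       # only the first 10 passages count
    supported = [is_supported(s, context) for s in answer_statements]
    if supported and all(supported):
        return 1       # full support: every answer part is grounded
    if any(supported):
        return 0       # partial support: some parts are not grounded
    return -1          # no support: nothing in the answer is grounded
```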

1.3 Aggregation of Metrics

Both correctness and faithfulness will contribute to the final evaluation score.
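
Purely as an illustration of how the two scales could be combined (the official aggregation and weights are not specified in these guidelines), the sketch below rescales each metric to [0, 1] and averages them with equal, assumed weights.

```python
# Illustrative only: the equal weighting below is an assumption, not the
# official aggregation used to rank teams.

def final_score(correctness: float, faithfulness: float) -> float:
    """Combine correctness (-1..2) and faithfulness (-1..1) with equal weight."""
    correctness_01 = (correctness + 1.0) / 3.0    # rescale -1..2 -> 0..1
    faithfulness_01 = (faithfulness + 1.0) / 2.0  # rescale -1..1 -> 0..1
    return 0.5 * correctness_01 + 0.5 * faithfulness_01
```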

2. Manual and Automated Evaluation

2.1 First Stage:

  • Automated evaluation by a state-of-the-art LLM, using correctness and faithfulness metrics to rank the participant teams.

2.2 Final Stage:

  • Manual evaluation for the top-ranked submissions (e.g., top 10 teams) to determine winners.

3. Other Notable Points

  • Answer length is unlimited, but only the first 300 words will be evaluated.
  • Participants will submit (see the Answer file JSON schema and example; an illustrative record is also sketched after this list):
    • The Question ID.
    • The Question.
    • The answer.
    • Supporting passages in decreasing order of importance, with their respective FinWeb doc-IDs.
    • The full prompt used for generation.
  • Remarks:
    • The number of supporting passages is unlimited, but only the first 10 will be considered by the Faithfulness metric.
    • Partial submissions, in which not all questions are answered, are accepted.

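For intuition, the following sketch writes one illustrative answer record in Python. The authoritative field names and types are those defined in the referenced Answer file JSON schema; the keys, the `answers.jsonl` filename, and the placeholder values below are assumptions.

```python
# Purely illustrative: field names are placeholders, not the official schema.

import json

record = {
    "question_id": 101,                                    # the Question ID
    "question": "What is retrieval-augmented generation?",
    "answer": "Retrieval-augmented generation (RAG) combines ...",
    "passages": [                                          # decreasing importance;
        {"doc_id": "<FinWeb-doc-id>", "text": "..."},      # only the first 10 count
        {"doc_id": "<FinWeb-doc-id>", "text": "..."},      # toward Faithfulness
    ],
    "final_prompt": "Answer the question using only the passages below: ...",
}

with open("answers.jsonl", "a", encoding="utf-8") as out:
    out.write(json.dumps(record) + "\n")
```
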
These measures align the evaluation framework with the challenge's emphasis on retrieval-augmented systems.

References

[1] The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models. Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin. TREC 2024 RAG Track

[2] RAGAs: Automated Evaluation of Retrieval Augmented Generation. Shahul Es, Jithin James, Luis Espinosa Anke, Steven Schockaert. EACL 2024