# Evaluation Guidelines
## 1. Selected Metrics
### 1.1 Correctness
Combines elements of:
- **coverage**: the portion of vital information in the ground truth answer, as identified by a powerful LLM, that is covered by the generated answer. This metric is inspired by the work in [1].
- **relevance**: the portion of the generated response that directly addresses the question, regardless of its factual correctness.
Graded on a continuous scale with the following representative points:
- **2:** Correct and relevant (no irrelevant information)
- **1:** Correct but contains irrelevant information
- **0:** No answer provided (abstention)
- **-1:** Incorrect answer
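As a rough illustration of the two ingredients above, the sketch below computes coverage and relevance as simple fractions. It assumes that an LLM judge has already labelled each vital piece of information ("nugget") as covered or not, and each unit of the response (here, hypothetically, a sentence) as relevant or not. The function names and the choice of sentences as the unit are assumptions, and the way the two values are combined into the final -1 to 2 score is not specified in these guidelines, so it is not shown.

```python
from typing import Sequence


def coverage(nugget_covered: Sequence[bool]) -> float:
    """Fraction of vital nuggets from the ground truth answer that the
    generated answer covers (judgments assumed to come from an LLM)."""
    if not nugget_covered:
        return 0.0
    return sum(nugget_covered) / len(nugget_covered)


def relevance(unit_relevant: Sequence[bool]) -> float:
    """Fraction of response units (assumed: sentences) that directly
    address the question, regardless of factual correctness."""
    if not unit_relevant:
        return 0.0
    return sum(unit_relevant) / len(unit_relevant)


# Hypothetical judgments: 3 of 4 nuggets covered, 1 of 2 sentences relevant.
example_coverage = coverage([True, True, False, True])
example_relevance = relevance([True, False])
```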
### 1.2 Faithfulness
Assesses whether the response is **grounded in the retrieved passages**. This metric reimplements the work discussed in [2].
Graded on a continuous scale with the following representative points:
- **1:** Full support. All answer parts are grounded
- **0:** Partial support. Not all answer parts are grounded
- **-1:** No support. No answer part is grounded
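The three representative points can be read as a simple rule over per-part grounding judgments. The sketch below is a minimal illustration only: it assumes the answer has already been split into parts and that each part has been judged as grounded or not in the retrieved passages. The function name is hypothetical, and the actual metric is graded on a continuous scale that this sketch does not reproduce.

```python
from typing import Sequence


def faithfulness_anchor(parts_grounded: Sequence[bool]) -> int:
    """Map per-part grounding judgments to the nearest representative point.

    1  -> every answer part is grounded (full support)
    0  -> some, but not all, parts are grounded (partial support)
    -1 -> no answer part is grounded (no support)
    """
    if not parts_grounded:
        raise ValueError("at least one answer part is required")
    if all(parts_grounded):
        return 1
    if any(parts_grounded):
        return 0
    return -1
```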
### 1.3 Aggregation of Metrics
Both **correctness** and **faithfulness** will contribute to the final evaluation score.
## 2. Manual and Automated Evaluation
### 2.1 First Stage
- Automated evaluation by a state-of-the-art LLM, using the **correctness** and **faithfulness** metrics to rank the participating teams.
### 2.2 Final Stage
- **Manual evaluation** for the top-ranked submissions (e.g., **top 10 teams**) to determine winners.
## 3. Other Notable Points
- Answer length is **unlimited** but only the first **300 words** will be evaluated.
- Participants will submit (see the answer file [JSON schema](Answer_File.json.schema) and [example](Answer_File_Example.json)):
  - **The Question ID**.
  - **The Question**.
  - **The answer**.
  - **Supporting passages in decreasing order of importance, with their respective FinWeb doc-IDs**.
  - **The full prompt used for generation**.
- Remarks:
  - The number of supporting passages is unlimited, but only the first 10 will be considered by the Faithfulness metric.
  - We accept partial submissions where not all questions are answered.
These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.
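The two limits stated above (the first 300 words of the answer and the first 10 supporting passages) can be previewed before submitting. The sketch below is only an approximation: splitting on whitespace is an assumption about how words are counted, and the helper names are hypothetical, so the organizers' actual truncation may differ.

```python
from typing import List, Sequence

MAX_EVALUATED_WORDS = 300     # from the guidelines: only the first 300 words are evaluated
MAX_EVALUATED_PASSAGES = 10   # from the guidelines: only the first 10 passages are considered


def evaluated_answer(answer: str, max_words: int = MAX_EVALUATED_WORDS) -> str:
    """Return the portion of the answer that would be evaluated,
    assuming words are delimited by whitespace."""
    return " ".join(answer.split()[:max_words])


def evaluated_passages(passages: Sequence[str],
                       max_passages: int = MAX_EVALUATED_PASSAGES) -> List[str]:
    """Return the passages seen by the Faithfulness metric, assuming the
    input is already sorted by decreasing importance."""
    return list(passages[:max_passages])
```

Because passages beyond the tenth are ignored by the Faithfulness metric, the ordering by importance matters: any evidence that the answer relies on should appear among the first 10.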
## References
[1] The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models. Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin. TREC 2024 RAG Track.
[2] RAGAs: Automated Evaluation of Retrieval Augmented Generation. Shahul Es, Jithin James, Luis Espinosa Anke, Steven Schockaert. EACL 2024.