<!--
keywords: LLM hallucination leaderboard submission, Verify leaderboard guidelines, kluster.ai, hallucination benchmark contributions, large language model evaluation submission
-->

# LLM Hallucination Detection Leaderboard Submission Guidelines

Thank you for your interest in contributing to the **LLM Hallucination Detection Leaderboard**! We welcome submissions from researchers and practitioners who have built or fine-tuned language models that can be evaluated on our hallucination benchmarks.

---

## 1. What to Send

Please email **ryan@kluster.ai** with the subject line:

```
[Verify Leaderboard Submission] <Your-Model-Name>
```

Attach **one ZIP file** containing **all of the following** (see the packaging sketch after the list):

1. **`model_card.md`**: A short Markdown file describing your model:  
   • Name and version  
   • Architecture / base model  
   • Training or finetuning procedure  
   • License  
   • Intended use & known limitations  
   • Contact information
2. **`results.csv`**: A CSV file with **one row per prompt** and **one column per field** (see schema below).
3. (Optional) **`extra_notes.md`**: Anything else you would like us to know (e.g., additional analysis).
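
If helpful, here is a minimal Python sketch (not an official script) for bundling the files; the archive name `my-model-submission.zip` is a placeholder:

```python
import os
import zipfile

# File names follow the list above; extra_notes.md is optional.
FILES = ["model_card.md", "results.csv", "extra_notes.md"]

with zipfile.ZipFile("my-model-submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in FILES:
        if os.path.exists(name):  # skip the optional file if absent
            zf.write(name)
```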

---

## 2. CSV Schema

| Column            | Description |
|-------------------|-------------|
| `request`         | The exact input request provided to the model. This must follow the same request structure and prompt format as described in the Details section. |
| `response`        | The raw output produced by the model. |
| `verify_response` | The Verify judgment or explanation regarding hallucination. |
| `verify_label`    | The final boolean/categorical label (e.g., `TRUE`, `FALSE`). |
| `task`            | The benchmark or dataset name the sample comes from. |

**Important:** Use UTF-8 encoding and **do not** add additional columns without prior discussion; put any extra information in the optional `extra_notes.md` file instead. You must use Verify by kluster.ai to produce the `verify_response` and `verify_label` columns so that leaderboard results remain fair and comparable across submissions.
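
Below is a minimal sketch of writing a schema-conformant `results.csv` with Python's standard `csv` module; the row content is illustrative only:

```python
import csv

# Column order matches the schema table above.
FIELDS = ["request", "response", "verify_response", "verify_label", "task"]

rows = [
    {
        "request": "What is the capital of the UK?",
        "response": "London is the capital of the UK.",
        "verify_response": "The statement is factually correct.",
        "verify_label": "TRUE",
        "task": "HaluEval QA",
    }
]

# newline="" is required for the csv module; encoding must be UTF-8.
with open("results.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```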

---

## 3. Evaluation Datasets

Run your model on the following public datasets and include *all* examples in your CSV. You can load them directly from Hugging Face (a loading sketch follows the table):

| Dataset | Hugging Face Link |
|---------|-------------------|
| HaluEval QA (`qa_samples` subset with the question and knowledge columns) | https://huggingface.co/datasets/pminervini/HaluEval |
| UltraChat | https://huggingface.co/datasets/kluster-ai/ultrachat-sampled |
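
A loading sketch using the Hugging Face `datasets` library, assuming the `qa_samples` configuration name from the table (check each dataset card if configuration or split names have changed):

```python
from datasets import load_dataset

# HaluEval QA: the qa_samples subset named in the table above.
halueval_qa = load_dataset("pminervini/HaluEval", "qa_samples")

# UltraChat: kluster.ai's sampled version.
ultrachat = load_dataset("kluster-ai/ultrachat-sampled")

print(halueval_qa)
print(ultrachat)
```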

---

## 4. Example Row

```csv
request,response,verify_response,verify_label,task
"What is the capital of the UK?","London is the capital of the UK.","The statement is factually correct.",TRUE,TruthfulQA
```

---

## 5. Review Process

1. We will sanity-check the file format (a do-it-yourself version of this check is sketched below) and reproduce a random subset.
2. If everything looks good, your scores will appear on the public leaderboard.
3. We may reach out for clarifications, so please keep an eye on your inbox.
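
Before emailing, you can run a rough self-check along these lines. This is not our official review pipeline, just a sketch that verifies the header, UTF-8 decodability, and that no field is empty:

```python
import csv

EXPECTED = ["request", "response", "verify_response", "verify_label", "task"]

# Opening with encoding="utf-8" will raise UnicodeDecodeError on bad bytes.
with open("results.csv", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    assert reader.fieldnames == EXPECTED, f"bad header: {reader.fieldnames}"
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        missing = [k for k in EXPECTED if not row.get(k)]
        assert not missing, f"line {line_no}: empty fields {missing}"

print("results.csv passes the basic format check")
```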

---

## 6. Contact

Questions? Email **ryan@kluster.ai** or join our Discord [here](https://discord.com/invite/klusterai).

We look forward to your submissions and to advancing reliable language models together!