|
<!-- |
|
keywords: LLM hallucination leaderboard submission, Verify leaderboard guidelines, kluster.ai, hallucination benchmark contributions, large language model evaluation submission |
|
--> |
|
|
|
# LLM Hallucination Detection Leaderboard Submission Guidelines |
|
|
|
Thank you for your interest in contributing to the **LLM Hallucination Detection Leaderboard**! We welcome submissions from researchers and practitioners who have built or finetuned language models that can be evaluated on our hallucination benchmarks. |
|
|
|
--- |
|
|
|
## 1. What to Send |
|
|
|
Please email **ryan@kluster.ai** with the subject line: |
|
|
|
``` |
|
[Verify Leaderboard Submission] <Your-Model-Name> |
|
``` |
|
|
|
Attach **one ZIP file** that contains **all of the following**: |
|
|
|
1. **`model_card.md`**: A short Markdown file describing your model:
   - Name and version
   - Architecture / base model
   - Training or finetuning procedure
   - License
   - Intended use & known limitations
   - Contact information
|
2. **`results.csv`**: A CSV file with **one row per prompt** and **one column per field** (see schema below). |
|
3. (Optional) **`extra_notes.md`**: Anything else you would like us to know (e.g., additional analysis). |
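
For reference, a submission archive might be laid out as follows (the archive name is up to you; include `extra_notes.md` only if you have something to add):

```
submission.zip
├── model_card.md
├── results.csv
└── extra_notes.md
```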
|
|
|
--- |
|
|
|
## 2. CSV Schema |
|
|
|
| Column | Description | |
|
|--------------------|---------------------------------------------------------------------------| |
|
| `request`          | The exact input request provided to the model. This must follow the request structure and prompt format described in the Details section. |
|
| `response` | The raw output produced by the model. | |
|
| `verify_response` | The Verify judgment or explanation regarding hallucination. | |
|
| `verify_label` | The final boolean / categorical label (e.g., `TRUE`, `FALSE`). | |
|
| `task` | The benchmark or dataset name the sample comes from. | |
|
|
|
**Important:** Use UTF-8 encoding and **do not** add extra columns without prior discussion; any additional information belongs in `extra_notes.md`. To keep the leaderboard fair, all hallucination judgments must be produced with Verify by kluster.ai.
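
If it helps, here is a minimal Python sketch of one way to produce a schema-conformant `results.csv` with the standard `csv` module; the row contents and task name are purely illustrative:

```python
import csv

# The five required columns, in order (see the schema above).
FIELDS = ["request", "response", "verify_response", "verify_label", "task"]

# One dict per evaluated prompt; these values are illustrative placeholders,
# not real model or Verify outputs.
rows = [
    {
        "request": "What is the capital of the UK?",
        "response": "London is the capital of the UK.",
        "verify_response": "The statement is factually correct.",
        "verify_label": "TRUE",
        "task": "HaluEval QA",
    },
]

# newline="" is required by the csv module; the file must be UTF-8 encoded.
with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```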
|
|
|
--- |
|
|
|
## 3. Evaluation Datasets |
|
|
|
Run your model on the following public datasets and include *all* examples in your CSV. You can load them directly from Hugging Face: |
|
|
|
| Dataset | Hugging Face Link | |
|
|---------|-------------------| |
|
| HaluEval QA (`qa_samples` subset, with the Question and Knowledge columns) | https://huggingface.co/datasets/pminervini/HaluEval |
|
| UltraChat | https://huggingface.co/datasets/kluster-ai/ultrachat-sampled | |
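
As a starting point, both datasets can be pulled with the `datasets` library; the `qa_samples` config name below is inferred from the table above, so check the dataset pages if loading fails:

```python
from datasets import load_dataset  # pip install datasets

# HaluEval QA; the "qa_samples" config name is an assumption based on the
# subset mentioned above -- verify it on the dataset page if this errors.
halueval = load_dataset("pminervini/HaluEval", "qa_samples")

# Sampled UltraChat subset curated by kluster.ai.
ultrachat = load_dataset("kluster-ai/ultrachat-sampled")

print(halueval)
print(ultrachat)
```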
|
|
|
--- |
|
|
|
## 4. Example Row
|
|
|
```csv |
|
request,response,verify_response,verify_label,task |
|
"What is the capital of the UK?","London is the capital of the UK.","The statement is factually correct.",TRUE,TruthfulQA |
|
``` |
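
Before emailing, you may want to run a quick self-check along these lines (a rough sketch, assuming your file is named `results.csv`):

```python
import csv

EXPECTED = ["request", "response", "verify_response", "verify_label", "task"]

with open("results.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    assert reader.fieldnames == EXPECTED, f"unexpected columns: {reader.fieldnames}"
    for i, row in enumerate(reader, start=1):
        # Every field should be populated for every prompt.
        missing = [k for k, v in row.items() if not v]
        assert not missing, f"row {i} has empty fields: {missing}"

print("results.csv passes the basic schema check.")
```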
|
|
|
--- |
|
|
|
## 5. Review Process
|
|
|
1. We will sanity-check the file format and reproduce a random subset. |
|
2. If everything looks good, your scores will appear on the public leaderboard. |
|
3. We may reach out for clarifications, so please keep an eye on your inbox.
|
|
|
--- |
|
|
|
## 6. Contact
|
|
|
Questions? Email **ryan@kluster.ai** or join our Discord [here](https://discord.com/invite/klusterai). |
|
|
|
We look forward to your submissions and to advancing reliable language models together! |