# About

As large language models (LLMs) continue to improve, evaluating how well they avoid hallucinations (producing information that is unfaithful or factually incorrect) has become increasingly important. While many models claim to be reliable, their factual grounding can vary significantly across tasks and settings.

This leaderboard provides a standardised evaluation of how different LLMs perform on hallucination detection tasks. Our goal is to help researchers and developers understand which models are more trustworthy in both grounded (context-based) and open-ended (real-world knowledge) settings. We use [Verify](https://platform.kluster.ai/verify) by [kluster.ai](https://platform.kluster.ai/), an automated hallucination detection tool, to evaluate the factual consistency of model outputs.

---

# Tasks

We evaluate each model using two benchmarks:

## Retrieval-Augmented Generation (RAG setting)

RAG evaluates how well a model stays faithful to a provided context when answering a question. The input consists of a synthetic or real context paired with a relevant question. Models are expected to generate answers using **only the information given**, without adding external knowledge or contradicting the context.

- **Source**: [HaluEval QA](https://huggingface.co/datasets/pminervini/HaluEval/viewer/qa?views%5B%5D=qa)  
- **Dataset Size**: 10,000 question-context pairs  
- **Prompt Format**: Prompt with relevant context document
- **Temperature**: 0 (to enforce deterministic, grounded outputs)  
- **System Prompt**: Instructs the model to only use the document and avoid guessing.
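
To make the data source concrete, here is a minimal sketch of loading this benchmark with the Hugging Face `datasets` library. The config name `qa` is taken from the dataset link above; the exact split and column names are not specified here and should be checked in the dataset viewer.

```python
from datasets import load_dataset

# Illustrative only: load the HaluEval "qa" config linked above.
halueval_qa = load_dataset("pminervini/HaluEval", "qa")

# Inspect the available splits and columns before building prompts.
print(halueval_qa)
```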

### System prompt

This is the system prompt used to generate the LLM output in the RAG setting:

```
You are an assistant for question-answering tasks.
Given the QUESTION and DOCUMENT you must answer the QUESTION using the information in the DOCUMENT.
You must not offer new information beyond the context provided in the DOCUMENT. Do not add any external knowledge.
The ANSWER also must not contradict information provided in the DOCUMENT.
If the DOCUMENT does not contain the facts to answer the QUESTION or you do not know the answer, you truthfully say that you do not know.
You have access to information provided by the user as DOCUMENT to answer the QUESTION, and nothing else.
Use three sentences maximum and keep the answer concise.
```

### Prompt format

Each prompt is formatted as follows:

```
DOCUMENT:
{context}

QUESTION:
{question}
```
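
As an illustration, the prompt string can be assembled from a context/question pair as follows. This is a minimal sketch: `build_rag_prompt`, `context`, and `question` are hypothetical names, not part of the leaderboard code.

```python
# Hypothetical helper: assemble the RAG prompt from a context/question pair.
def build_rag_prompt(context: str, question: str) -> str:
    return f"DOCUMENT:\n{context}\n\nQUESTION:\n{question}"

prompt = build_rag_prompt(
    context="Jonathan Stark (born April 3, 1971) is a former professional tennis player...",
    question="Which tennis player won more Grand Slam titles, Henri Leconte or Jonathan Stark?",
)
```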

### Message structure

The models use the following message structure:

```python
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]
```
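
For completeness, below is a minimal sketch of sending these messages through an OpenAI-compatible chat completions client with temperature 0, as used in this setting. The base URL, API key, and model name are placeholders, not the leaderboard's actual configuration.

```python
from openai import OpenAI

# Placeholder endpoint, key, and model; substitute your own provider details.
client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="example-model",  # placeholder model name
    messages=messages,      # system + user messages built above
    temperature=0,          # deterministic, grounded outputs (RAG setting)
)
answer = response.choices[0].message.content
```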


## Real-World Knowledge (Non-RAG setting)

This setting evaluates how factually accurate a model is when **no context is provided**. The model must rely solely on its internal knowledge to answer a broad range of user questions across many topics. The answers are then verified using web search to determine factual correctness.

- **Source**: Filtered from [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) prompts
- **Dataset Size**: 11,746 single-turn user queries  
- **Prompt Format**: Single user prompt without additional context  
- **Temperature**: 1 (to reflect natural, fluent generation)  
- **System Prompt**: Encourages helpfulness, accuracy, and honesty when unsure.

### System prompt

This is the system prompt used to generate the LLM output in the Non-RAG setting:

```
You are a helpful, factual, and concise assistant.
Always try to answer the user's question clearly and completely.
Do not make up information. If you are unsure or lack the knowledge, say so.
```

### Message structure
The message structure for the Non-RAG setting is the same as in the RAG setting; the user message contains only the question, with no accompanying document. A minimal sketch is shown below.
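
The sketch reuses the placeholder `client` and model name from the RAG example; `non_rag_system_prompt` is a hypothetical variable holding the system prompt above.

```python
# Non-RAG: the user message is the bare question, with no DOCUMENT block.
non_rag_messages = [
    {"role": "system", "content": non_rag_system_prompt},
    {"role": "user", "content": "What are the ingredients that made the most expensive donut so expensive?"},
]

response = client.chat.completions.create(
    model="example-model",   # placeholder model name
    messages=non_rag_messages,
    temperature=1,           # natural, fluent generation (Non-RAG setting)
)
```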

---

# Evaluation Method

We use **Verify**, a hallucination detection tool built by kluster.ai, to classify model outputs:

- In the **RAG setting**, Verify checks whether the output contradicts the input document, fabricates details, or strays from it.
- In the **real-world knowledge setting**, Verify uses search queries to fact-check the answer based on current, public information.

Each model's hallucination rate is computed as:

### Hallucination Rate = (Number of hallucinated outputs) / (Total number of prompts)


A **lower** hallucination rate indicates **better** performance.
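
As a concrete illustration, the rate can be computed from per-prompt labels like this (a minimal sketch, assuming one boolean label per prompt where `True` means Verify flagged a hallucination):

```python
def hallucination_rate(labels: list[bool]) -> float:
    """Fraction of outputs flagged as hallucinated (True = hallucination)."""
    return sum(labels) / len(labels)

# Example: 3 hallucinated outputs out of 10 prompts -> 0.3
print(hallucination_rate([True, False, False, True, False, False, True, False, False, False]))
```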

---
# Examples of Hallucination Detection

## RAG Example

Context
```
Jonathan Stark (born April 3, 1971) is a former professional tennis player from the United States. During his career he won two Grand Slam doubles titles (the 1994 French Open Men's Doubles and the 1995 Wimbledon Championships Mixed Doubles). He reached the men's singles final at the French Open in 1988, won the French Open men's doubles title in 1984, and helped France win the Davis Cup in 1991.
```

Prompt
```
Which tennis player won more Grand Slam titles, Henri Leconte or Jonathan Stark?
```

LLM Answer
```
Jonathan Stark won two Grand Slam titles, while Henri Leconte won four. Therefore, Henri Leconte won more Grand Slam titles.
```

Verify Label
```
Hallucination: Yes
```

Verify Response
```
The answer introduces new information about Henri Leconte's Grand Slam titles, which is not present in the document.
```

## Non-RAG Example

Prompt
```
What are the ingredients that made the most expensive donut so expensive?
```

LLM Answer
```
The most expensive donut, called "The Golden Donut," was made expensive by several factors:

24-karat gold leaf: Used for decoration.
```

Verify Label
```
Hallucination: No
```

Verify Response
```
The response mentions 'The Golden Donut' and states that it was made expensive by several factors, including 24-karat gold leaf used for decoration. The search results provide information about expensive donuts, including one called 'The Golden Donut' or similar names, which are associated with luxurious ingredients like 24-karat gold and Cristal champagne. The response correctly identifies 24-karat gold leaf as a factor contributing to the donut's expensiveness, which is supported by multiple search results. While the response simplifies the information, it does not introduce factually incorrect or fabricated details about the donut's ingredients.
```