
About

As large language models (LLMs) continue to improve, evaluating how well they avoid hallucinations (producing information that is unfaithful or factually incorrect) has become increasingly important. While many models are presented as reliable, their factual grounding can vary significantly across tasks and settings.

This leaderboard provides a standardised evaluation of how different LLMs perform on hallucination detection tasks. Our goal is to help researchers and developers understand which models are more trustworthy in both grounded (context-based) and open-ended (real-world knowledge) settings. We use Verify by kluster.ai, an automated hallucination detection tool, to evaluate the factual consistency of model outputs.


Tasks

We evaluate each model using two benchmarks:

Retrieval-Augmented Generation (RAG setting)

RAG evaluates how well a model stays faithful to a provided context when answering a question. The input consists of a synthetic or real context paired with a relevant question. Models are expected to generate answers using only the information given, without adding external knowledge or contradicting the context.

  • Source: HaluEval QA
  • Dataset Size: 10,000 question-context pairs
  • Prompt Format: Prompt with relevant context document
  • Temperature: 0 (to enforce deterministic, grounded outputs)
  • System Prompt: Instructs the model to only use the document and avoid guessing.

System prompt

This is the system prompt used to generate LLM outputs for the RAG setting:

You are an assistant for question-answering tasks.  
Given the QUESTION and DOCUMENT you must answer the QUESTION using the information in the DOCUMENT.  
You must not offer new information beyond the context provided in the DOCUMENT. Do not add any external knowledge.  
The ANSWER also must not contradict information provided in the DOCUMENT.  
If the DOCUMENT does not contain the facts to answer the QUESTION or you do not know the answer, you truthfully say that you do not know.  
You have access to information provided by the user as DOCUMENT to answer the QUESTION, and nothing else.  
Use three sentences maximum and keep the answer concise.

Prompt format

Each prompt is formatted as follows:

DOCUMENT:
{context}

QUESTION:
{question}
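
As a small illustration, the template above can be filled in with simple Python string formatting. The values of context and question below are placeholders standing in for the fields of a HaluEval QA record (here taken from the example shown later in this page), not part of the actual pipeline code:

# Placeholder values standing in for a HaluEval QA record.
context = "Jonathan Stark (born April 3, 1971) is a former professional tennis player from the United States. ..."
question = "Which tennis player won more Grand Slam titles, Henri Leconte or Jonathan Stark?"

# Fill the prompt template shown above.
prompt = f"DOCUMENT:\n{context}\n\nQUESTION:\n{question}"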

Message structure

The models use the following message structure:

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]
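
For reference, here is a minimal sketch of how a completion could be requested with this message structure, assuming an OpenAI-compatible chat completions endpoint. The base URL, API key, and model name are placeholders, not the exact values used for the leaderboard:

from openai import OpenAI

# Placeholder endpoint and credentials; substitute the provider serving the model under evaluation.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="model-under-evaluation",   # placeholder model name
    messages=messages,
    temperature=0,                    # deterministic, grounded outputs for the RAG setting
)
answer = response.choices[0].message.content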

Real-World Knowledge (Non-RAG setting)

This setting evaluates how factually accurate a model is when no context is provided. The model must rely solely on its internal knowledge to answer a broad range of user questions across many topics. The answers are then verified using web search to determine factual correctness.

  • Source: Filtered from UltraChat prompts
  • Dataset Size: 11,746 single-turn user queries
  • Prompt Format: Single user prompt without additional context
  • Temperature: 1 (to reflect natural, fluent generation)
  • System Prompt: Encourages helpfulness, accuracy, and honesty when unsure.

System prompt

This is the system prompt used to generate LLM outputs for the Non-RAG setting:

You are a helpful, factual, and concise assistant. 
Always try to answer the user's question clearly and completely. 
Do not make up information. If you are unsure or lack the knowledge, say so.

Message structure

The message structure for the Non-RAG setting is the same as in the RAG setting.
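
As a sketch, the only differences from the RAG call are the prompt contents (no DOCUMENT is attached) and the sampling temperature. This again assumes the same hypothetical OpenAI-compatible client shown above:

# The user prompt is the raw single-turn query, with no context document attached.
messages = [
    {"role": "system", "content": system_prompt},  # Non-RAG system prompt shown above
    {"role": "user", "content": "What are the ingredients that made the most expensive donut so expensive?"},
]

response = client.chat.completions.create(
    model="model-under-evaluation",   # placeholder model name
    messages=messages,
    temperature=1,                    # natural, fluent generation for the Non-RAG setting
)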


Evaluation Method

We use Verify, a hallucination detection tool built by kluster.ai, to classify model outputs:

  • In the RAG setting, Verify checks whether the output contradicts the input document, fabricates details, or strays beyond it.
  • In the real-world knowledge setting, Verify uses search queries to fact-check the answer based on current, public information.

Each model's hallucination rate is computed as:

Hallucination Rate = (Number of hallucinated outputs) / (Total number of prompts)

A lower hallucination rate indicates better performance.
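
To make the computation concrete, here is a minimal sketch that tallies per-output labels into a hallucination rate. The verify_labels list is a hypothetical stand-in for the labels produced by Verify (True when an output is flagged as a hallucination), not an actual Verify API call:

# Hypothetical per-output labels from Verify (True = hallucination detected).
verify_labels = [False, True, False, False, True]

hallucinated = sum(verify_labels)            # number of hallucinated outputs
total = len(verify_labels)                   # total number of prompts
hallucination_rate = hallucinated / total    # lower is better

print(f"Hallucination rate: {hallucination_rate:.2%}")  # 40.00% for this toy list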


Example of Hallucination Detection

RAG Example

Context

Jonathan Stark (born April 3, 1971) is a former professional tennis player from the United States. 
During his career he won two Grand Slam doubles titles (the 1994 French Open Men's Doubles and 
the 1995 Wimbledon Championships Mixed Doubles). He reached the men's singles final at the French Open in 1988, 
won the French Open men's doubles title in 1984, and helped France win the Davis Cup in 1991.

Prompt

Which tennis player won more Grand Slam titles, Henri Leconte or Jonathan Stark?

LLM Answer

Jonathan Stark won two Grand Slam titles, while Henri Leconte won four. 
Therefore, Henri Leconte won more Grand Slam titles.

Verify Label

Hallucination: Yes

Verify Response

The answer introduces new information about Henri Leconte's Grand Slam titles, 
which is not present in the document.

Non-RAG Example

Prompt

What are the ingredients that made the most expensive donut so expensive?

LLM Answer

The most expensive donut, called "The Golden Donut," was made expensive by several factors:

24-karat gold leaf: Used for decoration.

Verify Label

Hallucination: No

Verify Response

The response mentions 'The Golden Donut' and states that it was made expensive by several factors, 
including 24-karat gold leaf used for decoration. The search results provide information about expensive donuts, 
including one called 'The Golden Donut' or similar names, which are associated with luxurious ingredients 
like 24-karat gold and Cristal champagne. The response correctly identifies 24-karat gold leaf as 
a factor contributing to the donut's expensiveness, which is supported by multiple search results. 
While the response simplifies the information, it does not introduce factually incorrect 
or fabricated details about the donut's ingredients.