<!--
keywords: RAG techniques, Retrieval-Augmented Generation prompt engineering, hallucination detection, LLM hallucination rate, kluster.ai, Verify API, prompt design comparison, large language model evaluation
-->
# Comparison of Retrieval-Augmented Generation (RAG) Prompting Techniques and Hallucinations
LLMs can generate fluent answers but still hallucinate facts, especially in Retrieval-Augmented Generation (RAG) workflows. This leaderboard aims to understand how different prompt engineering strategies impact hallucination rates across models. In other words: Which prompt format is most reliable? Which models are more sensitive to prompt structure? The goal is to inform better design of RAG pipelines so you can reduce factual errors in downstream applications.
We present hallucination rates for various LLMs under three RAG request strategies. Each method delivers the same document context and question, but differs in how the information is structured during the request.
## Overview
- **What we measure**: Hallucination rate (%) across three RAG request patterns.
- **RAG patterns compared**:
1) **System Prompt**: context is placed in the system message; user sends only the question.
2) **Single-Turn**: one user message that includes both the context *and* the question.
3) **Two-Turn**: first user message provides the context, a second user message provides the question.
- **Why it matters**: Request structure can change reliability significantly. Knowing the safest default helps you ship trustworthy RAG systems faster.
- **Detect & reduce hallucinations**: The same [Verify](https://platform.kluster.ai/verify) API used for these evaluations can be plugged into your pipeline to flag and filter ungrounded answers in real time.
- **How to read the charts**: Lower bars = fewer hallucinations. Error bars show ±1 SD across models.
- **Experiment summary**: 10,000 HaluEval-QA examples, temperature 0, judged with [Verify](https://docs.kluster.ai/get-started/verify/overview/).
### RAG Techniques Evaluated
**1. RAG with Context in System Prompt**
The document is embedded inside the system prompt, and the user sends only the question:
```text
[System]: You are an assistant for question-answering tasks.
Given the QUESTION and DOCUMENT you must answer the QUESTION using the information in the DOCUMENT.
You must not offer new information beyond the context provided in the DOCUMENT. Do not add any external knowledge.
The ANSWER also must not contradict information provided in the DOCUMENT.
If the DOCUMENT does not contain the facts to answer the QUESTION or you do not know the answer, you truthfully say that you do not know.
You have access to information provided by the user as DOCUMENT to answer the QUESTION, and nothing else.
Use three sentences maximum and keep the answer concise.
DOCUMENT: <context>
[User]: <prompt>
```
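To make the pattern concrete, here is a minimal sketch of how it maps onto an OpenAI-compatible chat completions request. The base URL, API key, and model id are placeholders, and the instructions are abbreviated from the full prompt above; this is not the exact evaluation harness used for the leaderboard:
```python
# Minimal sketch of the context-in-system-prompt pattern via an OpenAI-compatible
# client. base_url, api_key, and the model id are placeholders, not the exact
# setup used for these results.
from openai import OpenAI

client = OpenAI(base_url="https://api.kluster.ai/v1", api_key="YOUR_API_KEY")

SYSTEM_WITH_DOCUMENT = (
    "You are an assistant for question-answering tasks. "
    "Answer the QUESTION using only the information in the DOCUMENT. "
    "If the DOCUMENT does not contain the answer, truthfully say you do not know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "DOCUMENT: {context}"
)

def ask_with_system_context(context: str, question: str, model: str) -> str:
    # The document travels in the system message; the user turn is only the question.
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_WITH_DOCUMENT.format(context=context)},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```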
**2. RAG with Context and Question in Single-Turn**
Both the document and question are concatenated in a single user message:
```text
[System]: You are an assistant for question-answering tasks.
Given the QUESTION and DOCUMENT you must answer the QUESTION using the information in the DOCUMENT.
You must not offer new information beyond the context provided in the DOCUMENT. Do not add any external knowledge.
The ANSWER also must not contradict information provided in the DOCUMENT.
If the DOCUMENT does not contain the facts to answer the QUESTION or you do not know the answer, you truthfully say that you do not know.
You have access to information provided by the user as DOCUMENT to answer the QUESTION, and nothing else.
Use three sentences maximum and keep the answer concise.
[User]:
DOCUMENT: <context>
QUESTION: <prompt>
```
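The same request expressed as a single-turn payload, shown here as just the messages list (instructions abbreviated; the API call itself is identical to the sketch above):
```python
# Minimal sketch of the single-turn payload. INSTRUCTIONS abbreviates the full
# system prompt above (minus the DOCUMENT line); send the result with the same
# chat.completions.create call as in the previous sketch.
INSTRUCTIONS = (
    "You are an assistant for question-answering tasks. "
    "Answer the QUESTION using only the information in the DOCUMENT. "
    "If the DOCUMENT does not contain the answer, truthfully say you do not know. "
    "Use three sentences maximum and keep the answer concise."
)

def single_turn_messages(context: str, question: str) -> list[dict]:
    # Document and question share one user turn; the system message holds only instructions.
    return [
        {"role": "system", "content": INSTRUCTIONS},
        {"role": "user", "content": f"DOCUMENT: {context}\nQUESTION: {question}"},
    ]
```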
**3. RAG with Context and Question in Two Turns**
The document and question are sent in separate user messages:
```text
[System]: You are an assistant for question-answering tasks.
Given the QUESTION and DOCUMENT you must answer the QUESTION using the information in the DOCUMENT.
You must not offer new information beyond the context provided in the DOCUMENT. Do not add any external knowledge.
The ANSWER also must not contradict information provided in the DOCUMENT.
If the DOCUMENT does not contain the facts to answer the QUESTION or you do not know the answer, you truthfully say that you do not know.
You have access to information provided by the user as DOCUMENT to answer the QUESTION, and nothing else.
Use three sentences maximum and keep the answer concise.
[User]: DOCUMENT: <context>
[User]: QUESTION: <prompt>
```
*Note: This method did **not** work on Gemma 3 27B because its default chat template rejects consecutive user messages without an intervening assistant response.*
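The two-turn variant as a payload sketch, reusing the `INSTRUCTIONS` string from the single-turn sketch; the two consecutive user messages are exactly what Gemma 3 27B's default chat template rejects:
```python
# Minimal sketch of the two-turn payload (reuses INSTRUCTIONS from the
# single-turn sketch). Some chat templates, e.g. Gemma 3 27B's default, reject
# consecutive user messages without an assistant turn in between.
def two_turn_messages(context: str, question: str) -> list[dict]:
    return [
        {"role": "system", "content": INSTRUCTIONS},
        {"role": "user", "content": f"DOCUMENT: {context}"},
        {"role": "user", "content": f"QUESTION: {question}"},
    ]
```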
### Dataset
We evaluate all three prompting strategies on the **HaluEval QA** benchmark, a large-scale collection of RAG question-answer examples.
- **Source**: [HaluEval QA](https://huggingface.co/datasets/pminervini/HaluEval/viewer/qa?views%5B%5D=qa)
- **Size**: 10,000 question-document pairs
- **Content**: Each example contains a short passage (extracted primarily from Wikipedia-style articles) and an accompanying question that can be answered **only** from that passage.
- **Use case**: Designed to measure whether an LLM can remain faithful to supplied context without inventing new facts.
All model responses are generated with *temperature = 0* to remove sampling randomness, so that differences in hallucination rate stem solely from the prompt format.
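A short sketch of pulling the benchmark with the `datasets` library; the config, split, and field names below follow the HaluEval QA dataset card but are worth double-checking:
```python
# Minimal sketch of loading HaluEval QA. Config, split, and field names are
# taken from the dataset card and should be verified; treat them as assumptions.
from datasets import load_dataset

halueval = load_dataset("pminervini/HaluEval", "qa")
print(halueval)  # inspect the available splits and columns

split = next(iter(halueval.values()))
example = split[0]
context = example["knowledge"]   # passage the answer must stay grounded in
question = example["question"]   # answerable only from that passage
```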
### Metric
The values in the table indicate the **hallucination rate (%)**: the percentage of answers judged factually incorrect or ungrounded with respect to the provided context.
Hallucination rates are automatically computed using **[Verify](https://platform.kluster.ai/verify)** by [kluster.ai](https://kluster.ai/), the [leading](https://www.kluster.ai/blog/introducing-verify-by-kluster-ai-the-missing-trust-layer-in-your-ai-stack) AI-powered hallucination detection API that cross-checks model claims against the source document.
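As a concrete reading of the metric, the aggregation reduces to a simple percentage; in the sketch below the Verify call is abstracted behind a hypothetical `is_hallucination` predicate, not the actual API:
```python
# Sketch of the aggregation only. `is_hallucination` is a hypothetical stand-in
# for a Verify judgement (True = answer is ungrounded or contradicts the document).
def hallucination_rate(judgements: list[bool]) -> float:
    """Return the percentage of answers flagged as hallucinated."""
    return 100.0 * sum(judgements) / len(judgements)

# Usage: rate = hallucination_rate([is_hallucination(ans, doc) for ans, doc in pairs])
```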