Comparison of Retrieval-Augmented Generation (RAG) Prompting Techniques and Hallucinations

LLMs can generate fluent answers but still hallucinate facts, especially in Retrieval-Augmented Generation (RAG) workflows. This leaderboard examines how different prompt engineering strategies affect hallucination rates across models. In other words: Which prompt format is most reliable? Which models are more sensitive to prompt structure? The goal is to inform better design of RAG pipelines so you can reduce factual errors in downstream applications.

We present hallucination rates for various LLMs under three RAG request strategies. Each strategy supplies the same document context and question but differs in how that information is structured in the request.

Overview

  • What we measure: Hallucination rate (%) across three RAG request patterns.
  • RAG patterns compared (see the message-format sketch after this list):
    1. System Prompt: the context is placed in the system message; the user sends only the question.
    2. Single-Turn: a single user message contains both the context and the question.
    3. Two-Turn: the first user message provides the context; a second user message provides the question.
  • Why it matters: Request structure can change reliability significantly. Knowing the safest default helps you ship trustworthy RAG systems faster.
  • Detect & reduce hallucinations: The same Verify API used for these evaluations can be plugged into your pipeline to flag and filter ungrounded answers in real time.
  • How to read the charts: Lower bars = fewer hallucinations. Error bars show ±1 SD across models.
  • Experiment summary: 10,000 HaluEval-QA examples, temperature 0, judged with Verify.
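
As a concrete illustration of the three request patterns, the sketch below arranges the same context and question in each format using an OpenAI-style chat-messages layout. The role layout, wording, and variable names here are illustrative assumptions; the exact prompt templates used for the leaderboard are given at the end of this page.

```python
# Minimal sketch of the three request patterns (illustrative only; not the
# leaderboard's exact templates).

context = "Paris is the capital and most populous city of France."
question = "What is the capital of France?"

# 1. System Prompt: the context lives in the system message; the user sends only the question.
system_prompt_messages = [
    {"role": "system", "content": f"Answer using only this context:\n{context}"},
    {"role": "user", "content": question},
]

# 2. Single-Turn: a single user message carries both the context and the question.
single_turn_messages = [
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
]

# 3. Two-Turn: the context arrives in a first user message, the question in a second.
#    (Some chat APIs require alternating roles, so a two-turn exchange may need an
#    intervening assistant acknowledgement in practice.)
two_turn_messages = [
    {"role": "user", "content": f"Context:\n{context}"},
    {"role": "user", "content": question},
]
```

In every case the model receives identical context and question text; only the packaging differs, which is the variable this comparison isolates.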

Note: Full experiment details, including prompt templates, dataset description, and evaluation methodology, are provided at the end of this page for reference.
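
To make the chart semantics above concrete, here is a minimal sketch of how per-pattern aggregates of that shape could be computed from per-model hallucination rates. The rates below are placeholders for illustration only, not leaderboard results, and the use of the sample standard deviation is an assumption.

```python
import numpy as np

# Placeholder hallucination rates (%) per model for each request pattern
# (hypothetical values, one entry per model).
rates_by_pattern = {
    "system_prompt": [4.2, 6.8, 3.5, 9.1],
    "single_turn":   [5.0, 7.4, 4.1, 10.3],
    "two_turn":      [6.3, 8.9, 5.2, 11.7],
}

for pattern, rates in rates_by_pattern.items():
    mean = np.mean(rates)       # bar height: mean hallucination rate across models
    sd = np.std(rates, ddof=1)  # error bar: ±1 standard deviation across models
    print(f"{pattern}: {mean:.1f}% ± {sd:.1f}%")
```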