The LLM Hallucination Detection Leaderboard is a public, continuously updated comparison of how well popular Large Language Models (LLMs) avoid hallucinations: responses that are factually incorrect, fabricated, or unsupported by evidence. By surfacing transparent metrics across tasks, we help practitioners choose models they can trust in production.
Why does hallucination detection matter?
- User Trust & Safety: Hallucinations undermine user confidence and can damage your product's reputation.
- Retrieval-Augmented Generation (RAG) Quality: In enterprise workflows, LLMs must remain faithful to supplied context. Measuring hallucinations highlights which models respect that constraint.
- Regulatory & Compliance Pressure: Upcoming AI regulations require demonstrable accuracy standards. Reliable hallucination metrics can help you meet these requirements.
How we measure hallucinations
We evaluate each model on two complementary benchmarks and compute a hallucination rate for each, where lower is better (a short computation sketch follows the list):
- HaluEval-QA (RAG setting): Given a question and a supporting document, the model must answer only using the provided context.
- UltraChat Filtered (Non-RAG setting): Open-domain questions with no extra context test the model's internal knowledge.
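To make the metric concrete, here is a minimal sketch of how a per-benchmark hallucination rate can be aggregated from binary verdicts over evaluated responses. The `EvalRecord` fields, the `toy_judge` rule, and the example data are illustrative assumptions, not the leaderboard's actual pipeline; a real judge would delegate to an automated verifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class EvalRecord:
    question: str
    answer: str                    # the model's response
    context: Optional[str] = None  # supporting document (RAG setting); None for non-RAG


def hallucination_rate(
    records: Iterable[EvalRecord],
    judge: Callable[[EvalRecord], bool],
) -> float:
    """Fraction of responses the judge flags as hallucinated (lower is better)."""
    records = list(records)
    if not records:
        raise ValueError("no records to evaluate")
    flagged = sum(1 for record in records if judge(record))
    return flagged / len(records)


def toy_judge(record: EvalRecord) -> bool:
    # Hypothetical placeholder rule: flag answers not literally supported by the
    # provided context. A real judge would call a verifier model or API instead.
    return record.context is not None and record.answer not in record.context


rag_records = [
    EvalRecord("Who wrote the report?", "Dr. Smith",
               context="The report was written by Dr. Smith."),
    EvalRecord("When was it published?", "In 2019",
               context="The report was written by Dr. Smith."),
]
print(f"RAG-style hallucination rate: {hallucination_rate(rag_records, toy_judge):.2%}")
```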
Outputs are automatically verified by Verify from kluster.ai, which cross-checks claims against the source document or web results.
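As a rough illustration of how such an automated cross-check step could be wired in, the snippet below posts a question, answer, and optional context to a verification endpoint and reads back a boolean verdict. The URL, payload fields, and response schema are hypothetical placeholders, not kluster.ai's documented Verify API; consult their documentation for the real interface.

```python
import os
from typing import Optional

import requests

# Hypothetical endpoint and schema for illustration only.
VERIFY_URL = os.environ.get("VERIFY_URL", "https://example.invalid/v1/verify")
API_KEY = os.environ.get("VERIFY_API_KEY", "")


def is_hallucinated(question: str, answer: str, context: Optional[str] = None) -> bool:
    """Return True if the verifier judges the answer as hallucinated."""
    payload = {"question": question, "answer": answer}
    if context is not None:
        payload["context"] = context  # RAG setting: check the answer against the document
    # Without context, a verifier might fall back to web evidence instead.
    response = requests.post(
        VERIFY_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()
    return bool(response.json().get("hallucinated", False))
```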
Note: Full experiment details, including prompt templates, dataset description, and evaluation methodology, are provided at the end of this page for reference.
Stay informed as we add new models and tasks: follow us on X or join our Discord for the latest updates on trustworthy LLMs.