Files changed (3)
  1. app.py +7 -9
  2. src/populate.py +17 -2
  3. submit.md +75 -1
app.py CHANGED
@@ -161,21 +161,19 @@ demo = gr.Blocks(css=custom_css)
 with demo:
     gr.HTML(f"""
     <div style="text-align: center; margin-top: 2em; margin-bottom: 1em;">
-        <img src="data:image/png;base64,{b64_string}" alt="KlusterAI logo"
+        <img src="data:image/png;base64,{b64_string}" alt="kluster.ai logo"
             style="height: 80px; display: block; margin-left: auto; margin-right: auto;" />

         <div style="font-size: 2.5em; font-weight: bold; margin-top: 0.4em; color: var(--text-color);">
-            LLM Hallucination Detection <span style="color: var(--link-color);">Leaderboard</span>
+            LLM Hallucination Detection Leaderboard
         </div>

-        <div style="font-size: 1.5em; margin-top: 0.5em; color: var(--text-color);">
+        <div style="font-size: 1.5em; margin-top: 0.5em;">
             Evaluating factual accuracy and faithfulness of LLMs in both RAG and real-world knowledge settings with
-            <a href="https://platform.kluster.ai/verify" target="_blank"
-               style="color: var(--link-color); text-decoration: none;">
+            <a href="https://platform.kluster.ai/verify" target="_blank">
                 Verify
             </a> by
-            <a href="https://platform.kluster.ai/" target="_blank"
-               style="color: var(--link-color); text-decoration: none;">
+            <a href="https://platform.kluster.ai/" target="_blank">
                 kluster.ai
             </a>
         </div>
@@ -211,10 +209,10 @@ with demo:
     # ---------- Leaderboard ----------
     leaderboard = init_leaderboard(LEADERBOARD_DF)

-    with gr.TabItem("📝 Document", elem_id="llm-benchmark-tab-table", id=2):
+    with gr.TabItem("📝 Details", elem_id="llm-benchmark-tab-table", id=2):
         gr.Markdown((Path(__file__).parent / "docs.md").read_text())

-    with gr.TabItem("🚀 Submit here! ", elem_id="llm-benchmark-tab-table", id=3):
+    with gr.TabItem("🚀 Submit Here! ", elem_id="llm-benchmark-tab-table", id=3):
         gr.Markdown((Path(__file__).parent / "submit.md").read_text())

     # with gr.Column():
src/populate.py CHANGED
@@ -33,8 +33,23 @@ def get_leaderboard_df(results_path):
     medal_map = {1: "🥇", 2: "🥈", 3: "🥉"}

     def medal_html(rank):
-        m = medal_map.get(rank)
-        return f'<span style="font-size:2.0rem;">{m}</span>' if m else rank
+        """Return the HTML cell content for the Rank column.
+
+        Each cell is prefixed with a hidden, zero-padded copy of the numeric
+        rank so that DataTables (used under the hood by the
+        gradio_leaderboard component) sorts the column by that hidden value
+        while still displaying the medal icon for the top 3 ranks. Ranks
+        above 3 get the same hidden prefix plus the plain integer.
+        """
+        medal = medal_map.get(rank)
+        if medal:
+            # Prepend a hidden numeric span so string sorting still works numerically.
+            return (
+                f'<span style="display:none">{rank:04}</span>'  # zero-padded for stable string sort
+                f'<span style="font-size:2.0rem;">{medal}</span>'
+            )
+        # For other ranks, also zero-pad to keep width and ensure proper string sort.
+        return f'<span style="display:none">{rank:04}</span>{rank}'

     df["Rank"] = df.index + 1
     df["Rank"] = df["Rank"].apply(medal_html)
submit.md CHANGED
@@ -1 +1,75 @@
-# If you are interested, please submit here ...
+# LLM Hallucination Detection Leaderboard Submission Guidelines
+
+Thank you for your interest in contributing to the **LLM Hallucination Detection Leaderboard**! We welcome submissions from researchers and practitioners who have built or finetuned language models that can be evaluated on our hallucination benchmarks.
+
+---
+
+## 1. What to Send
+
+Please email **ryan@kluster.ai** with the subject line:
+
+```
+[Verify Leaderboard Submission] <Your-Model-Name>
+```
+
+Attach **one ZIP file** that contains **all of the following** (see the packaging sketch after this list):
+
+1. **`model_card.md`** – A short Markdown file describing your model:
+   - Name and version
+   - Architecture / base model
+   - Training or finetuning procedure
+   - License
+   - Intended use & known limitations
+   - Contact information
+2. **`results.csv`** – A CSV file with **one row per prompt** and **one column per field** (see schema below).
+3. (Optional) **`extra_notes.md`** – Anything else you would like us to know (e.g., additional analysis).
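+
+A minimal packaging sketch using Python's standard `zipfile` module (file names as listed above):
+
+```python
+import zipfile
+from pathlib import Path
+
+# Bundle the submission files into one archive; extra_notes.md is optional.
+with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
+    for name in ("model_card.md", "results.csv", "extra_notes.md"):
+        if Path(name).exists():
+            zf.write(name)
+```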
+
+---
+
+## 2. CSV Schema
+
+| Column | Description |
+|-------------------|------------------------------------------------------------------|
+| `request` | The exact input prompt shown to the model. |
+| `response` | The raw output produced by the model. |
+| `verify_response` | The Verify judgment or explanation regarding hallucination. |
+| `verify_label` | The final boolean / categorical label (e.g., `TRUE`, `FALSE`). |
+| `task` | The benchmark or dataset name the sample comes from. |
+
+**Important:** Use UTF-8 encoding and **do not** add extra columns without prior discussion; anything beyond the schema belongs in `extra_notes.md`. All submissions must be judged with Verify by kluster.ai so that every leaderboard entry is scored the same way.
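+
+For illustration, a minimal sketch of writing a conforming `results.csv` with Python's standard `csv` module (the row values are placeholders, not real model or Verify output):
+
+```python
+import csv
+
+FIELDS = ["request", "response", "verify_response", "verify_label", "task"]
+
+# Placeholder row; in practice, append one dict per evaluated prompt.
+rows = [
+    {
+        "request": "What is the capital of the UK?",
+        "response": "London is the capital of the UK.",
+        "verify_response": "The statement is factually correct.",
+        "verify_label": "TRUE",
+        "task": "UltraChat",
+    },
+]
+
+with open("results.csv", "w", newline="", encoding="utf-8") as f:
+    writer = csv.DictWriter(f, fieldnames=FIELDS)
+    writer.writeheader()
+    writer.writerows(rows)  # the csv module quotes embedded commas/newlines
+```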
+
+---
+
+## 3. Evaluation Datasets
+
+Run your model on the following public datasets and include *all* examples in your CSV. You can load them directly from Hugging Face:
+
+| Dataset | Hugging Face Link |
+|---------|-------------------|
+| HaluEval QA (`qa_samples` subset, with Question and Knowledge columns) | https://huggingface.co/datasets/pminervini/HaluEval |
+| UltraChat | https://huggingface.co/datasets/kluster-ai/ultrachat-sampled |
+
+
52
+ ---
53
+
54
+ ## 5. Example Row
55
+
56
+ ```csv
57
+ request,response,verify_response,verify_label,task
58
+ "What is the capital of the UK?","London is the capital of the UK.","The statement is factually correct.",CORRECT,TruthfulQA
59
+ ```
60
+
61
+ ---
62
+
63
+ ## 6. Review Process
64
+
65
+ 1. We will sanity-check the file format and reproduce a random subset.
66
+ 2. If everything looks good, your scores will appear on the public leaderboard.
67
+ 3. We may reach out for clarifications, please keep an eye on your inbox.
68
+
69
+ ---
70
+
71
+ ## 7. Contact
72
+
73
+ Questions? Email **ryan@kluster.ai**.
74
+
75
+ We look forward to your submissions and to advancing reliable language models together!