arielgera committed
Commit
0278d2c
·
1 Parent(s): 526b78c

add description

Files changed (1)
  1. app.py +31 -0
app.py CHANGED
@@ -42,3 +42,34 @@ styled_data = (
 
 
 st.dataframe(styled_data, use_container_width=True, height=800, hide_index=True)
+
+ st.text("\n\n")
+ st.markdown(
+ r"""
+ This leaderboard measures the **system-level performance and behavior of LLM judges**, and was created as part of the **[JuStRank paper](https://www.arxiv.org/abs/2412.09569)** from ACL 2025.
+
+ Judges are sorted according to *Ranking Agreement* with humans, i.e., comparing how the judges rank different systems (generative models) relative to how humans rank those systems on [LMSys Arena](https://lmarena.ai/leaderboard/text/hard-prompts-english).
+
+ We also compare judges in terms of the *Decisiveness* and *Bias* reflected in their judgment behaviors (refer to the paper for details).
+
+ In our research we tested 10 **LLM judges** and 8 **reward models**, and asked them to score the [responses](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto/tree/main/data/arena-hard-v0.1/model_answer) of 63 systems to the 500 questions from Arena Hard v0.1.
+ For each LLM judge we tried 4 different _realizations_, i.e., different prompt and scoring methods used with the LLM judge.
+
+ In total, the judge ranking is derived from **[1.5 million raw judgment scores](https://huggingface.co/datasets/ibm-research/justrank_judge_scores)** (48 judge realizations X 63 target systems X 500 instances).
+
+ If you find this useful, please cite our work 🤗
+
+ ```bibtex
+ @inproceedings{gera2025justrank,
+ title={JuStRank: Benchmarking LLM Judges for System Ranking},
+ author={Gera, Ariel and Boni, Odellia and Perlitz, Yotam and Bar-Haim, Roy and Eden, Lilach and Yehudai, Asaf},
+ booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+ month={july},
+ address={Vienna, Austria},
+ year={2025},
+ url={www.arxiv.org/abs/2412.09569},
+ }
+ ```
+ """
+ )
+
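
For context, here is a minimal, self-contained sketch of the Streamlit pattern this diff extends: a styled leaderboard table rendered with `st.dataframe`, followed by a markdown description block. The DataFrame contents, column names, and styling below are hypothetical placeholders; the real `styled_data` is built earlier in `app.py` and is not part of this diff.

```python
# Illustrative sketch only: mirrors the st.dataframe + st.markdown pattern used in app.py.
# The data and styling below are made-up placeholders, not the actual JuStRank results.
import pandas as pd
import streamlit as st

# Hypothetical leaderboard rows.
data = pd.DataFrame(
    {
        "Judge": ["judge-a", "judge-b", "judge-c"],
        "Ranking Agreement": [0.83, 0.79, 0.71],
    }
)

# Highlight the best value in the sort column, standing in for the app's real styling.
styled_data = data.style.highlight_max(subset=["Ranking Agreement"], color="lightgreen")

st.dataframe(styled_data, use_container_width=True, height=800, hide_index=True)

st.text("\n\n")
st.markdown("Leaderboard description goes here (the text added by this commit).")
```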
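The added description sorts judges by *Ranking Agreement* with human system rankings. As an illustration only, the sketch below uses Kendall's tau rank correlation to quantify that kind of agreement; the actual agreement measure is defined in the JuStRank paper, and the judge and human scores here are invented.

```python
# Illustrative only: comparing the system ranking induced by a judge's scores with a
# human-derived ranking via Kendall's tau. The actual JuStRank agreement measure is
# defined in the paper; all numbers below are invented.
from scipy.stats import kendalltau

systems = ["system-1", "system-2", "system-3", "system-4"]
judge_scores = {"system-1": 7.2, "system-2": 6.8, "system-3": 5.9, "system-4": 6.1}
human_scores = {"system-1": 1250, "system-2": 1230, "system-3": 1195, "system-4": 1180}

# Kendall's tau compares the orderings induced by the two score lists (1.0 = identical ranking).
tau, _ = kendalltau(
    [judge_scores[s] for s in systems],
    [human_scores[s] for s in systems],
)
print(f"Ranking agreement (Kendall's tau): {tau:.2f}")
```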