TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Abstract
TrustJudge, a probabilistic framework, addresses inconsistencies in LLM-as-a-judge evaluation by using distribution-sensitive scoring and likelihood-aware aggregation, improving accuracy and reliability.
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-scored responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A>B>C>A) and equivalence contradictions (A=B=C≠A). We argue that these issues stem from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: (1) distribution-sensitive scoring, which computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and (2) likelihood-aware aggregation, which resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge on our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The code is available at https://github.com/TrustJudge/TrustJudge.
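To make the two mechanisms concrete, the sketch below illustrates one plausible reading of them in Python. It is not the authors' implementation (see the linked repository for that): the function names, the 1-5 rating scale, and the simple averaging of the two comparison directions are assumptions made purely for illustration.

```python
import math

def distribution_sensitive_score(score_logprobs):
    """Expected score over the judge's discrete rating distribution.

    score_logprobs: dict mapping each rating (e.g. 1..5) to the judge
    model's log-probability of emitting that rating token. Instead of
    taking the single most likely rating, we keep the full distribution
    and return its expectation, so information in the tails is not lost.
    """
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    total = sum(probs.values())  # renormalize over the rating vocabulary
    return sum(s * p / total for s, p in probs.items())

def likelihood_aware_preference(p_a_over_b, p_b_over_a):
    """Resolve a pairwise comparison from bidirectional preference probabilities.

    p_a_over_b: probability the judge prefers A when shown the order (A, B).
    p_b_over_a: probability the judge prefers B when shown the order (B, A).
    Averaging the two directions (an assumption here, not necessarily the
    paper's exact aggregation) cancels position bias; the sign of the margin
    gives the preference and its magnitude a confidence, which helps avoid
    the ambiguous ties that drive transitivity violations.
    """
    margin = 0.5 * (p_a_over_b + (1.0 - p_b_over_a)) - 0.5
    if margin > 0:
        return "A", margin
    if margin < 0:
        return "B", -margin
    return "tie", 0.0

# A judge that puts most mass on rating 4 but some on 3 and 5
# yields a continuous 4.0 rather than a hard discrete label.
print(distribution_sensitive_score({3: math.log(0.2), 4: math.log(0.6), 5: math.log(0.2)}))
print(likelihood_aware_preference(0.7, 0.4))  # ('A', 0.15)
```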
Community
TrustJudge is a novel probabilistic framework that addresses inconsistencies in large language models used for automated evaluation by enhancing scoring granularity and resolving transitivity violations, thereby improving the reliability and accuracy of assessments without requiring additional training or human intervention.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations (2025)
- Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution (2025)
- Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation (2025)
- Icon²: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation (2025)
- Bridging Human and LLM Judgments: Understanding and Narrowing the Gap (2025)
- Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge (2025)
- Are Today's LLMs Ready to Explain Well-Being Concepts? (2025)