ReviewScore: Misinformed Peer Review Detection with Large Language Models
Abstract
An automated engine evaluates the factuality of review points in AI conference papers, demonstrating moderate agreement with human experts and higher accuracy at the premise level.
Peer review serves as a backbone of academic research, but at most AI conferences, review quality has been degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that rest on incorrect premises, or "questions" in a review that are already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed, and introduce ReviewScore to indicate whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. We then measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreement. We also show that evaluating factuality at the premise level yields significantly higher agreement than evaluating it at the weakness level. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
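The human-model agreement described above is typically quantified with a chance-corrected statistic such as Cohen's kappa over binary "misinformed" labels. A minimal sketch (the labels below are illustrative, not from the paper's dataset, and the paper does not specify this exact metric):

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' labels.

    labels_a, labels_b: equal-length lists of hashable labels
    (e.g. 1 = misinformed review point, 0 = not misinformed).
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling with each
    # annotator's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    if expected == 1.0:  # degenerate case: both always agree by chance
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: human expert vs. LLM on eight review points.
human = [1, 0, 1, 1, 0, 0, 1, 0]
model = [1, 0, 1, 0, 0, 1, 1, 0]
print(cohens_kappa(human, model))  # → 0.5
```

Values around 0.4 to 0.6 are conventionally read as "moderate" agreement, which matches the level the abstract reports for current LLMs.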
Community
We introduce ReviewScore, a new measure of peer review quality that focuses on detecting misinformed review points.
Related papers recommended by the Semantic Scholar API:
- The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors (2025)
- Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers (2025)
- Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback (2025)
- ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks (2025)
- Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework (2025)
- ReviewRL: Towards Automated Scientific Review with RL (2025)
- CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection (2025)