PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Abstract
PRISMM-Bench evaluates the ability of large multimodal models to detect, correct, and reason over inconsistencies in scientific papers, revealing significant challenges in multimodal scientific reasoning.
Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations; such issues are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, inconsistency remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
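The abstract does not reproduce the benchmark's answer schema, so the snippet below is only a minimal sketch of what a structured JSON-based answer option for the inconsistency identification task could look like. All field names (source_element, target_element, inconsistency_type, description) are illustrative assumptions, not the actual PRISMM-Bench format.

```python
import json

# Hypothetical structured answer option for the inconsistency-identification
# task. Field names are illustrative assumptions, not the benchmark's schema.
candidate_answer = {
    "source_element": "Table 2",          # element claimed to conflict
    "target_element": "Section 4.1",      # element it conflicts with
    "inconsistency_type": "reported value mismatch",
    "description": "Accuracy reported as 81.3% in the table but 83.1% in the text.",
}

# Serializing every answer option into the same rigid JSON structure removes
# the superficial stylistic cues (length, phrasing, fluency) that models can
# otherwise exploit as choice-only shortcuts.
print(json.dumps(candidate_answer, indent=2))
```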
Community
Github repo: https://github.com/da-luggas/prismm-bench
Huggingface dataset repo: https://huggingface.co/datasets/wlin21at/PRISMM-Bench
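The dataset can be pulled directly from the Hugging Face Hub. The sketch below assumes the default configuration; the actual split and field names are defined by the dataset card at https://huggingface.co/datasets/wlin21at/PRISMM-Bench, so treat this as a starting point rather than canonical loading code.

```python
from datasets import load_dataset

# Load PRISMM-Bench from the Hugging Face Hub (default configuration).
dataset = load_dataset("wlin21at/PRISMM-Bench")

print(dataset)             # show available splits and features
first_split = next(iter(dataset.values()))
print(first_split[0])      # inspect one inconsistency example
```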
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- ExpVid: A Benchmark for Experiment Video Understanding & Reasoning (2025)
- XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models (2025)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG (2025)
- ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding (2025)
- BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models (2025)
- SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models (2025)
- OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs (2025)