RISE-Video: Can Video Generators Decode Implicit World Rules?
Abstract
RISE-Video is a reasoning-oriented benchmark for text-image-to-video synthesis that evaluates models on cognitive reasoning rather than visual fidelity alone, using a multi-dimensional metric system and automated LMM-based evaluation.
While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
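The abstract describes the automated evaluation only at a high level. As a rough illustration of how an LMM-as-judge pipeline over the four metrics might be wired up, here is a minimal Python sketch; the model name (`gpt-4o`), prompt wording, frame-sampling strategy, and 1-5 scale are all illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch of an LMM-as-judge scoring loop for the four RISE-Video
# dimensions. Model name, prompt wording, frame sampling, and the 1-5 scale
# are illustrative assumptions, not the paper's exact protocol.
import base64
import json
from openai import OpenAI

DIMENSIONS = [
    "Reasoning Alignment",   # does the video follow the implicit rule?
    "Temporal Consistency",  # are objects and identities stable over time?
    "Physical Rationality",  # does motion obey plausible physics?
    "Visual Quality",        # aesthetics and artifact-free rendering
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_frame(path: str) -> dict:
    """Pack one sampled video frame as a base64 image message part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def score_video(prompt: str, frame_paths: list[str]) -> dict[str, int]:
    """Ask the LMM to rate sampled frames on each dimension (1-5)."""
    instruction = (
        "You are evaluating a generated video against its text prompt.\n"
        f"Prompt: {prompt}\n"
        f"Rate each of {DIMENSIONS} from 1 (worst) to 5 (best). "
        "Reply with a JSON object mapping dimension name to integer score."
    )
    content = [{"type": "text", "text": instruction}]
    content += [encode_frame(p) for p in frame_paths]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```

In practice, scores would likely be averaged over multiple frame samples and judge runs to reduce LMM variance.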
Community
Despite strong visual realism, we find that current text-image-to-video models frequently fail to respect implicit world rules when generating complex scenarios. We introduce RISE-Video to systematically evaluate reasoning fidelity in video generation and reveal persistent reasoning gaps across state-of-the-art models.
Code: https://github.com/VisionXLab/Rise-Video
Data: https://huggingface.co/datasets/VisionXLab/RISE-Video
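For readers who want to inspect the benchmark directly, the dataset above should be loadable with the standard `datasets` library; the split name and field layout below are assumptions, so consult the dataset card for the actual schema.

```python
# Minimal sketch: inspect the RISE-Video benchmark from the Hugging Face Hub.
# The split name ("test") and any field names are assumptions; check the
# dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("VisionXLab/RISE-Video", split="test")
print(len(ds))   # the paper reports 467 human-annotated samples
print(ds[0])     # look at one sample's fields
```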
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation (2025)
- ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning (2025)
- Beyond the Last Frame: Process-aware Evaluation for Generative Video Reasoning (2025)
- SVBench: Evaluation of Video Generation Models on Social Reasoning (2025)
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing (2026)
- MMGR: Multi-Modal Generative Reasoning (2025)
- TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering (2025)