arxiv:2604.21686

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Published on Apr 23 · Submitted by taesiri on Apr 24
#1 Paper of the day
Abstract

WorldMark establishes a standardized benchmark for evaluating interactive video generation models with unified controls, identical scenarios, and comprehensive evaluation metrics across multiple model architectures.

AI-generated summary

Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.
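The unified action-mapping layer described in contribution (1) can be sketched as a small adapter pattern: one shared WASD-style action vocabulary, with a per-model translation into that model's native control format. The adapter names, action set, and control dictionaries below are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Shared action vocabulary (hypothetical; the paper only says "WASD-style").
SHARED_ACTIONS = ["W", "A", "S", "D", "L", "R"]

@dataclass
class Adapter:
    """Translates shared actions into one model's native control format."""
    name: str
    translate: Callable[[str], Dict]

# Example adapters: one model takes discrete keypresses, another takes
# continuous velocity/yaw commands. Both mappings are made up for illustration.
discrete_adapter = Adapter(
    name="discrete-model",
    translate=lambda a: {"key": a.lower()},
)

_CONTINUOUS = {
    "W": {"v": 1.0, "yaw": 0.0}, "S": {"v": -1.0, "yaw": 0.0},
    "A": {"v": 0.0, "yaw": -0.5}, "D": {"v": 0.0, "yaw": 0.5},
    "L": {"v": 0.0, "yaw": -1.0}, "R": {"v": 0.0, "yaw": 1.0},
}
continuous_adapter = Adapter(
    name="continuous-model",
    translate=lambda a: dict(_CONTINUOUS[a]),
)

def run_trajectory(adapter: Adapter, actions: List[str]) -> List[Dict]:
    """Map one shared action sequence into a model's native controls,
    so every model can be driven through the same scene and trajectory."""
    return [adapter.translate(a) for a in actions]

traj = ["W", "W", "L", "D"]
print(run_trajectory(discrete_adapter, traj)[0])  # {'key': 'w'}
```

Because every model consumes the same `traj`, per-model differences in output videos can be attributed to the model rather than to hand-tuned control sequences.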

Community

one burning question for me: how well does the unified WASD+L/R action layer generalize to models whose native controls are continuous, nonuniform, or otherwise non-discretizable?

an ablation i'd love to see is removing the adapter and feeding each model directly with its own controls to isolate how much the apples-to-apples comparison actually hinges on the interface versus the model's intrinsic dynamics.

the arXivLens breakdown helped me parse the method details and clarified how the VLM-based action filter and the 3D coherence checks interact across scenes.

overall this looks like a solid step toward reproducible cross-model benchmarking, and i’m curious to see how it handles edge cases in future releases.


Get this paper in your agent:

hf papers read 2604.21686
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
