Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation Paper • 2511.14993 • Published 7 days ago • 200
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks Paper • 2511.15065 • Published 7 days ago • 71
VisPlay: Self-Evolving Vision-Language Models from Images Paper • 2511.15661 • Published 7 days ago • 41
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries Paper • 2511.14349 • Published 8 days ago • 16
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models Paper • 2511.16668 • Published 6 days ago • 52
Scaling Spatial Intelligence with Multimodal Foundation Models Paper • 2511.13719 • Published 9 days ago • 41
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models Paper • 2511.15605 • Published 7 days ago • 22
MiMo-Embodied: X-Embodied Foundation Model Technical Report Paper • 2511.16518 • Published 6 days ago • 23
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO Paper • 2511.16669 • Published 6 days ago • 30
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding Paper • 2511.16595 • Published 6 days ago • 9
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe Paper • 2511.16334 • Published 6 days ago • 85
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models Paper • 2511.11007 • Published 12 days ago • 14
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination Paper • 2511.17490 • Published 5 days ago • 16