VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting • Paper • 2510.21817 • Published Oct 21, 2025 • 42
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play? • Paper • 2509.03516 • Published Sep 3, 2025 • 12
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension • Paper • 2503.08689 • Published Mar 11, 2025 • 4
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension • Paper • 2411.13093 • Published Nov 20, 2024 • 2