AURA: Always-On Understanding and Real-Time Assistance via Video Streams
Abstract
AURA is an end-to-end streaming visual interaction framework: it processes video streams continuously and supports both real-time question answering and proactive responses through integrated context management and optimized deployment.
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
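To make the "always-on" interaction pattern in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of a streaming loop that ingests frames into a bounded context, answers queued questions in real time, and fires proactive responses on a placeholder trigger. All class and method names here are illustrative assumptions.

```python
import collections
import dataclasses


@dataclasses.dataclass
class Frame:
    timestamp: float
    features: tuple  # stand-in for encoded visual tokens


class StreamingAssistant:
    """Toy sketch of an always-on loop: ingest frames continuously,
    answer queued questions in real time, and emit proactive notes
    when a simple trigger fires. Hypothetical, not AURA's actual code."""

    def __init__(self, context_limit=8):
        self.context = collections.deque(maxlen=context_limit)  # bounded visual context
        self.pending_questions = collections.deque()
        self.responses = []

    def ingest(self, frame):
        self.context.append(frame)
        # proactive trigger: placeholder condition on the newest frame
        if sum(frame.features) > 10:
            self.responses.append((frame.timestamp, "proactive: notable event"))
        # answer any question that arrived at or before this frame
        while self.pending_questions and self.pending_questions[0][0] <= frame.timestamp:
            t, q = self.pending_questions.popleft()
            self.responses.append((frame.timestamp, f"answer to {q!r} using {len(self.context)} frames"))

    def ask(self, timestamp, question):
        self.pending_questions.append((timestamp, question))


assistant = StreamingAssistant()
assistant.ask(1.0, "what is happening?")
for i in range(12):
    assistant.ingest(Frame(timestamp=float(i), features=(i,)))
print(assistant.responses[0])
```

The key property this sketches is that question answering and proactive output share one loop over the stream, rather than running as a decoupled trigger-response pipeline.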
Community
🔥 We've open-sourced the model weights and demo code. Feel free to try them out!
the most interesting bit to me is how aura handles long-horizon context with a smart cache and a tiny persistent memory module, instead of reprocessing everything every frame. they fuse streaming data construction with a unified model so the memory supports both real-time q&a and long-horizon interaction, which is nontrivial under strict latency constraints. the design of the retrieval and cache policy, especially how they decide what to keep, reuse, or drop across scenes, seems to be what actually drives the latency and throughput gains. i worry a bit about edge cases like rapid scene changes or bursts of occlusion that could stress the cache in ways the benchmarks don't capture. btw the arxivlens breakdown helped me parse the method details, if you skim the paper the walkthrough is pretty solid: https://arxivlens.com/PaperView/Details/aura-always-on-understanding-and-real-time-assistance-via-video-streams-57-a3d13574
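The keep/reuse/drop policy the comment speculates about could look something like the following toy cache, which evicts the lowest-scoring entry when full, scoring frames by a mix of recency and an importance estimate (e.g. scene-change magnitude). This is a hypothetical illustration of that kind of policy, not AURA's actual mechanism; all names and weights are assumptions.

```python
import itertools


class FrameCache:
    """Hypothetical keep/reuse/drop policy: when the cache is full,
    evict the entry with the lowest score, where score blends recency
    with a per-frame importance estimate. Not AURA's actual policy."""

    def __init__(self, capacity=4, recency_weight=0.5):
        self.capacity = capacity
        self.recency_weight = recency_weight
        self.entries = {}  # key -> (timestamp, importance)
        self.clock = itertools.count()

    def _score(self, ts, importance, now):
        # recent and important frames score high; stale, low-importance ones get dropped
        recency = 1.0 / (1.0 + (now - ts))
        return self.recency_weight * recency + (1 - self.recency_weight) * importance

    def add(self, key, importance):
        now = next(self.clock)
        if len(self.entries) >= self.capacity and key not in self.entries:
            victim = min(self.entries, key=lambda k: self._score(*self.entries[k], now))
            del self.entries[victim]
        self.entries[key] = (now, importance)


cache = FrameCache(capacity=3)
cache.add("f0", importance=0.9)   # scene change, kept despite its age
cache.add("f1", importance=0.1)
cache.add("f2", importance=0.1)
cache.add("f3", importance=0.5)   # cache full: lowest-scoring frame is evicted
print(sorted(cache.entries))
```

A policy like this also makes the commenter's worry concrete: a burst of rapid scene changes inflates many importance scores at once, which can crowd out older but still-relevant context.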