AURA: Always-On Understanding and Real-Time Assistance via Video Streams
Abstract
AURA is an end-to-end streaming visual interaction framework: it processes video streams continuously and supports both real-time question answering and proactive responses through integrated context management and optimized deployment.
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
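To make the "always-on" interaction pattern in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of a streaming loop that ingests frames into a bounded context, answers queued questions in real time, and fires proactive responses on a placeholder trigger. All class and method names here are illustrative assumptions.

```python
import collections
import dataclasses


@dataclasses.dataclass
class Frame:
    timestamp: float
    features: tuple  # stand-in for encoded visual tokens


class StreamingAssistant:
    """Toy sketch of an always-on loop: ingest frames continuously,
    answer queued questions in real time, and emit proactive notes
    when a simple trigger fires. Hypothetical, not AURA's actual code."""

    def __init__(self, context_limit=8):
        self.context = collections.deque(maxlen=context_limit)  # bounded visual context
        self.pending_questions = collections.deque()
        self.responses = []

    def ingest(self, frame):
        self.context.append(frame)
        # proactive trigger: placeholder condition on the newest frame
        if sum(frame.features) > 10:
            self.responses.append((frame.timestamp, "proactive: notable event"))
        # answer any question that arrived at or before this frame
        while self.pending_questions and self.pending_questions[0][0] <= frame.timestamp:
            t, q = self.pending_questions.popleft()
            self.responses.append((frame.timestamp, f"answer to {q!r} using {len(self.context)} frames"))

    def ask(self, timestamp, question):
        self.pending_questions.append((timestamp, question))


assistant = StreamingAssistant()
assistant.ask(1.0, "what is happening?")
for i in range(12):
    assistant.ingest(Frame(timestamp=float(i), features=(i,)))
print(assistant.responses[0])
```

The key property this sketches is that question answering and proactive output share one loop over the stream, rather than running as a decoupled trigger-response pipeline.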
Community
🔥 We've open-sourced the model weights and demo code. Feel free to try them out!
the most interesting bit to me is how aura handles long-horizon context with a smart cache and a tiny persistent memory module, instead of reprocessing everything every frame. they fuse streaming data construction with a unified model so the memory supports both real-time q&a and long-horizon interaction, which is nontrivial under strict latency constraints. the design of the retrieval and cache policy, especially how they decide what to keep, reuse, or drop across scenes, seems to be what actually drives the latency and throughput gains. i worry a bit about edge cases like rapid scene changes or bursts of occlusion that could stress the cache in ways the benchmarks don't capture. btw the arxivlens breakdown helped me parse the method details, if you skim the paper the walkthrough is pretty solid: https://arxivlens.com/PaperView/Details/aura-always-on-understanding-and-real-time-assistance-via-video-streams-57-a3d13574
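The keep/reuse/drop policy the comment speculates about could look something like the following toy cache, which evicts the lowest-scoring entry when full, scoring frames by a mix of recency and an importance estimate (e.g. scene-change magnitude). This is a hypothetical illustration of that kind of policy, not AURA's actual mechanism; all names and weights are assumptions.

```python
import itertools


class FrameCache:
    """Hypothetical keep/reuse/drop policy: when the cache is full,
    evict the entry with the lowest score, where score blends recency
    with a per-frame importance estimate. Not AURA's actual policy."""

    def __init__(self, capacity=4, recency_weight=0.5):
        self.capacity = capacity
        self.recency_weight = recency_weight
        self.entries = {}  # key -> (timestamp, importance)
        self.clock = itertools.count()

    def _score(self, ts, importance, now):
        # recent and important frames score high; stale, low-importance ones get dropped
        recency = 1.0 / (1.0 + (now - ts))
        return self.recency_weight * recency + (1 - self.recency_weight) * importance

    def add(self, key, importance):
        now = next(self.clock)
        if len(self.entries) >= self.capacity and key not in self.entries:
            victim = min(self.entries, key=lambda k: self._score(*self.entries[k], now))
            del self.entries[victim]
        self.entries[key] = (now, importance)


cache = FrameCache(capacity=3)
cache.add("f0", importance=0.9)   # scene change, kept despite its age
cache.add("f1", importance=0.1)
cache.add("f2", importance=0.1)
cache.add("f3", importance=0.5)   # cache full: lowest-scoring frame is evicted
print(sorted(cache.entries))
```

A policy like this also makes the commenter's worry concrete: a burst of rapid scene changes inflates many importance scores at once, which can crowd out older but still-relevant context.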