arxiv:2508.13470

STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models

Published on Aug 19, 2025

Authors:

Abstract

STER-VLM enhances vision-language models for traffic analysis through caption decomposition, temporal frame selection, reference-driven understanding, and curated prompts, achieving improved semantic richness and real-world applicability.

AI-generated summary

Vision-language models (VLMs) have emerged as powerful tools for enabling automated traffic analysis; however, current approaches often demand substantial computational resources and struggle with fine-grained spatio-temporal understanding. This paper introduces STER-VLM, a computationally efficient framework that enhances VLM performance through (1) caption decomposition to tackle spatial and temporal information separately, (2) temporal frame selection with best-view filtering for sufficient temporal information, and (3) reference-driven understanding for capturing fine-grained motion and dynamic context and (4) curated visual/textual prompt techniques. Experimental results on the WTS kong2024wts and BDD BDD datasets demonstrate substantial gains in semantic richness and traffic scene interpretation. Our framework is validated through a decent test score of 55.655 in the AI City Challenge 2025 Track 2, showing its effectiveness in advancing resource-efficient and accurate traffic analysis for real-world applications.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2508.13470 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2508.13470 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2508.13470 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.