The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
Abstract
Large audio-language models are enhanced with spatial understanding through a hierarchical framework, scalable simulation pipeline, unified model architecture, and progressive training approach for comprehensive audio scene analysis.
Existing large audio-language models perceive the world as "mono"-a single stream of audio that ignores the critical spatial dimension ("where") required for universal audio scene analysis (ASA). To bridge this gap, we first introduce a hierarchical framework for audio scene analysis. Guided by this framework, we introduce a system that enables large audio-language models (LALMs) to understand and reason about the complex acoustic world. Our system endows LALMs with universal spatial understanding through four key innovations: (1) A scalable simulation pipeline that synthesizes high-quality First-Order-Ambisonics(FOA) data; (2) A unified model framework that integrates universal spatial encoding with a dense hybrid projection mechanism to bridge the modality gap; (3) A progressive training curriculum that evolves from representation alignment to reinforcement learning-based reasoning; and (4) A comprehensive benchmark for audio scene analysis (ASA) designed to rigorously evaluate atomic perception, relational integration, and cognitive reasoning capabilities, on which our model demonstrates comparatively strong capability for spatial understanding. Our work provides a clear pathway for leveraging the powerful reasoning abilities of LALMs towards holistic ASA, advancing from "mono" semantic recognition to spatial intelligence.
Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper