arxiv:2603.16859

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Published on Mar 17
· Submitted by
Huang
on Mar 18
Abstract

SocialOmni presents a benchmark for evaluating social interactivity in omni-modal large language models across speaker identification, interruption timing, and natural interruption generation, revealing gaps between perceptual accuracy and conversational competence.

AI-generated summary

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity: the fundamental capacity to navigate dynamic cues in natural dialogue. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios that test model robustness. We benchmark 12 leading OLMs, uncovering significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. Encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.

Community

Paper author Paper submitter

New omni-model benchmark on social interaction.

🔗Github: github.com/MAC-AutoML/SocialOmni
🔗Dataset: huggingface.co/datasets/alexisty/SocialOmni

