InteractiveOmni

InteractiveOmni-4B 🤗 | InteractiveOmni-8B 🤗 | 📑 Paper

Introduction

InteractiveOmni is a unified omni-modal model that can simultaneously accept image, audio, text, and video inputs and directly generate coherent text and speech streams, enabling truly integrated interaction.

Schematic diagram of multi-turn audio-visual interaction.

Key Features

  • Strong Performance Across Modalities: Exhibits omni-modal understanding and speech generation capabilities, outperforming similarly sized vision-language, audio-language, and omni-modal models.
  • State-of-the-Art Performance: Achieves SOTA results on open-source benchmarks for image, audio, and video understanding, as well as speech conversation.
  • Excellent Interactive Performance: Delivers a more intelligent audio-visual experience through multi-turn and long-term memory capabilities.
  • Multi-turn Interactive Benchmarks: Proposes multi-modal multi-turn benchmarks to evaluate the multi-turn memory and speech interaction of leading MLLMs.
  • On-device Model: The 4B model achieves 97% of the 8B model's performance with just 50% of its size.

Model Architecture

Quickstart

Get the Code

git clone https://github.com/SenseTime-FVG/InteractiveOmni.git
cd InteractiveOmni
pip install -r requirements.txt

We provide example code for running InteractiveOmni with 🤗 Transformers.

Please use transformers>=4.51.0 and FlashAttention2 to ensure the model works correctly.
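
As a quick sanity check before loading the model, the following minimal sketch verifies the transformers version and that the flash_attn package is importable. It is illustrative only and not part of the repository.

# Minimal environment sanity check (illustrative; not part of the repository).
import importlib.util

import transformers
from packaging import version  # packaging ships as a transformers dependency

assert version.parse(transformers.__version__) >= version.parse("4.51.0"), \
    "InteractiveOmni expects transformers>=4.51.0"
assert importlib.util.find_spec("flash_attn") is not None, \
    "FlashAttention2 (the flash-attn package) is recommended for correct behavior"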

Model Loading

import torch
from transformers import AutoTokenizer, AutoModel
path = "sensefvg/InteractiveOmni-8B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).eval().cuda()
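
If a single GPU cannot hold the bfloat16 checkpoint, sharding the load across several GPUs is a possible alternative. The sketch below uses the standard 🤗 Transformers device_map="auto" path (which requires accelerate); whether the model's custom remote code fully supports sharded loading is an assumption, so treat this as a starting point rather than a verified recipe.

import torch
from transformers import AutoModel

path = "sensefvg/InteractiveOmni-8B"
# Shard the weights across all visible GPUs instead of calling .cuda();
# requires the `accelerate` package. Support by the custom remote code is assumed.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True).eval()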

Inference with Transformers

import torch
from transformers import AutoModel, AutoTokenizer
import torchaudio

path = "sensefvg/InteractiveOmni-8B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=True)

# set the max number of tiles in `max_num`
max_num = 12
frame = 8
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
messages = [
    {
        'role': "user",
        'content': 'Hello, who are you?',
    }
]
response = model.chat(tokenizer, generation_config, messages)

# audio conversation
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "audio",
                "audio": "assets/hello_en.wav"
            }
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages)

## Generate both audio and text output
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "audio",
                "audio": "assets/hello_zh.wav"
            }
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result.wav", wav_response.cpu(), 24000, format="wav")

# image-text conversation
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "image",
                "image": 'assets/cat_cup.jpeg'
            },
            {
                "type": "text",
                "text": "Please describe the image shortly."
            }
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages, max_num)

# image-audio conversation
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "image",
                "image": 'assets/cat_cup.jpeg'
            },
            {
                "type": "audio",
                "audio": "assets/describe_img_en.wav"
            }
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages, max_num)

## image-audio conversation, generate both audio and text output
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "image",
                "image": 'assets/cat_cup.jpeg'
            },
            {
                "type": "audio",
                "audio": "assets/describe_img_en.wav"
            }
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result.wav", wav_response.cpu(), 24000, format="wav")

# video conversation
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "video",
                "video": 'video_path'
            },
            {
                "type": "text",
                "text": "Describe this video in detail."
            }
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages, max_num, frame)
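
Because InteractiveOmni emphasizes multi-turn memory, a follow-up turn can reuse the earlier conversation. The sketch below assumes that model.chat accepts prior assistant turns appended to messages as conversation history; the exact history format is an assumption, so consult the repository examples for the canonical API.

# multi-turn conversation (a sketch; the history format is an assumption)
messages = [
    {
        'role': "user",
        'content': [
            {"type": "image", "image": 'assets/cat_cup.jpeg'},
            {"type": "text", "text": "Please describe the image shortly."}
        ]
    }
]
first_response = model.chat(tokenizer, generation_config, messages, max_num)

# Append the model's reply, then ask a follow-up that relies on the earlier turn.
messages.append({'role': "assistant", 'content': first_response})
messages.append({
    'role': "user",
    'content': [
        {"type": "text", "text": "What color is the cup you just described?"}
    ]
})
follow_up_response = model.chat(tokenizer, generation_config, messages, max_num)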

Use audio output

  • If audio output is required, the system prompt must be set as follows; otherwise the audio output may not work as expected.
You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech.
messages = [
    {
        "role": "system",
        "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
    },
    {
        'role': "user",
        'content': [
            {
                "type": "audio",
                "audio": "assets/hello_zh.wav",
            }
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result_none_speaker.wav", wav_response.cpu(), 24000, format="wav")
  • Use the default speaker embedding to generate the output audio.
messages = [
    {
        "role": "system",
        "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
    },
    {
        'role': "user",
        'content': [
            {
                "type": "audio",
                "audio": "assets/hello_zh.wav",
            }
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True, speaker_embedding=model.default_speaker_embedding)
torchaudio.save("result_default_speaker.wav", wav_response.cpu(), 24000, format="wav")
  • Use a custom speaker embedding to generate the output audio, similar to voice cloning (a sketch for caching the extracted embedding follows the example below).
messages = [
    {
        "role": "system",
        "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
    },
    {
        'role': "user",
        'content': [
            {
                "type": "audio",
                "audio": "assets/hello_zh.wav",
            }
        ]
    }
]
speaker_embedding = model.extract_speaker_embedding("assets/hello_zh.wav")
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True, speaker_embedding=speaker_embedding)
torchaudio.save("result_custom_speaker.wav", wav_response.cpu(), 24000, format="wav")

Evaluation

InteractiveOmni achieves state-of-the-art performance across a wide range of multi-modal understanding and speech generation benchmarks.

Image Understanding

| Model | MMBench | MMStar | MMMU | MathVista | HallusionBench | AI2D | OCRBench | Avg |
|---|---|---|---|---|---|---|---|---|
| Vision-Language Model | | | | | | | | |
| InternVL3-8B | 82.1 | 68.7 | 62.2 | 70.5 | 49.0 | 85.1 | 88.4 | 72.3 |
| InternVL3.5-8B | 79.5 | 69.3 | 73.4 | 78.4 | 54.5 | 84.0 | 84.0 | 74.7 |
| Qwen2.5-VL-7B | 82.2 | 64.1 | 58.0 | 68.1 | 51.9 | 84.3 | 88.8 | 71.1 |
| Omni Model | | | | | | | | |
| GPT-4o-mini | 76.0 | 54.8 | 60.0 | 52.5 | 46.1 | 77.8 | 78.5 | 63.7 |
| VITA-1.5 | 76.8 | 60.2 | 52.6 | 66.2 | 44.6 | 79.2 | 74.1 | 64.8 |
| Ming-Lite-Omni | 80.8 | 64.7 | 56.3 | 71.6 | 55.0 | 83.1 | 88.4 | 71.4 |
| Qwen2.5-Omni-7B | 81.3 | 64.0 | 59.2 | 67.9 | 47.4 | 83.2 | 83.4 | 69.5 |
| InteractiveOmni-4B | 78.9 | 62.6 | 61.1 | 61.7 | 52.2 | 83.8 | 80.0 | 68.6 |
| InteractiveOmni-8B | 81.4 | 66.8 | 66.9 | 68.0 | 61.3 | 84.3 | 83.7 | 73.2 |
Video Understanding

| Model | Video-MME (wo sub) | Video-MME (w sub) | MLVU (M-Avg) | LongVideoBench (val total) | Avg |
|---|---|---|---|---|---|
| Vision-Language Model | | | | | |
| InternVL3-8B | 66.3 | 68.9 | 71.4 | 58.8 | 66.4 |
| InternVL3.5-8B | 66.0 | 68.6 | 70.2 | 62.1 | 66.7 |
| Qwen2.5-VL-7B | 65.1 | 71.6 | 70.2 | 56.0 | 64.5 |
| Omni Model | | | | | |
| GPT-4o-mini | 64.8 | - | - | - | - |
| Qwen2.5-Omni-7B | 64.3 | 72.4 | - | - | - |
| InteractiveOmni-4B | 63.3 | 69.3 | 68.0 | 57.0 | 64.4 |
| InteractiveOmni-8B | 66.0 | 71.8 | 71.6 | 59.1 | 67.1 |
Audio Understanding

| Dataset | Qwen2-Audio | Step-Audio-Chat | Kimi-Audio | Qwen2.5-Omni-7B | InteractiveOmni-4B | InteractiveOmni-8B |
|---|---|---|---|---|---|---|
| ASR (WER) | | | | | | |
| Wenetspeech test-net | 10.60 | 8.75 | 5.37 | 5.90 | 5.40 | 5.04 |
| Wenetspeech test-meeting | 10.68 | 9.52 | 6.28 | 7.70 | 6.95 | 5.55 |
| LibriSpeech test-clean | 1.60 | 3.19 | 1.28 | 1.80 | 1.73 | 1.64 |
| LibriSpeech test-other | 3.60 | 10.67 | 2.42 | 3.40 | 3.69 | 3.41 |
| Aishell-2 IOS | 4.48 | 3.57 | 2.56 | 2.56 | 2.85 | 2.18 |
| ChildMandarin | 14.62 | - | - | 19.34 | 17.21 | 14.03 |
| Audio Understanding | | | | | | |
| MMAU | 56.60 | - | 65.20 | 65.60 | 72.00 | 67.39 |
| MELD | 55.30 | 33.54 | 59.13 | 57.00 | 57.16 | 57.55 |
| ClothoAQA dev | 72.63 | 44.98 | 73.18 | 73.12 | 71.91 | 72.98 |
| ClothoAQA test | 71.73 | 45.84 | 71.24 | 72.86 | 71.28 | 74.49 |
Omni-modal Understanding

| Model | Speech | Sound Event | Music | Avg |
|---|---|---|---|---|
| OmniBench | | | | |
| MiniCPM-o-2.6 | - | - | - | 40.50 |
| Baichuan-Omni-1.5 | - | - | - | 42.90 |
| Qwen2.5-Omni-7B | 55.25 | 60.00 | 52.83 | 56.13 |
| InteractiveOmni-4B | 60.70 | 61.51 | 42.45 | 59.19 |
| InteractiveOmni-8B | 60.18 | 62.64 | 55.66 | 60.33 |
Speech-to-text

OpenAudioBench

| Model | Reasoning QA | Llama Questions | Web Questions | TriviaQA | AlpacaEval | Avg |
|---|---|---|---|---|---|---|
| Qwen2-Audio | 42.77 | 69.67 | 45.20 | 40.30 | 57.19 | 51.03 |
| GLM-4-Voice | 47.43 | 76.00 | 55.40 | 51.80 | 57.89 | 57.70 |
| VITA-1.5 | 41.00 | 74.20 | 57.30 | 46.80 | 68.20 | 57.50 |
| Step-Audio-chat | 60.00 | 72.33 | 73.00 | 56.80 | 56.53 | 63.73 |
| Baichuan-Audio | 41.90 | 78.40 | 64.50 | 61.70 | 77.40 | 64.78 |
| Kimi-Audio | 58.02 | 79.33 | 70.20 | 62.10 | 75.73 | 69.08 |
| MiniCPM-o-2.6 | 38.60 | 77.80 | 68.60 | 61.90 | 51.80 | 59.74 |
| Baichuan-Omni-1.5 | 50.00 | 78.50 | 59.10 | 57.20 | 77.90 | 64.54 |
| Qwen2.5-Omni-7B | 63.76 | 75.33 | 62.80 | 57.06 | 72.76 | 66.34 |
| InteractiveOmni-4B | 69.11 | 79.33 | 65.80 | 56.40 | 74.87 | 69.10 |
| InteractiveOmni-8B | 71.68 | 80.67 | 70.30 | 66.50 | 74.57 | 72.74 |

VoiceBench

| Model | AlpacaEval | CommonEval | WildVoice | SD-QA | MMSU |
|---|---|---|---|---|---|
| Qwen2-Audio | 3.69 | 3.40 | 3.01 | 35.35 | 35.43 |
| GLM-4-Voice | 4.06 | 3.48 | 3.18 | 43.31 | 40.11 |
| VITA-1.5 | 4.21 | 3.66 | 3.48 | 38.88 | 52.15 |
| Step-Audio-chat | 3.99 | 2.99 | 2.93 | 46.84 | 28.72 |
| Baichuan-Audio | 4.41 | 4.08 | 3.92 | 45.84 | 53.19 |
| Kimi-Audio | 4.46 | 3.97 | 4.20 | 63.12 | 62.17 |
| MiniCPM-o-2.6 | 4.42 | 4.15 | 3.94 | 50.72 | 54.78 |
| Baichuan-Omni-1.5 | 4.50 | 4.05 | 4.06 | 43.40 | 57.25 |
| Qwen2.5-Omni-7B | 4.50 | 3.84 | 3.89 | 56.40 | 61.32 |
| InteractiveOmni-4B | 4.27 | 4.20 | 3.94 | 41.41 | 63.24 |
| InteractiveOmni-8B | 4.61 | 4.34 | 4.21 | 44.67 | 65.26 |

VoiceBench (continued)

| Model | OpenBookQA | IFEval | BBH | AdvBench | Avg |
|---|---|---|---|---|---|
| Qwen2-Audio | 49.01 | 54.70 | 22.57 | 98.85 | 55.32 |
| GLM-4-Voice | 52.97 | 52.80 | 24.91 | 88.08 | 57.40 |
| VITA-1.5 | 71.65 | 55.30 | 38.14 | 97.69 | 64.53 |
| Step-Audio-chat | 31.87 | 50.60 | 29.19 | 65.77 | 50.13 |
| Baichuan-Audio | 71.65 | 54.80 | 50.31 | 99.42 | 69.27 |
| Kimi-Audio | 83.52 | 69.70 | 61.10 | 100.0 | 76.91 |
| MiniCPM-o-2.6 | 78.02 | 60.40 | 49.25 | 97.69 | 71.23 |
| Baichuan-Omni-1.5 | 74.51 | 62.70 | 54.54 | 97.31 | 71.32 |
| Qwen2.5-Omni-7B | 80.90 | 66.70 | 53.50 | 99.20 | 73.60 |
| InteractiveOmni-4B | 82.64 | 55.90 | 60.90 | 99.62 | 73.10 |
| InteractiveOmni-8B | 86.37 | 73.30 | 57.99 | 99.42 | 76.69 |
Speech Generation

| Model | test-zh | test-en | test-zh-hard |
|---|---|---|---|
| TTS Model | | | |
| MaskGCT | 2.27 | 2.62 | 10.27 |
| SeedTTS | 1.12 | 2.25 | 7.59 |
| CosyVoice 2 | 1.45 | 2.57 | 6.83 |
| MLLM | | | |
| MinMo | 2.48 | 2.90 | - |
| Ming-Lite-Omni | 1.69 | 4.31 | - |
| Qwen2.5-Omni-7B | 1.70 | 2.72 | 7.97 |
| InteractiveOmni-4B | 1.37 | 3.73 | 8.02 |
| InteractiveOmni-8B | 1.56 | 2.33 | 7.92 |

Citation

If you find our paper and code useful in your research, please cite our technical report.

@misc{tong2025interactiveomniunifiedomnimodalmodel,
      title={InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue}, 
      author={Wenwen Tong and Hewei Guo and Dongchuan Ran and Jiangnan Chen and Jiefan Lu and Kaibin Wang and Keqiang Li and Xiaoxu Zhu and Jiakui Li and Kehan Li and Xueheng Li and Lumin Li and Chenxu Guo and Jiasheng Zhou and Jiandong Chen and Xianye Wu and Jiahao Wang and Silei Wu and Lei Chen and Hanming Deng and Yuxuan Song and Dinghao Zhou and Guiping Zhong and Ken Zheng and Shiyin Kang and Lewei Lu},
      year={2025},
      eprint={2510.13747},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13747}, 
}