	---
	license: apache-2.0
	---


	<h1 align="center">Hello-Chat</h1>
	<h3 align="center">Towards Realistic Social Audio Interactions</h3>

	<p align="center">
	<a href='https://arxiv.org/abs/2602.23387'><img src='https://img.shields.io/badge/arXiv-2602.23387-b31b1b.svg'></a>
	<a href="https://github.com/hellogroup-opensource/Hello-Chat"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub"></a>
	<a href="https://huggingface.co/hellogroup-opensource/Hello-Chat"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow" alt="Hugging Face"></a>
	</p>

	<p align="center">
	<img src="assets/img/model_architecture.png" width="100%" alt="Hello-Chat model architecture.">
	</p>

	## Hello-Chat

	Hello-Chat is an end-to-end Large Audio Language Model (LALM) tailored for real-world conversational scenarios. The model achieves state-of-the-art performance on specific understanding benchmarks and significantly outperforms existing open-source systems in prosodic naturalness, emotional accuracy, and interaction fluency. By explicitly modeling fine-grained acoustic perception and cross-modal alignment, Hello-Chat enables realistic, context-aware spoken interaction between users and AI.

	## 📊 Evaluation Results

	### Evaluation of Audio to Text

	#### Audio Understanding Evaluation
	ASR: Automatic speech recognition is evaluated on a balanced subset of AIShell, WeNet, and LibriSpeech, with Chinese and English samples evenly represented.<br>
	NLP Question: Question answering uses prompts sourced from AlpacaEval, LLaMA Questions, and Web Questions. Text inputs are converted to speech with a high-quality TTS system, and model responses are scored by GPT-5.<br>
	Translation: Speech-to-text translation across Chinese, English, Japanese, and Korean is evaluated on synthetic multilingual data generated by Claude and converted to speech via TTS. Outputs are scored by GPT-5.<br>
	MMAU: Audio-based question answering is evaluated on a subset of the MMAU-Mini benchmark.

	| Model | ASR ↓ | NLP Question ↑ | Translation ↑ | MMAU ↑ |
	|---|---|---|---|---|
	| Gemini3-Preview | 4.06 | 8.85 | 8.87 | 0.75 |
	| GPT-4o-Audio | 6.45 | 8.50 | 8.09 | 0.64 |
	| Qwen3-Omni-32B | 3.51 | 8.66 | 8.07 | 0.74 |
	| Step-Audio 2 Mini | 3.21 | 7.32 | 8.34 | 0.66 |
	| MiDashengLM | 4.50 | 3.82 | 8.43 | 0.65 |
	| Kimi-Audio | 3.36 | 7.41 | 8.26 | 0.59 |
	| Qwen2.5-Omni-7B | 3.45 | 7.41 | 5.93 | 0.66 |
	| Hello-Chat | 3.48 | 7.68 | 8.93 | 0.69 |
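
The ASR column is lower-is-better, which suggests an error-rate metric, although the exact formula is not stated here. As a rough illustration only, the sketch below computes an edit-distance error rate over mixed Chinese and English text. The tokenization rule (one token per Chinese character, one per English word) and the function names `tokenize`, `edit_distance`, and `error_rate` are assumptions made for this example, not the evaluation code behind the table.

```python
import re


def tokenize(text: str) -> list[str]:
    """Assumed rule: one token per CJK character, one token per Latin word."""
    return re.findall(r"[\u4e00-\u9fff]|[a-z0-9']+", text.lower())


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length, as a percentage."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

For instance, `error_rate("hello world", "hello word")` counts one substitution against two reference words.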

	#### Performance of Paralinguistic Understanding
	SER (speech emotion recognition): evaluated on randomly sampled subsets of the EmoBox dataset, covering both Chinese and English speech.<br>
	AED (audio event detection): evaluated on samples drawn from AudioSet and CochlScene.

	| Model | SER ↑ | AED ↑ |
	|---|---|---|
	| Gemini3-Preview | 0.791 | 0.861 |
	| GPT-4o-Audio | 0.586 | 0.489 |
	| Qwen3-Omni-32B | 0.856 | 0.644 |
	| Step-Audio 2 Mini | 0.680 | 0.533 |
	| MiDashengLM | 0.561 | 0.441 |
	| Kimi-Audio | 0.625 | 0.392 |
	| Qwen2.5-Omni-7B | 0.607 | 0.584 |
	| Hello-Chat | 0.824 | 0.797 |

	#### Instruction Following
	Only Yes: To evaluate robustness in instruction following, we construct a stress test from audio inputs randomly sampled from the above benchmarks. Every input is paired with a fixed prompt: “no matter the message in the audio, simply answer ‘yes’!”

	| Model | Only-Yes Accuracy (%) ↑ |
	|---|---|
	| Gemini3-Preview | 88 |
	| GPT-4o-Audio | 23 |
	| Qwen3-Omni-32B | 100 |
	| Step-Audio 2 Mini | 87 |
	| MiDashengLM | 0 |
	| Kimi-Audio | 22 |
	| Qwen2.5-Omni-7B | 96 |
	| Hello-Chat | 100 |
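
How a response is judged to be "only yes" is not specified above. One plausible scoring rule, shown here as a minimal sketch, counts a response as correct when it reduces to the single word "yes" after surrounding punctuation and whitespace are stripped. The helper names `is_only_yes` and `only_yes_accuracy` and the exact normalization are assumptions for illustration.

```python
import string

# Assumption: characters allowed to surround a bare "yes" answer.
_STRIP_CHARS = string.punctuation + string.whitespace + "“”‘’。！，"


def is_only_yes(response: str) -> bool:
    """True when the response is just "yes", ignoring case and edge punctuation."""
    return response.strip(_STRIP_CHARS).lower() == "yes"


def only_yes_accuracy(responses: list[str]) -> float:
    """Percentage of responses that consist solely of "yes"."""
    if not responses:
        raise ValueError("responses must not be empty")
    hits = sum(is_only_yes(r) for r in responses)
    return 100.0 * hits / len(responses)
```

Under this rule, a response such as "Yes, the speaker says hello." is counted as a failure, because the model answered the audio content instead of following the fixed prompt.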

	### Evaluation of Text to Speech
	Seed-TTS-Eval: We evaluate on the Chinese subset of the Seed-TTS-Eval benchmark, following the official protocol. CER is the character error rate, and SS is the speaker similarity between the generated and reference audio.<br>
	Conversational-style Mean Opinion Score (CMOS): Native speakers took part in a blind test, rating each sample on a 5-point scale (1–5). A higher score indicates a more authentic, human-like conversational flow and better alignment with the dialogue intent.

	| Model | CMOS ↑ | CER (%) ↓ | SS ↑ |
	|---|---|---|---|
	| F5-TTS | 3.48 | 1.56 | 0.741 |
	| CosyVoice 2 | 3.66 | 1.45 | 0.748 |
	| CosyVoice 3-0.5B | 3.59 | 1.16 | 0.780 |
	| Qwen2.5-Omni-7B | - | 1.70 | 0.752 |
	| Qwen3-TTS-12Hz-0.6B-Base | 4.12 | 0.92 | 0.763 |
	| FireRedTTS-2 | 3.68 | 1.14 | 0.736 |
	| IndexTTS2 | 4.16 | 1.008 | 0.764 |
	| Hello-Chat | 4.19 | 1.023 | 0.748 |
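
Speaker-similarity scores of this kind are commonly computed as the cosine similarity between a speaker embedding of the reference audio and one of the generated audio. The sketch below shows only that final step and assumes the embeddings have already been extracted by some speaker-verification model. The function name `cosine_similarity` is hypothetical, and neither the embedding model nor this code is taken from the Seed-TTS-Eval toolkit.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

The result lies in [-1, 1], with values closer to 1 indicating that the generated voice is closer to the reference speaker in the embedding space.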

	## 🎧 Demos

	### Single-Sentence Demo (zero-shot)


	#### Speaker1
	reference:
	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/female1.mp3"></audio>

	generated:
	##### “那肯定因为自个儿平时想吃点卤味儿。那肯定得得得来一点儿。”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female1_sent1.mp3"></audio>


	##### “过年应该应该跟家里人一起吃饭。”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female1_sent2.mp3"></audio>


	##### “哎呀，不是了，现在法治社会哪有卖假货的，只是卖的价格贵。”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female1_sent3.mp3"></audio>

	---

	#### Speaker2
	reference:
	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/female2.mp3"></audio>

	generated:
	##### “但是这个时候上哪去找呢？找不到。”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female2_sent4.mp3"></audio>


	##### “这种做法我感觉不适合，不是他那个年龄段该做出来的事情。”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female2_sent5.mp3"></audio>


	##### “咱们得趁这个时机啊，看看还要剩多多久啊。”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female2_sent6.mp3"></audio>

	---

	#### Speaker3
	reference:
	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/male1.mp3"></audio>

	generated:
	##### “我我不不怎么玩游戏，你你会玩游戏啊。”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male1_sent7.mp3"></audio>


	##### “对呀，就是不管你愿不愿意，时间都是一直往前推嘛。”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male1_sent8.mp3"></audio>


	##### “挺好，我看着我看你做菜做饭蛮有生活的那是鸡蛋糕吗？”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male1_sent9.mp3"></audio>

	---

	#### Speaker4
	reference:
	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/male2.mp3"></audio>

	generated:
	##### “我也有二十多岁的时候，那个时候什么都不想，嗯，等那一点点沉淀，年龄大一点了，然后就什么都在乎，什么都想。”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male2_sent10.mp3"></audio>


	##### “我看我一会儿，我我煮个泡面得了。”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male2_sent11.mp3"></audio>


	##### “他们说那个茶茶饼就是渣子压出来的，是吗？”

	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male2_sent12.mp3"></audio>

	---

	### Multi-Turn Conversation Demo (zero-shot)

	#### Conversation #1
	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/dialogues/demo_dialogue1.mp3"></audio>

	---

	#### Conversation #2
	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/dialogues/demo_dialogue2.mp3"></audio>

	---

	#### Conversation #3
	<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/dialogues/demo_dialogue3.mp3"></audio>


	## 📜 Citation

	If you find our work useful in your research, please consider citing:

	```bibtex
	@article{hellogroup2026hellochat,
	  title={Hello-Chat: Towards Realistic Social Audio Interactions},
	  author={{Computational Intelligence Dept, HelloGroup Inc.}},
	  journal={arXiv preprint arXiv:2602.23387},
	  year={2026}
	}
	```