---
license: other
title: Audio Flamingo 3 Demo
sdk: gradio
emoji: 🚀
colorFrom: green
colorTo: green
pinned: true
short_description: Audio Flamingo 3 Demo
---
# Audio Flamingo 3 🔥🚀🔥

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models

## Overview

This repo contains the PyTorch implementation of [Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models](). Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Model (LALM) that advances reasoning and understanding across speech, sounds, and music. AF3 builds on previous work with innovations in:

- Unified audio representation learning (speech, sound, music)
- Flexible, on-demand chain-of-thought reasoning (Thinking in Audio)
- Long-context audio comprehension (up to 10 minutes of audio, including speech)
- Multi-turn, multi-audio conversational dialogue (AF3-Chat)
- Voice-to-voice interaction (AF3-Chat)

Extensive evaluations confirm AF3's effectiveness, setting new benchmarks on over 20 public audio understanding and reasoning tasks.

## Main Results

Audio Flamingo 3 outperforms prior SOTA models including GAMA, Audio Flamingo, Audio Flamingo 2, Qwen-Audio, Qwen2-Audio, Qwen2.5-Omni, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2, and Gemini Pro v1.5 on a number of understanding and reasoning benchmarks.
## Audio Flamingo 3 Architecture

Audio Flamingo 3 uses the AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat). Audio Flamingo 3 can take up to 10 minutes of audio as input.
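The forward path can be summarized as audio encoder → MLP adaptor → LLM. Below is a minimal PyTorch sketch of that flow; the class name `AudioAdaptor`, the feature dimensions, and the toy tensors are illustrative assumptions for exposition, not the repository's actual modules.

```python
# Illustrative sketch of the AF3 forward path (audio -> encoder -> MLP adaptor -> LLM).
# Names and dimensions are assumptions; see the repo for the real implementation.
import torch
import torch.nn as nn

class AudioAdaptor(nn.Module):
    """MLP that projects audio-encoder features into the LLM embedding space."""
    def __init__(self, audio_dim: int = 1280, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, num_frames, audio_dim) from the AF-Whisper encoder
        return self.proj(audio_feats)  # (batch, num_frames, llm_dim)

# Toy usage: project dummy encoder features and prepend them to text embeddings,
# the form in which the decoder-only backbone (Qwen2.5-7B) would consume them.
adaptor = AudioAdaptor()
audio_feats = torch.randn(1, 750, 1280)  # stand-in for AF-Whisper output
text_embeds = torch.randn(1, 32, 3584)   # stand-in for text token embeddings
llm_inputs = torch.cat([adaptor(audio_feats), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 782, 3584])
```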
## Installation

```bash
./environment_setup.sh af3
```

## Code Structure

- The folder `audio_flamingo_3/` contains the main training and inference code of Audio Flamingo 3.
- The folder `audio_flamingo_3/scripts` contains the inference scripts of Audio Flamingo 3, in case you would like to use our pretrained checkpoints on Hugging Face.

Each folder is self-contained, and we expect no cross dependencies between these folders. This repo does not contain the code for the Streaming-TTS pipeline, which will be released in the near future.

## Single-Line Inference

To run inference with the stage 3 model directly, use the command below:

```bash
python llava/cli/infer_audio.py \
    --model-base /path/to/checkpoint/af3-7b \
    --conv-mode auto \
    --text "Please describe the audio in detail" \
    --media static/audio1.wav
```

To run inference with the stage 3.5 model, use the command below (a batch-inference sketch built on these commands appears at the end of this README):

```bash
python llava/cli/infer_audio.py \
    --model-base /path/to/checkpoint/af3-7b \
    --model-path /path/to/checkpoint/af3-7b/stage35 \
    --conv-mode auto \
    --text "Please describe the audio in detail" \
    --media static/audio1.wav \
    --peft-mode
```

## References

The main training and inference code within each folder is modified from [NVILA](https://github.com/NVlabs/VILA/tree/main) ([Apache license](incl_licenses/License_1.md)).

## License

- The code in this repo is under the [MIT license](incl_licenses/MIT_license.md).
- The checkpoints are for non-commercial use only ([NVIDIA OneWay Noncommercial License](incl_licenses/NVIDIA_OneWay_Noncommercial_License.docx)). They are also subject to the [Qwen Research license](https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE), the [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI, and the original licenses accompanying each training dataset.
- Notice: Audio Flamingo 3 is built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.

## Citation

- Audio Flamingo 2

```bibtex
@article{ghosh2025audio,
  title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
  author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2503.03983},
  year={2025}
}
```

- Audio Flamingo

```bibtex
@inproceedings{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
  booktitle={International Conference on Machine Learning},
  pages={25125--25148},
  year={2024},
  organization={PMLR}
}
```
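## Batch Inference (Sketch)

As referenced in the Single-Line Inference section, here is a minimal sketch of batch inference over a folder of audio files by shelling out to the CLI documented above. The wrapper itself is not part of the repo; only the script path and flags are taken from this README, and the checkpoint path is a placeholder.

```python
# Minimal sketch: batch inference by wrapping the documented CLI with subprocess.
# Only the script path and flags come from this README; the wrapper is illustrative.
import subprocess
from pathlib import Path

CHECKPOINT = "/path/to/checkpoint/af3-7b"  # placeholder, as in the commands above

def describe_audio(media: Path, prompt: str = "Please describe the audio in detail") -> str:
    """Run the stage 3 model on one file and return the CLI's stdout."""
    result = subprocess.run(
        [
            "python", "llava/cli/infer_audio.py",
            "--model-base", CHECKPOINT,
            "--conv-mode", "auto",
            "--text", prompt,
            "--media", str(media),
        ],
        capture_output=True,
        text=True,
        check=True,  # raise if the CLI exits with a non-zero status
    )
    return result.stdout.strip()

if __name__ == "__main__":
    # Describe every .wav file in static/, one CLI invocation per file.
    for wav in sorted(Path("static").glob("*.wav")):
        print(wav.name, "->", describe_audio(wav))
```

Note that this reloads the model on every call; for many files, a loop inside a single process using the repo's own inference code would be faster.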