---

library_name: transformers
tags:
- automatic-speech-recognition
- audio-visual-speech-recognition
- multimodal
- speech-recognition
- lip-reading
- cocktail-party
- noise-robust
- av-hubert
- transformer
- pytorch
- audio
- video
- english
- lrs2
- voxceleb2
- ctc
- attention
- beam-search
- multi-speaker
- noisy-speech
datasets:
- nguyenvulebinh/AVYT
language:
- en
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---


# AVSRCocktail: Audio-Visual Speech Recognition for Cocktail Party Scenarios

**Official implementation** of "[Cocktail-Party Audio-Visual Speech Recognition](https://arxiv.org/abs/2506.02178)" (Interspeech 2025).

A robust audio-visual speech recognition system designed for multi-speaker environments and noisy cocktail-party scenarios. The model combines lip reading with audio processing to remain accurate in challenging acoustic conditions with background noise and speaker interference.

## Getting Started

### Sections
1. <a href="#install">Installation</a>
2. <a href="#evaluation">Evaluation</a>
3. <a href="#training">Training</a>

## <a id="install">1. Installation</a>

Follow these steps:

```sh
# Clone the baseline code repo
git clone https://github.com/nguyenvulebinh/AVSRCocktail.git
cd AVSRCocktail

# Create Conda environment
conda create --name AVSRCocktail python=3.11
conda activate AVSRCocktail

# Install FFmpeg if it's not already installed
conda install ffmpeg

# Install dependencies
pip install -r requirements.txt
```
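
Once the environment is set up, an optional sanity check (a minimal sketch, not part of the repository) can confirm that FFmpeg and PyTorch are visible from Python:

```python
# Optional environment sanity check (illustrative sketch only).
import shutil

import torch

print("ffmpeg:", shutil.which("ffmpeg"))  # should print the ffmpeg binary path
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```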

## <a id="evaluation">2. Evaluation</a>

The evaluation script `script/evaluation.py` evaluates the AVSR Cocktail model on multiple datasets under various noise conditions and interference scenarios.

### Quick Start

**Basic evaluation on LRS2 test set:**
```sh
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test
```

**Evaluation on AVCocktail dataset:**
```sh
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0
```

### Supported Datasets

#### 1. LRS2 Dataset
Evaluate on the LRS2 dataset with various noise conditions:

**Available test sets:**
- `test`: Clean test set
- `test_snr_n5_interferer_1`: SNR -5dB with 1 interferer
- `test_snr_n5_interferer_2`: SNR -5dB with 2 interferers  
- `test_snr_0_interferer_1`: SNR 0dB with 1 interferer
- `test_snr_0_interferer_2`: SNR 0dB with 2 interferers
- `test_snr_5_interferer_1`: SNR 5dB with 1 interferer
- `test_snr_5_interferer_2`: SNR 5dB with 2 interferers
- `test_snr_10_interferer_1`: SNR 10dB with 1 interferer
- `test_snr_10_interferer_2`: SNR 10dB with 2 interferers
- `*`: Evaluate on all test sets and report average WER

**Example:**
```sh
# Evaluate on clean test set
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test

# Evaluate on noisy conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test_snr_0_interferer_1

# Evaluate on all conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id "*"
```

#### 2. AVCocktail Dataset
Evaluate on the AVCocktail cocktail party dataset:

**Available video sets:**
- `video_0` to `video_50`: Individual video sessions
- `*`: Evaluate on all video sessions and report average WER

The evaluation reports WER for three different chunking strategies:
- `asd_chunk`: Chunks based on Active Speaker Detection
- `fixed_chunk`: Fixed-duration chunks (see the sketch after this list)
- `gold_chunk`: Ground truth optimal chunks
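
For intuition, here is a minimal sketch of what a fixed-duration chunker might look like; the 15-second window matches the default `--max_length`, but this is an illustration, not the repository's exact segmentation code (`asd_chunk` and `gold_chunk` additionally rely on speaker activity and ground-truth boundaries):

```python
def fixed_chunks(total_s: float, max_len_s: float = 15.0):
    """Yield (start, end) windows of at most max_len_s seconds.

    Illustrative sketch of the fixed_chunk strategy; the repository's
    actual segmentation logic may differ.
    """
    t = 0.0
    while t < total_s:
        yield t, min(t + max_len_s, total_s)
        t += max_len_s

# Example: a 40 s recording becomes [(0.0, 15.0), (15.0, 30.0), (30.0, 40.0)]
print(list(fixed_chunks(40.0)))
```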

**Example:**
```sh
# Evaluate on specific video
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0

# Evaluate on all videos
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id "*"
```

### Configuration Options

#### Model Configuration
- `--model_type`: Model architecture to use (use `avsr_cocktail` for the AVSR Cocktail model)
- `--checkpoint_path`: Path to custom model checkpoint (default: uses pretrained `nguyenvulebinh/AVSRCocktail`)
- `--cache_dir`: Directory to cache downloaded models (default: `./model-bin`)

#### Processing Parameters  
- `--max_length`: Maximum length of video segments in seconds (default: 15)
- `--beam_size`: Beam size for beam search decoding (default: 3)

#### Dataset Parameters
- `--dataset_name`: Dataset to evaluate on (`lrs2` or `AVCocktail`)
- `--set_id`: Specific subset to evaluate (see dataset-specific options above)

#### Output Options
- `--verbose`: Enable verbose output during processing
- `--output_dir_name`: Name of output directory for session processing (default: `output`)

### Advanced Usage

**Custom model checkpoint:**
```sh
python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name lrs2 \
    --set_id test \
    --checkpoint_path ./model-bin/my_custom_model \
    --cache_dir ./custom_cache
```

**Optimized inference settings:**
```sh
python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name AVCocktail \
    --set_id "*" \
    --max_length 10 \
    --beam_size 5 \
    --verbose
```

### Output Format

The evaluation script outputs Word Error Rate (WER) scores:

**LRS2 evaluation output:**
```
WER test: 0.1234
```

**AVCocktail evaluation output:**
```
WER video_0 asd_chunk: 0.1234
WER video_0 fixed_chunk: 0.1456  
WER video_0 gold_chunk: 0.1123
```

When using `--set_id "*"`, the script reports both individual and average WER scores across all test conditions.
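
WER is the ratio of word-level edit operations (substitutions, deletions, insertions) to the number of reference words. As a hedged sketch, it can be computed with the `jiwer` package; the evaluation script's internal scorer may differ:

```python
# Illustrative WER computation with jiwer (pip install jiwer).
# Not necessarily what script/evaluation.py uses internally.
from jiwer import wer

reference = "thank you for coming to the party"
hypothesis = "thank you for common to the party"

print(f"WER: {wer(reference, hypothesis):.4f}")  # 1 substitution / 7 words ≈ 0.1429
```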

## <a id="training">3. Training</a>

### Model Architecture

- **Encoder**: Pre-trained AV-HuBERT large model (`nguyenvulebinh/avhubert_encoder_large_noise_pt_noise_ft_433h`)
- **Decoder**: Transformer decoder with CTC/Attention joint training
- **Tokenization**: SentencePiece unigram tokenizer with 5000 vocabulary units
- **Input**: Video frames are cropped to a 96 × 96 mouth region of interest, and audio is sampled at 16 kHz (see the shape sketch below)
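
To make the expected input concrete, here is a shape-level sketch; the 96 × 96 crop and 16 kHz rate come from the list above, while the 25 fps frame rate and grayscale channel are assumptions based on common AV-HuBERT-style preprocessing:

```python
import numpy as np

FPS = 25              # assumption: typical frame rate for LRS2/VoxCeleb2-style data
SAMPLE_RATE = 16_000  # from the model card
CROP = 96             # mouth ROI side length, from the model card

seconds = 2.0
# T x H x W grayscale mouth crops (grayscale is an assumption)
video = np.zeros((int(seconds * FPS), CROP, CROP), dtype=np.uint8)
# 16 kHz mono waveform
audio = np.zeros(int(seconds * SAMPLE_RATE), dtype=np.float32)

print(video.shape, audio.shape)  # (50, 96, 96) (32000,)
```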

### Training Data

The model is trained on multiple large-scale datasets that have been preprocessed and are ready for the training pipeline. All datasets are hosted on Hugging Face at [nguyenvulebinh/AVYT](https://huggingface.co/datasets/nguyenvulebinh/AVYT) and include:

| Dataset | Size |
|---------|------|
| **LRS2** | ~145k samples |
| **VoxCeleb2** | ~540k samples |
| **AVYT** | ~717k samples |
| **AVYT-mix** | ~483k samples |

More details about these datasets can be found in the [Cocktail-Party Audio-Visual Speech Recognition](https://arxiv.org/abs/2506.02178) paper.

**Dataset Features:**
- **Preprocessed**: All audio-visual data is pre-processed and ready for direct input to the training pipeline
- **Multi-modal**: Each sample contains synchronized audio and video (mouth crop) data
- **Labeled**: Text transcriptions for supervised learning

The training pipeline handles dataset loading automatically and reads data in [streaming mode](https://huggingface.co/docs/datasets/stream). However, for faster and more stable training it is recommended to download all datasets before running the training pipeline; storing them requires approximately 1.46 TB.
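
As a sketch, the data can be opened in streaming mode with the `datasets` library; the split name used here is an assumption, so check the dataset card for the actual layout:

```python
# Hedged sketch: streaming the training data with Hugging Face datasets.
# The split name is an assumption; see the dataset card at
# https://huggingface.co/datasets/nguyenvulebinh/AVYT for the real layout.
from datasets import load_dataset

ds = load_dataset("nguyenvulebinh/AVYT", split="train", streaming=True)

for sample in ds.take(1):
    print(sample.keys())  # expect synchronized audio, video, and transcription fields
```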

### Training Process

The training script is available at `script/train.py`.

**Multi-GPU Distributed Training:**
```sh
# Set environment variables for distributed training
export NCCL_DEBUG=WARN
export OMP_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Run with torchrun for multi-GPU training (using default parameters)
torchrun --nproc_per_node 4 script/train.py

# Run with custom parameters
torchrun --nproc_per_node 4 script/train.py \
    --streaming_dataset \
    --batch_size 6 \
    --max_steps 400000 \
    --gradient_accumulation_steps 2 \
    --save_steps 2000 \
    --eval_steps 2000 \
    --learning_rate 1e-4 \
    --warmup_steps 4000 \
    --checkpoint_name avsr_avhubert_ctcattn \
    --model_name_or_path ./model-bin/avsr_cocktail \
    --output_dir ./model-bin
```

**Model Output:**
The trained model will be saved by default in `model-bin/{checkpoint_name}/` (default: `model-bin/avsr_avhubert_ctcattn/`).

#### Configuration Options

You can customize training parameters using command line arguments:

**Dataset Options:**
- `--streaming_dataset`: Use streaming mode for datasets (default: False)

**Training Parameters:**
- `--batch_size`: Batch size per device (default: 6)
- `--max_steps`: Total training steps (default: 400000)
- `--learning_rate`: Initial learning rate (default: 1e-4)
- `--warmup_steps`: Learning rate warmup steps (default: 4000)
- `--gradient_accumulation_steps`: Gradient accumulation (default: 2)

**Checkpoint and Logging:**
- `--save_steps`: Checkpoint saving frequency (default: 2000)
- `--eval_steps`: Evaluation frequency (default: 2000)
- `--log_interval`: Logging frequency (default: 25)
- `--checkpoint_name`: Name for the checkpoint directory (default: "avsr_avhubert_ctcattn")
- `--resume_from_checkpoint`: Resume training from last checkpoint (default: False)

**Model and Output:**
- `--model_name_or_path`: Path to pretrained model (default: "./model-bin/avsr_cocktail")
- `--output_dir`: Output directory for checkpoints (default: "./model-bin")
- `--report_to`: Logging backend, "wandb" or "none" (default: "none")

**Hardware Requirements:**
- **GPU Memory**: The default training configuration is designed to fit within **24GB GPU memory**
- **Training Time**: With 2x NVIDIA Titan RTX 24GB GPUs, training takes approximately **56 hours per epoch**
- **Convergence**: **200,000 steps** is typically sufficient for model convergence (total batch size 24 = 6 per device × 2 GPUs × 2 gradient-accumulation steps)


## Acknowledgement

This repository is built using the [auto_avsr](https://github.com/mpc001/auto_avsr), [espnet](https://github.com/espnet/espnet), and [avhubert](https://github.com/facebookresearch/av_hubert) repositories.

## Contact 

nguyenvulebinh@gmail.com