---
library_name: transformers
tags:
- automatic-speech-recognition
- audio-visual-speech-recognition
- multimodal
- speech-recognition
- lip-reading
- cocktail-party
- noise-robust
- av-hubert
- transformer
- pytorch
- audio
- video
- english
- lrs2
- voxceleb2
- ctc
- attention
- beam-search
- multi-speaker
- noisy-speech
datasets:
- nguyenvulebinh/AVYT
language:
- en
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---
# AVSRCocktail: Audio-Visual Speech Recognition for Cocktail Party Scenarios
**Official implementation** of "[Cocktail-Party Audio-Visual Speech Recognition](https://arxiv.org/abs/2506.02178)" (Interspeech 2025).
A robust audio-visual speech recognition system designed for multi-speaker environments and noisy cocktail party scenarios. The model combines lip reading and audio processing to remain accurate in challenging acoustic conditions with background noise and speaker interference.
## Getting Started
### Sections
1. <a href="#install">Installation</a>
2. <a href="#evaluation">Evaluation</a>
3. <a href="#training">Training</a>
## <a id="install">1. Installation </a>
Follow these steps:
```sh
# Clone the baseline code repo
git clone https://github.com/nguyenvulebinh/AVSRCocktail.git
cd AVSRCocktail
# Create Conda environment
conda create --name AVSRCocktail python=3.11
conda activate AVSRCocktail
# Install FFmpeg, if it's not already installed.
conda install ffmpeg
# Install dependencies
pip install -r requirements.txt
```
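To quickly confirm the environment is working, you can run a minimal sanity check (it only verifies that PyTorch imports and FFmpeg is on the PATH, not the model itself):
```sh
# Check that PyTorch is installed and whether a CUDA GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Check that FFmpeg is callable
ffmpeg -version
```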
## <a id="evaluation">2. Evaluation</a>
The evaluation script `script/evaluation.py` provides comprehensive evaluation capabilities for the AVSR Cocktail model on multiple datasets with various noise conditions and interference scenarios.
### Quick Start
**Basic evaluation on LRS2 test set:**
```sh
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test
```
**Evaluation on AVCocktail dataset:**
```sh
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0
```
### Supported Datasets
#### 1. LRS2 Dataset
Evaluate on the LRS2 dataset with various noise conditions:
**Available test sets:**
- `test`: Clean test set
- `test_snr_n5_interferer_1`: SNR -5dB with 1 interferer
- `test_snr_n5_interferer_2`: SNR -5dB with 2 interferers
- `test_snr_0_interferer_1`: SNR 0dB with 1 interferer
- `test_snr_0_interferer_2`: SNR 0dB with 2 interferers
- `test_snr_5_interferer_1`: SNR 5dB with 1 interferer
- `test_snr_5_interferer_2`: SNR 5dB with 2 interferers
- `test_snr_10_interferer_1`: SNR 10dB with 1 interferer
- `test_snr_10_interferer_2`: SNR 10dB with 2 interferers
- `*`: Evaluate on all test sets and report average WER
**Example:**
```sh
# Evaluate on clean test set
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test
# Evaluate on noisy conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test_snr_0_interferer_1
# Evaluate on all conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id "*"
```
#### 2. AVCocktail Dataset
Evaluate on the AVCocktail cocktail party dataset:
**Available video sets:**
- `video_0` to `video_50`: Individual video sessions
- `*`: Evaluate on all video sessions and report average WER
The evaluation reports WER for three different chunking strategies:
- `asd_chunk`: Chunks based on Active Speaker Detection
- `fixed_chunk`: Fixed-duration chunks
- `gold_chunk`: Ground truth optimal chunks
**Example:**
```sh
# Evaluate on specific video
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0
# Evaluate on all videos
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id "*"
```
### Configuration Options
#### Model Configuration
- `--model_type`: Model architecture to use (use `avsr_cocktail` for the AVSR Cocktail model)
- `--checkpoint_path`: Path to custom model checkpoint (default: uses pretrained `nguyenvulebinh/AVSRCocktail`)
- `--cache_dir`: Directory to cache downloaded models (default: `./model-bin`)
#### Processing Parameters
- `--max_length`: Maximum length of video segments in seconds (default: 15)
- `--beam_size`: Beam size for beam search decoding (default: 3)
#### Dataset Parameters
- `--dataset_name`: Dataset to evaluate on (`lrs2` or `AVCocktail`)
- `--set_id`: Specific subset to evaluate (see dataset-specific options above)
#### Output Options
- `--verbose`: Enable verbose output during processing
- `--output_dir_name`: Name of output directory for session processing (default: `output`)
### Advanced Usage
**Custom model checkpoint:**
```sh
python script/evaluation.py \
--model_type avsr_cocktail \
--dataset_name lrs2 \
--set_id test \
--checkpoint_path ./model-bin/my_custom_model \
--cache_dir ./custom_cache
```
**Optimized inference settings:**
```sh
python script/evaluation.py \
--model_type avsr_cocktail \
--dataset_name AVCocktail \
--set_id "*" \
--max_length 10 \
--beam_size 5 \
--verbose
```
### Output Format
The evaluation script outputs Word Error Rate (WER) scores:
**LRS2 evaluation output:**
```
WER test: 0.1234
```
**AVCocktail evaluation output:**
```
WER video_0 asd_chunk: 0.1234
WER video_0 fixed_chunk: 0.1456
WER video_0 gold_chunk: 0.1123
```
When using `--set_id "*"`, the script reports both individual and average WER scores across all test conditions.
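When sweeping all conditions it can help to keep the per-condition scores for later comparison; one simple option (not part of the evaluation script itself) is to capture the console output:
```sh
# Evaluate every LRS2 condition and keep the per-set WER lines in a log file
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id "*" 2>&1 | tee lrs2_all_conditions.log
```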
## <a id="training">3. Training</a>
### Model Architecture
- **Encoder**: Pre-trained AV-HuBERT large model (`nguyenvulebinh/avhubert_encoder_large_noise_pt_noise_ft_433h`)
- **Decoder**: Transformer decoder with CTC/Attention joint training
- **Tokenization**: SentencePiece unigram tokenizer with 5000 vocabulary units
- **Input**: Video frames are cropped to a 96 × 96 mouth region of interest, and audio is sampled at 16 kHz (see the audio resampling sketch below)
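The 96 × 96 mouth crop is produced by the pipeline's face and landmark preprocessing, but the 16 kHz audio requirement can be enforced up front with FFmpeg. A minimal sketch, where `input.mp4` and `output.mp4` are placeholder file names:
```sh
# Re-encode the audio track to 16 kHz mono while copying the video stream unchanged
ffmpeg -i input.mp4 -c:v copy -ac 1 -ar 16000 output.mp4
```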
### Training Data
The model is trained on multiple large-scale datasets that have been preprocessed and are ready for the training pipeline. All datasets are hosted on Hugging Face at [nguyenvulebinh/AVYT](https://huggingface.co/datasets/nguyenvulebinh/AVYT) and include:
| Dataset | Size |
|---------|------|
| **LRS2** | ~145k samples |
| **VoxCeleb2** | ~540k samples |
| **AVYT** | ~717k samples |
| **AVYT-mix** | ~483k samples |
Further details about these datasets can be found in the [Cocktail-Party Audio-Visual Speech Recognition](https://arxiv.org/abs/2506.02178) paper.
**Dataset Features:**
- **Preprocessed**: All audio-visual data is pre-processed and ready for direct input to the training pipeline
- **Multi-modal**: Each sample contains synchronized audio and video (mouth crop) data
- **Labeled**: Text transcriptions for supervised learning
The training pipeline handles dataset loading automatically and can read the data in [streaming mode](https://huggingface.co/docs/datasets/stream). However, for faster and more stable training it is recommended to download all datasets before starting the training pipeline; storing them requires approximately 1.46 TB.
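One way to pre-download the data is the Hugging Face CLI that ships with recent versions of `huggingface_hub`; this is a sketch, and the local directory is only an example path:
```sh
# Download the full AVYT dataset repository (~1.46 TB) to a local directory
huggingface-cli download nguyenvulebinh/AVYT --repo-type dataset --local-dir ./data-bin/AVYT
```
Adjust `--local-dir` to wherever your training setup expects the data.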
### Training Process
The training script is available at `script/train.py`.
**Multi-GPU Distributed Training:**
```sh
# Set environment variables for distributed training
export NCCL_DEBUG=WARN
export OMP_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Run with torchrun for multi-GPU training (using default parameters)
torchrun --nproc_per_node 4 script/train.py
# Run with custom parameters
torchrun --nproc_per_node 4 script/train.py \
--streaming_dataset \
--batch_size 6 \
--max_steps 400000 \
--gradient_accumulation_steps 2 \
--save_steps 2000 \
--eval_steps 2000 \
--learning_rate 1e-4 \
--warmup_steps 4000 \
--checkpoint_name avsr_avhubert_ctcattn \
--model_name_or_path ./model-bin/avsr_cocktail \
--output_dir ./model-bin
```
**Model Output:**
The trained model will be saved by default in `model-bin/{checkpoint_name}/` (default: `model-bin/avsr_avhubert_ctcattn/`).
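Once training finishes, the checkpoint directory can be passed to the evaluation script via `--checkpoint_path` (assuming the saved checkpoint is in the format the evaluation script loads), for example:
```sh
# Evaluate the freshly trained checkpoint on the clean LRS2 test set
python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name lrs2 \
    --set_id test \
    --checkpoint_path ./model-bin/avsr_avhubert_ctcattn
```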
#### Configuration Options
You can customize training parameters using command line arguments:
**Dataset Options:**
- `--streaming_dataset`: Use streaming mode for datasets (default: False)
**Training Parameters:**
- `--batch_size`: Batch size per device (default: 6)
- `--max_steps`: Total training steps (default: 400000)
- `--learning_rate`: Initial learning rate (default: 1e-4)
- `--warmup_steps`: Learning rate warmup steps (default: 4000)
- `--gradient_accumulation_steps`: Gradient accumulation (default: 2)
**Checkpoint and Logging:**
- `--save_steps`: Checkpoint saving frequency (default: 2000)
- `--eval_steps`: Evaluation frequency (default: 2000)
- `--log_interval`: Logging frequency (default: 25)
- `--checkpoint_name`: Name for the checkpoint directory (default: "avsr_avhubert_ctcattn")
- `--resume_from_checkpoint`: Resume training from last checkpoint (default: False)
**Model and Output:**
- `--model_name_or_path`: Path to pretrained model (default: "./model-bin/avsr_cocktail")
- `--output_dir`: Output directory for checkpoints (default: "./model-bin")
- `--report_to`: Logging backend, "wandb" or "none" (default: "none"); see the example after this list for enabling wandb logging
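For example, to resume an interrupted run and report metrics to Weights & Biases (a sketch combining the flags above; it assumes `--resume_from_checkpoint` is a boolean flag, as its default suggests, and `wandb login` requires a W&B account):
```sh
# One-time setup for Weights & Biases logging
pip install wandb
wandb login
# Resume the previous run and report metrics to wandb
torchrun --nproc_per_node 4 script/train.py \
    --resume_from_checkpoint \
    --checkpoint_name avsr_avhubert_ctcattn \
    --report_to wandb
```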
**Hardware Requirements:**
- **GPU Memory**: The default training configuration is designed to fit within **24GB GPU memory**
- **Training Time**: With 2x NVIDIA Titan RTX 24GB GPUs, training takes approximately **56 hours per epoch**
- **Convergence**: **200,000 steps** at a total batch size of 24 (6 per device × 2 gradient accumulation steps × 2 GPUs) is typically sufficient for model convergence
## Acknowledgement
This repository is built using the [auto_avsr](https://github.com/mpc001/auto_avsr), [espnet](https://github.com/espnet/espnet), and [avhubert](https://github.com/facebookresearch/av_hubert) repositories.
## Contact
nguyenvulebinh@gmail.com