---

library_name: transformers
tags:
- automatic-speech-recognition
- audio-visual-speech-recognition
- multimodal
- speech-recognition
- lip-reading
- cocktail-party
- noise-robust
- av-hubert
- transformer
- pytorch
- audio
- video
- english
- lrs2
- voxceleb2
- ctc
- attention
- beam-search
- multi-speaker
- noisy-speech
datasets:
- nguyenvulebinh/AVYT
language:
- en
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---


# AVSRCocktail: Audio-Visual Speech Recognition for Cocktail Party Scenarios

**Official implementation** of "[Cocktail-Party Audio-Visual Speech Recognition](https://arxiv.org/abs/2506.02178)" (Interspeech 2025).

A robust audio-visual speech recognition system designed for multi-speaker environments and noisy cocktail-party scenarios. The model combines lip reading with audio processing to remain accurate in challenging acoustic conditions with background noise and speaker interference.

## Getting Started

### Sections
1. <a href="#install">Installation</a>
2. <a href="#evaluation">Evaluation</a>
3. <a href="#training">Training</a>

## <a id="install">1. Installation</a>

Follow these steps:

```sh
# Clone the baseline code repo
git clone https://github.com/nguyenvulebinh/AVSRCocktail.git
cd AVSRCocktail

# Create Conda environment
conda create --name AVSRCocktail python=3.11
conda activate AVSRCocktail

# Install FFmpeg if it's not already installed
conda install ffmpeg

# Install dependencies
pip install -r requirements.txt
```
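
Once the environment is set up, an optional sanity check (a minimal sketch, not part of the repository) can confirm that FFmpeg and PyTorch are visible from Python:

```python
# Optional environment sanity check (illustrative sketch only).
import shutil

import torch

print("ffmpeg:", shutil.which("ffmpeg"))  # should print the ffmpeg binary path
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```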

## <a id="evaluation">2. Evaluation</a>

The evaluation script `script/evaluation.py` evaluates the AVSR Cocktail model on multiple datasets under various noise conditions and interference scenarios.

### Quick Start

**Basic evaluation on LRS2 test set:**
```sh
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test
```

**Evaluation on AVCocktail dataset:**
```sh
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0
```

### Supported Datasets

#### 1. LRS2 Dataset
Evaluate on the LRS2 dataset with various noise conditions:

**Available test sets:**
- `test`: Clean test set
- `test_snr_n5_interferer_1`: SNR -5dB with 1 interferer
- `test_snr_n5_interferer_2`: SNR -5dB with 2 interferers  
- `test_snr_0_interferer_1`: SNR 0dB with 1 interferer
- `test_snr_0_interferer_2`: SNR 0dB with 2 interferers
- `test_snr_5_interferer_1`: SNR 5dB with 1 interferer
- `test_snr_5_interferer_2`: SNR 5dB with 2 interferers
- `test_snr_10_interferer_1`: SNR 10dB with 1 interferer
- `test_snr_10_interferer_2`: SNR 10dB with 2 interferers
- `*`: Evaluate on all test sets and report average WER

**Example:**
```sh
# Evaluate on clean test set
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test

# Evaluate on noisy conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test_snr_0_interferer_1

# Evaluate on all conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id "*"
```

#### 2. AVCocktail Dataset
Evaluate on the AVCocktail cocktail party dataset:

**Available video sets:**
- `video_0` to `video_50`: Individual video sessions
- `*`: Evaluate on all video sessions and report average WER

The evaluation reports WER for three different chunking strategies:
- `asd_chunk`: Chunks based on Active Speaker Detection
- `fixed_chunk`: Fixed-duration chunks (see the sketch after this list)
- `gold_chunk`: Ground truth optimal chunks
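
For intuition, here is a minimal sketch of what a fixed-duration chunker might look like; the 15-second window matches the default `--max_length`, but this is an illustration, not the repository's exact segmentation code (`asd_chunk` and `gold_chunk` additionally rely on speaker activity and ground-truth boundaries):

```python
def fixed_chunks(total_s: float, max_len_s: float = 15.0):
    """Yield (start, end) windows of at most max_len_s seconds.

    Illustrative sketch of the fixed_chunk strategy; the repository's
    actual segmentation logic may differ.
    """
    t = 0.0
    while t < total_s:
        yield t, min(t + max_len_s, total_s)
        t += max_len_s

# Example: a 40 s recording becomes [(0.0, 15.0), (15.0, 30.0), (30.0, 40.0)]
print(list(fixed_chunks(40.0)))
```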

**Example:**
```sh
# Evaluate on specific video
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0

# Evaluate on all videos
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id "*"
```

### Configuration Options

#### Model Configuration
- `--model_type`: Model architecture to use (use `avsr_cocktail` for the AVSR Cocktail model)
- `--checkpoint_path`: Path to custom model checkpoint (default: uses pretrained `nguyenvulebinh/AVSRCocktail`)
- `--cache_dir`: Directory to cache downloaded models (default: `./model-bin`)

#### Processing Parameters  
- `--max_length`: Maximum length of video segments in seconds (default: 15)
- `--beam_size`: Beam size for beam search decoding (default: 3)

#### Dataset Parameters
- `--dataset_name`: Dataset to evaluate on (`lrs2` or `AVCocktail`)
- `--set_id`: Specific subset to evaluate (see dataset-specific options above)

#### Output Options
- `--verbose`: Enable verbose output during processing
- `--output_dir_name`: Name of output directory for session processing (default: `output`)

### Advanced Usage

**Custom model checkpoint:**
```sh
python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name lrs2 \
    --set_id test \
    --checkpoint_path ./model-bin/my_custom_model \
    --cache_dir ./custom_cache
```

**Optimized inference settings:**
```sh
python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name AVCocktail \
    --set_id "*" \
    --max_length 10 \
    --beam_size 5 \
    --verbose
```

### Output Format

The evaluation script outputs Word Error Rate (WER) scores:

**LRS2 evaluation output:**
```
WER test: 0.1234
```

**AVCocktail evaluation output:**
```
WER video_0 asd_chunk: 0.1234
WER video_0 fixed_chunk: 0.1456  
WER video_0 gold_chunk: 0.1123
```

When using `--set_id "*"`, the script reports both individual and average WER scores across all test conditions.
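
WER is the ratio of word-level edit operations (substitutions, deletions, insertions) to the number of reference words. As a hedged sketch, it can be computed with the `jiwer` package; the evaluation script's internal scorer may differ:

```python
# Illustrative WER computation with jiwer (pip install jiwer).
# Not necessarily what script/evaluation.py uses internally.
from jiwer import wer

reference = "thank you for coming to the party"
hypothesis = "thank you for common to the party"

print(f"WER: {wer(reference, hypothesis):.4f}")  # 1 substitution / 7 words ≈ 0.1429
```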

## <a id="training">3. Training</a>

### Model Architecture

- **Encoder**: Pre-trained AV-HuBERT large model (`nguyenvulebinh/avhubert_encoder_large_noise_pt_noise_ft_433h`)
- **Decoder**: Transformer decoder with CTC/Attention joint training
- **Tokenization**: SentencePiece unigram tokenizer with 5000 vocabulary units
- **Input**: Video frames are cropped to a 96 × 96 mouth region of interest, and audio is sampled at 16 kHz (see the shape sketch below)
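
To make the expected input concrete, here is a shape-level sketch; the 96 × 96 crop and 16 kHz rate come from the list above, while the 25 fps frame rate and grayscale channel are assumptions based on common AV-HuBERT-style preprocessing:

```python
import numpy as np

FPS = 25              # assumption: typical frame rate for LRS2/VoxCeleb2-style data
SAMPLE_RATE = 16_000  # from the model card
CROP = 96             # mouth ROI side length, from the model card

seconds = 2.0
# T x H x W grayscale mouth crops (grayscale is an assumption)
video = np.zeros((int(seconds * FPS), CROP, CROP), dtype=np.uint8)
# 16 kHz mono waveform
audio = np.zeros(int(seconds * SAMPLE_RATE), dtype=np.float32)

print(video.shape, audio.shape)  # (50, 96, 96) (32000,)
```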

### Training Data

The model is trained on multiple large-scale datasets that have been preprocessed and are ready for the training pipeline. All datasets are hosted on Hugging Face at [nguyenvulebinh/AVYT](https://huggingface.co/datasets/nguyenvulebinh/AVYT) and include:

| Dataset | Size |
|---------|------|
| **LRS2** | ~145k samples |
| **VoxCeleb2** | ~540k samples |
| **AVYT** | ~717k samples |
| **AVYT-mix** | ~483k samples |

More details about these datasets can be found in the [Cocktail-Party Audio-Visual Speech Recognition](https://arxiv.org/abs/2506.02178) paper.

**Dataset Features:**
- **Preprocessed**: All audio-visual data is pre-processed and ready for direct input to the training pipeline
- **Multi-modal**: Each sample contains synchronized audio and video (mouth crop) data
- **Labeled**: Text transcriptions for supervised learning

The training pipeline handles dataset loading automatically and reads data in [streaming mode](https://huggingface.co/docs/datasets/stream). However, for faster and more stable training it is recommended to download all datasets before running the training pipeline; storing them requires approximately 1.46 TB.
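
As a sketch, the data can be opened in streaming mode with the `datasets` library; the split name used here is an assumption, so check the dataset card for the actual layout:

```python
# Hedged sketch: streaming the training data with Hugging Face datasets.
# The split name is an assumption; see the dataset card at
# https://huggingface.co/datasets/nguyenvulebinh/AVYT for the real layout.
from datasets import load_dataset

ds = load_dataset("nguyenvulebinh/AVYT", split="train", streaming=True)

for sample in ds.take(1):
    print(sample.keys())  # expect synchronized audio, video, and transcription fields
```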

### Training Process

The training script is available at `script/train.py`.

**Multi-GPU Distributed Training:**
```sh
# Set environment variables for distributed training
export NCCL_DEBUG=WARN
export OMP_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Run with torchrun for multi-GPU training (using default parameters)
torchrun --nproc_per_node 4 script/train.py

# Run with custom parameters
torchrun --nproc_per_node 4 script/train.py \
    --streaming_dataset \
    --batch_size 6 \
    --max_steps 400000 \
    --gradient_accumulation_steps 2 \
    --save_steps 2000 \
    --eval_steps 2000 \
    --learning_rate 1e-4 \
    --warmup_steps 4000 \
    --checkpoint_name avsr_avhubert_ctcattn \
    --model_name_or_path ./model-bin/avsr_cocktail \
    --output_dir ./model-bin
```

**Model Output:**
The trained model will be saved by default in `model-bin/{checkpoint_name}/` (default: `model-bin/avsr_avhubert_ctcattn/`).

#### Configuration Options

You can customize training parameters using command line arguments:

**Dataset Options:**
- `--streaming_dataset`: Use streaming mode for datasets (default: False)

**Training Parameters:**
- `--batch_size`: Batch size per device (default: 6)
- `--max_steps`: Total training steps (default: 400000)
- `--learning_rate`: Initial learning rate (default: 1e-4)
- `--warmup_steps`: Learning rate warmup steps (default: 4000)
- `--gradient_accumulation_steps`: Gradient accumulation (default: 2)

**Checkpoint and Logging:**
- `--save_steps`: Checkpoint saving frequency (default: 2000)
- `--eval_steps`: Evaluation frequency (default: 2000)
- `--log_interval`: Logging frequency (default: 25)
- `--checkpoint_name`: Name for the checkpoint directory (default: "avsr_avhubert_ctcattn")
- `--resume_from_checkpoint`: Resume training from last checkpoint (default: False)

**Model and Output:**
- `--model_name_or_path`: Path to pretrained model (default: "./model-bin/avsr_cocktail")
- `--output_dir`: Output directory for checkpoints (default: "./model-bin")
- `--report_to`: Logging backend, "wandb" or "none" (default: "none")

**Hardware Requirements:**
- **GPU Memory**: The default training configuration is designed to fit within **24GB GPU memory**
- **Training Time**: With 2x NVIDIA Titan RTX 24GB GPUs, training takes approximately **56 hours per epoch**
- **Convergence**: **200,000 steps** is typically sufficient for model convergence (total batch size 24 = 6 per device × 2 GPUs × 2 gradient-accumulation steps)


## Acknowledgement

This repository is built using the [auto_avsr](https://github.com/mpc001/auto_avsr), [espnet](https://github.com/espnet/espnet), and [avhubert](https://github.com/facebookresearch/av_hubert) repositories.

## Contact 

nguyenvulebinh@gmail.com