---
library_name: transformers
tags:
- automatic-speech-recognition
- audio-visual-speech-recognition
- multimodal
- speech-recognition
- lip-reading
- cocktail-party
- noise-robust
- av-hubert
- transformer
- pytorch
- audio
- video
- english
- lrs2
- voxceleb2
- ctc
- attention
- beam-search
- multi-speaker
- noisy-speech
datasets:
- nguyenvulebinh/AVYT
language:
- en
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# AVSRCocktail: Audio-Visual Speech Recognition for Cocktail Party Scenarios

**Official implementation** of "[Cocktail-Party Audio-Visual Speech Recognition](https://arxiv.org/abs/2506.02178)" (Interspeech 2025).

A robust audio-visual speech recognition system designed for multi-speaker environments and noisy cocktail party scenarios. The model combines lip reading and audio processing to achieve strong performance in challenging acoustic conditions with background noise and speaker interference.

## Getting Started

### Sections

1. <a href="#install">Installation</a>
2. <a href="#evaluation">Evaluation</a>
3. <a href="#training">Training</a>

## <a id="install">1. Installation</a>

Follow these steps:

```sh
# Clone the baseline code repo
git clone https://github.com/nguyenvulebinh/AVSRCocktail.git
cd AVSRCocktail

# Create Conda environment
conda create --name AVSRCocktail python=3.11
conda activate AVSRCocktail

# Install FFmpeg, if it's not already installed
conda install ffmpeg

# Install dependencies
pip install -r requirements.txt
```

## <a id="evaluation">2. Evaluation</a> |
|
|
|
The evaluation script `script/evaluation.py` provides comprehensive evaluation capabilities for the AVSR Cocktail model on multiple datasets with various noise conditions and interference scenarios. |
|
|
|
### Quick Start

**Basic evaluation on LRS2 test set:**
```sh
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test
```

**Evaluation on AVCocktail dataset:**
```sh
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0
```

### Supported Datasets

#### 1. LRS2 Dataset
Evaluate on the LRS2 dataset with various noise conditions:

**Available test sets:**
- `test`: Clean test set
- `test_snr_n5_interferer_1`: SNR -5 dB with 1 interferer
- `test_snr_n5_interferer_2`: SNR -5 dB with 2 interferers
- `test_snr_0_interferer_1`: SNR 0 dB with 1 interferer
- `test_snr_0_interferer_2`: SNR 0 dB with 2 interferers
- `test_snr_5_interferer_1`: SNR 5 dB with 1 interferer
- `test_snr_5_interferer_2`: SNR 5 dB with 2 interferers
- `test_snr_10_interferer_1`: SNR 10 dB with 1 interferer
- `test_snr_10_interferer_2`: SNR 10 dB with 2 interferers
- `*`: Evaluate on all test sets and report average WER

**Example:**
```sh
# Evaluate on clean test set
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test

# Evaluate on noisy conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test_snr_0_interferer_1

# Evaluate on all conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id "*"
```

#### 2. AVCocktail Dataset
Evaluate on the AVCocktail cocktail party dataset:

**Available video sets:**
- `video_0` to `video_50`: Individual video sessions
- `*`: Evaluate on all video sessions and report average WER

The evaluation reports WER for three different chunking strategies:
- `asd_chunk`: Chunks based on Active Speaker Detection
- `fixed_chunk`: Fixed-duration chunks
- `gold_chunk`: Ground-truth optimal chunks

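For intuition, the `fixed_chunk` strategy above corresponds to cutting each session into consecutive equal-length windows. A minimal sketch of that idea (not the script's exact implementation; the hypothetical `chunk_s` default simply mirrors the 15-second `--max_length`):

```python
def fixed_chunks(duration_s: float, chunk_s: float = 15.0):
    """Split a recording of duration_s seconds into consecutive fixed-length windows."""
    chunks = []
    start = 0.0
    while start < duration_s:
        chunks.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return chunks

# A 40-second session becomes [(0.0, 15.0), (15.0, 30.0), (30.0, 40.0)]
print(fixed_chunks(40.0))
```
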
**Example:**
```sh
# Evaluate on specific video
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0

# Evaluate on all videos
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id "*"
```

### Configuration Options

#### Model Configuration
- `--model_type`: Model architecture to use (use `avsr_cocktail` for the AVSRCocktail model)
- `--checkpoint_path`: Path to custom model checkpoint (default: uses pretrained `nguyenvulebinh/AVSRCocktail`)
- `--cache_dir`: Directory to cache downloaded models (default: `./model-bin`)

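The evaluation script downloads the default checkpoint on its own. If you prefer to fetch it yourself and keep a local copy, one option is `huggingface_hub` (a sketch; whether `--checkpoint_path` accepts this snapshot folder directly depends on the script):

```python
from huggingface_hub import snapshot_download

# Download the pretrained checkpoint into the script's default cache directory.
local_path = snapshot_download(
    repo_id="nguyenvulebinh/AVSRCocktail",
    cache_dir="./model-bin",
)
print(local_path)  # a local folder that could be passed to --checkpoint_path
```
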
#### Processing Parameters
- `--max_length`: Maximum length of video segments in seconds (default: 15)
- `--beam_size`: Beam size for beam search decoding (default: 3)

#### Dataset Parameters
- `--dataset_name`: Dataset to evaluate on (`lrs2` or `AVCocktail`)
- `--set_id`: Specific subset to evaluate (see dataset-specific options above)

#### Output Options
- `--verbose`: Enable verbose output during processing
- `--output_dir_name`: Name of output directory for session processing (default: `output`)

### Advanced Usage

**Custom model checkpoint:**
```sh
python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name lrs2 \
    --set_id test \
    --checkpoint_path ./model-bin/my_custom_model \
    --cache_dir ./custom_cache
```

**Optimized inference settings:**
```sh
python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name AVCocktail \
    --set_id "*" \
    --max_length 10 \
    --beam_size 5 \
    --verbose
```

### Output Format

The evaluation script outputs Word Error Rate (WER) scores:

**LRS2 evaluation output:**
```
WER test: 0.1234
```

**AVCocktail evaluation output:**
```
WER video_0 asd_chunk: 0.1234
WER video_0 fixed_chunk: 0.1456
WER video_0 gold_chunk: 0.1123
```

When using `--set_id "*"`, the script reports both individual and average WER scores across all test conditions.

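WER is the standard word error rate: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. To score your own transcripts the same way, a quick sketch with the `jiwer` package (the evaluation script's exact text normalization may differ):

```python
import jiwer

reference = "thank you for joining the meeting"
hypothesis = "thank you joining the meetings"

# One deletion ("for") and one substitution ("meeting" -> "meetings")
# over six reference words: WER = 2 / 6 ≈ 0.33
print(jiwer.wer(reference, hypothesis))
```
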
## <a id="training">3. Training</a> |
|
|
|
### Model Architecture |
|
|
|
- **Encoder**: Pre-trained AV-HuBERT large model (`nguyenvulebinh/avhubert_encoder_large_noise_pt_noise_ft_433h`) |
|
- **Decoder**: Transformer decoder with CTC/Attention joint training |
|
- **Tokenization**: SentencePiece unigram tokenizer with 5000 vocabulary units |
|
- **Input**: Video frames are cropped to the mouth region of interest using a 96 × 96 bounding box, while the audio is sampled at a 16 kHz rate |
|
|
|
### Training Data

The model is trained on multiple large-scale datasets that have been preprocessed and are ready for the training pipeline. All datasets are hosted on Hugging Face at [nguyenvulebinh/AVYT](https://huggingface.co/datasets/nguyenvulebinh/AVYT) and include:

| Dataset | Size |
|---------|------|
| **LRS2** | ~145k samples |
| **VoxCeleb2** | ~540k samples |
| **AVYT** | ~717k samples |
| **AVYT-mix** | ~483k samples |

More details about these datasets can be found in the [Cocktail-Party Audio-Visual Speech Recognition](https://arxiv.org/abs/2506.02178) paper.

**Dataset Features:**
- **Preprocessed**: All audio-visual data is preprocessed and ready for direct input to the training pipeline
- **Multi-modal**: Each sample contains synchronized audio and video (mouth crop) data
- **Labeled**: Text transcriptions are included for supervised learning

The training pipeline loads the datasets automatically in [streaming mode](https://huggingface.co/docs/datasets/stream). However, to make training faster and more stable, it is recommended to download all datasets before running the training pipeline; storing them requires approximately 1.46 TB.

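A minimal sketch of streaming the data with the `datasets` library (illustrative only; the split name and sample fields here are assumptions, and the training script handles the actual loading):

```python
from datasets import load_dataset

# Stream samples instead of downloading the full ~1.46 TB up front.
# The "train" split name is an assumption for illustration.
dataset = load_dataset("nguyenvulebinh/AVYT", split="train", streaming=True)

sample = next(iter(dataset))
print(sample.keys())  # inspect the available fields
```
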
### Training Process

The training script is available at `script/train.py`.

**Multi-GPU Distributed Training:**
```sh
# Set environment variables for distributed training
export NCCL_DEBUG=WARN
export OMP_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Run with torchrun for multi-GPU training (using default parameters)
torchrun --nproc_per_node 4 script/train.py

# Run with custom parameters
torchrun --nproc_per_node 4 script/train.py \
    --streaming_dataset \
    --batch_size 6 \
    --max_steps 400000 \
    --gradient_accumulation_steps 2 \
    --save_steps 2000 \
    --eval_steps 2000 \
    --learning_rate 1e-4 \
    --warmup_steps 4000 \
    --checkpoint_name avsr_avhubert_ctcattn \
    --model_name_or_path ./model-bin/avsr_cocktail \
    --output_dir ./model-bin
```

**Model Output:**
The trained model is saved by default to `model-bin/{checkpoint_name}/` (default: `model-bin/avsr_avhubert_ctcattn/`).

#### Configuration Options

You can customize training parameters using command-line arguments:

**Dataset Options:**
- `--streaming_dataset`: Use streaming mode for datasets (default: False)

**Training Parameters:**
- `--batch_size`: Batch size per device (default: 6)
- `--max_steps`: Total training steps (default: 400000)
- `--learning_rate`: Initial learning rate (default: 1e-4)
- `--warmup_steps`: Learning rate warmup steps (default: 4000)
- `--gradient_accumulation_steps`: Gradient accumulation steps (default: 2)

**Checkpoint and Logging:**
- `--save_steps`: Checkpoint saving frequency (default: 2000)
- `--eval_steps`: Evaluation frequency (default: 2000)
- `--log_interval`: Logging frequency (default: 25)
- `--checkpoint_name`: Name of the checkpoint directory (default: `avsr_avhubert_ctcattn`)
- `--resume_from_checkpoint`: Resume training from the last checkpoint (default: False)

**Model and Output:**
- `--model_name_or_path`: Path to the pretrained model (default: `./model-bin/avsr_cocktail`)
- `--output_dir`: Output directory for checkpoints (default: `./model-bin`)
- `--report_to`: Logging backend, `wandb` or `none` (default: `none`)

**Hardware Requirements:**
- **GPU Memory**: The default training configuration is designed to fit within **24 GB of GPU memory**
- **Training Time**: With 2x NVIDIA Titan RTX 24 GB GPUs, training takes approximately **56 hours per epoch**
- **Convergence**: **200,000 steps** (total batch size 24) is typically sufficient for model convergence

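For reference, the effective (total) batch size equals `num_gpus × batch_size × gradient_accumulation_steps`; with the 2-GPU setup above and the defaults (batch size 6, gradient accumulation 2), that is 2 × 6 × 2 = 24, matching the total batch size quoted for convergence.
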
## Acknowledgement

This repository is built using the [auto_avsr](https://github.com/mpc001/auto_avsr), [espnet](https://github.com/espnet/espnet), and [avhubert](https://github.com/facebookresearch/av_hubert) repositories.

## Contact

nguyenvulebinh@gmail.com