SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
News
2026.1.18: Our paper got accepted to ICASSP 2026! Thanks to all co-authors and the anonymous reviewers.
Setup
Datasets
Download the official Ref-AVS Bench dataset from here and organize it as follows:
./REFAVS/data
- /media
- /gt_mask
- /metadata.csv
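Before running the feature-extraction scripts, it can help to confirm that the three entries listed above are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and it only checks that each entry exists, not that its contents are valid.

```python
from pathlib import Path

# Entries the layout above lists under ./REFAVS/data.
EXPECTED_ENTRIES = ("media", "gt_mask", "metadata.csv")


def missing_entries(data_root):
    """Return the expected entries that are absent from data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # The path mirrors the layout above; adjust it to your own checkout.
    missing = missing_entries("./REFAVS/data")
    if missing:
        print("Missing from ./REFAVS/data:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```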
Pretrained Backbones
Download sam_vit_h_4b8939.pth (the SAM ViT-H checkpoint) and put it in ./models/segment_anything
Checkpoints
Download our pretrained SimToken checkpoint.
Core Requirements
This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution.
numpy, pandas, matplotlib, opencv, einops, timm, sentencepiece, transformers, peft
Newer versions of transformers and peft may introduce API changes or naming/registration conflicts that can trigger runtime errors in this project (e.g., custom model/config registration).
To avoid such compatibility issues, we recommend avoiding overly recent versions and pinning the two packages to the versions used during our development:
transformers==4.30.2
peft==0.2.0
We also provide a complete requirements.txt for reference and easier reproduction:
pip install -r requirements.txt
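After installation, one way to confirm that the environment matches the two pinned versions is to query the installed package metadata. This is a rough sketch using the standard-library `importlib.metadata` module; the helper name `find_mismatches` and its `lookup` parameter are hypothetical and exist only for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins recommended in this README.
PINNED = {"transformers": "4.30.2", "peft": "0.2.0"}


def find_mismatches(pinned, lookup=version):
    """Return {package: installed version or None} for packages that differ from the pin."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches(PINNED).items():
        found = installed or "not installed"
        print(f"{name}: expected {PINNED[name]}, found {found}")
```

An empty result means both packages match the pins; any printed line names a package to reinstall at the pinned version.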
Getting Started
Preparation
We recommend running the following code to pre-extract audio features and visual features compatible with SAM:
python save_audio_feats.py --data_dir 'path/to/data'
python save_sam_feats.py --data_dir 'path/to/data'
Train
To train our model on Ref-AVS Bench:
python -W ignore train.py --name 'xxx' \
--vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
--vision_tower 'openai/clip-vit-large-patch14' \
--mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
--data_dir 'path/to/data' \
--log_root 'path/to/log_root' \
--checkpoint_root 'path/to/checkpoints_root'
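Multi-line shell commands are sensitive to spacing around the trailing backslashes, so some users may prefer to assemble the same invocation as an argument list in Python. The sketch below is an assumption-laden convenience, not part of the repository: `build_train_command` is a hypothetical helper, it only reuses the flags shown above, and `shlex.join` requires Python 3.8 or later.

```python
import shlex


def build_train_command(name, sam_checkpoint, data_dir, log_root, checkpoint_root):
    """Return the train.py invocation shown above as a list of arguments."""
    return [
        "python", "-W", "ignore", "train.py",
        "--name", name,
        "--vision_pretrained", sam_checkpoint,
        "--vision_tower", "openai/clip-vit-large-patch14",
        "--mllm", "Chat-UniVi/Chat-UniVi-7B-v1.5",
        "--data_dir", data_dir,
        "--log_root", log_root,
        "--checkpoint_root", checkpoint_root,
    ]


if __name__ == "__main__":
    command = build_train_command(
        name="xxx",
        sam_checkpoint="path/to/segment_anything/sam_vit_h_4b8939.pth",
        data_dir="path/to/data",
        log_root="path/to/log_root",
        checkpoint_root="path/to/checkpoints_root",
    )
    # Print a shell-safe, single-line version of the command for inspection.
    print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` to launch training without going through shell quoting.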
Test
To test our pretrained SimToken:
python -W ignore load_model.py --saved_model 'path/to/checkpoint.pth' \
--vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
--vision_tower 'openai/clip-vit-large-patch14' \
--mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
--data_dir 'path/to/data' \
--visualization_root 'path/to/visualization_root'