SimToken: A Simple Baseline for Referring Audio-Visual Segmentation

📰 News

🔥2026.1.18: Our paper got accepted to ICASSP 2026! Thanks to all co-authors and the anonymous reviewers🎉🎉

⚙️ Setup

Datasets

Download the official Ref-AVS Bench dataset from here and organize it as follows:

./REFAVS/data 
    - /media 
    - /gt_mask 
    - /metadata.csv
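
Before running any of the scripts, it can help to confirm that the dataset root matches the layout above. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and it assumes that `media` and `gt_mask` are directories and `metadata.csv` is a regular file.

```python
from pathlib import Path

# Entries listed in the layout above. True marks an entry assumed to be a
# directory, False an entry assumed to be a regular file.
EXPECTED_ENTRIES = {"media": True, "gt_mask": True, "metadata.csv": False}


def missing_entries(data_dir):
    """Return the expected entries that are absent or of the wrong kind."""
    root = Path(data_dir)
    missing = []
    for name, is_dir in EXPECTED_ENTRIES.items():
        path = root / name
        found = path.is_dir() if is_dir else path.is_file()
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_entries("./REFAVS/data")
    print("layout OK" if not problems else f"missing: {problems}")
```

An empty list means all three entries were found where the layout expects them.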

Pretrained Backbones

Download sam_vit_h_4b8939.pth and place it in ./models/segment_anything

Checkpoints

Download our pretrained SimToken checkpoint.

Core Requirements

This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution.

numpy, pandas, matplotlib, opencv
einops, timm
sentencepiece
transformers, peft

Newer versions of transformers and peft may introduce API changes or naming/registration conflicts that trigger runtime errors in this project (e.g., in custom model/config registration).
To avoid these compatibility issues, we recommend pinning both packages to the versions used during our development:

transformers==4.30.2
peft==0.2.0
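
To check whether an existing environment already matches these pins, a short script along the following lines may be useful. It is a sketch rather than part of the repository: the function names are hypothetical, and it relies only on the standard-library `importlib.metadata` module (Python 3.8+).

```python
from importlib.metadata import PackageNotFoundError, version

# Pins quoted from the text above.
PINNED = {"transformers": "4.30.2", "peft": "0.2.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_mismatches(pins=PINNED, lookup=installed_version):
    """Map each package that differs from its pin to (found, expected)."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (found, expected)
    return mismatches


if __name__ == "__main__":
    for package, (found, expected) in version_mismatches().items():
        print(f"{package}: found {found}, expected {expected}")
```

The script prints nothing when both packages match; a `found` value of `None` means the package is not installed.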

We also provide a complete requirements.txt for reference and easier reproduction:

pip install -r requirements.txt

📌 Getting Started

Preparation

We recommend running the following commands to pre-extract the audio features and the SAM-compatible visual features:

python save_audio_feats.py --data_dir 'path/to/data'
python save_sam_feats.py --data_dir 'path/to/data'
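
If you prefer to launch both extraction steps from one place, the two commands above can be wrapped in a small Python driver. This is only a sketch: `build_prep_commands` and `run_prep` are hypothetical helpers, the script names and the `--data_dir` flag are copied from the commands above, and it assumes the scripts are run from the repository root in the order listed.

```python
import subprocess
import sys

# Script names and the --data_dir flag are taken from the commands above.
PREP_SCRIPTS = ("save_audio_feats.py", "save_sam_feats.py")


def build_prep_commands(data_dir, python=sys.executable):
    """Return one argument list per preprocessing script, in the listed order."""
    return [[python, script, "--data_dir", data_dir] for script in PREP_SCRIPTS]


def run_prep(data_dir):
    """Run the scripts one after another, stopping at the first failure."""
    for command in build_prep_commands(data_dir):
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run(command, check=True)
```

Calling `run_prep('path/to/data')` runs both scripts with the current interpreter. Passing arguments as a list avoids shell quoting issues when the data path contains spaces.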

Train

To train our model on Ref-AVS Bench:

python -W ignore train.py --name 'xxx' \
    --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
    --vision_tower 'openai/clip-vit-large-patch14' \
    --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
    --data_dir 'path/to/data' \
    --log_root 'path/to/log_root' \
    --checkpoint_root 'path/to/checkpoints_root'

Test

To test our pretrained SimToken:

python -W ignore load_model.py --saved_model 'path/to/checkpoint.pth' \
    --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
    --vision_tower 'openai/clip-vit-large-patch14' \
    --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
    --data_dir 'path/to/data' \
    --visualization_root 'path/to/visualization_root'

Paper

SimToken: A Simple Baseline for Referring Audio-Visual Segmentation (arXiv:2509.17537, published Sep 23, 2025)