
SimToken: A Simple Baseline for Referring Audio-Visual Segmentation



📰 News

🔥 2026.1.18: Our paper was accepted to ICASSP 2026! Thanks to all co-authors and the anonymous reviewers 🎉🎉


βš™οΈ Setup

Datasets

Download the official Ref-AVS Bench dataset from here and organize it as follows:

./REFAVS/data 
    - /media 
    - /gt_mask 
    - /metadata.csv 
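As a quick sanity check, the snippet below verifies that the three expected entries exist. This is a minimal sketch that assumes the root path shown above; adjust it if your data lives elsewhere.

```shell
# Check that the expected Ref-AVS Bench entries exist under ./REFAVS/data
for entry in media gt_mask metadata.csv; do
    if [ -e "./REFAVS/data/$entry" ]; then
        echo "found: $entry"
    else
        echo "missing: $entry"
    fi
done
```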

Pretrained Backbones

Download sam_vit_h_4b8939.pth and place it in ./models/segment_anything.
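For example, the checkpoint can be fetched directly from the official release (URL taken from the facebookresearch/segment-anything README); note the file is large, roughly 2.4 GB:

```shell
# Create the target directory and download the SAM ViT-H checkpoint
mkdir -p ./models/segment_anything
wget -P ./models/segment_anything \
    https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```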

Checkpoints

Download our pretrained SimToken checkpoint.

Core Requirements

This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution.

  • numpy, pandas, matplotlib, opencv
  • einops, timm
  • sentencepiece
  • transformers, peft

Newer versions of transformers and peft may introduce API changes or naming/registration conflicts that can trigger runtime errors in this project (e.g., custom model/config registration).
To avoid such compatibility issues, we recommend avoiding overly recent versions and pinning the two packages to the versions used during our development:

  • transformers==4.30.2
  • peft==0.2.0
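After installing, you can confirm the pins took effect. This is a minimal check that assumes pip is on your PATH:

```shell
# Print the installed versions of the two pinned packages, if present
pip list 2>/dev/null | grep -iE '^(transformers|peft) ' \
    || echo "transformers/peft not installed"
```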

We also provide a complete requirements.txt for reference and easier reproduction:

pip install -r requirements.txt

📌 Getting Started

Preparation

We recommend running the following scripts to pre-extract audio features and SAM-compatible visual features:

python save_audio_feats.py --data_dir 'path/to/data'
python save_sam_feats.py --data_dir 'path/to/data'

Train

To train our model on Ref-AVS Bench:

python -W ignore train.py --name 'xxx' \
    --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
    --vision_tower 'openai/clip-vit-large-patch14' \
    --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
    --data_dir 'path/to/data' \
    --log_root 'path/to/log_root' \
    --checkpoint_root 'path/to/checkpoints_root'

Test

To test with our pretrained SimToken checkpoint:

python -W ignore load_model.py  --saved_model 'path/to/checkpoint.pth' \
    --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
    --vision_tower 'openai/clip-vit-large-patch14' \
    --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
    --data_dir 'path/to/data' \
    --visualization_root 'path/to/visualization_root'