# SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
Paper: arXiv 2509.17537
2026.1.18: Our paper got accepted to ICASSP 2026! Thanks to all co-authors and the anonymous reviewers!
### Dataset
Download the official Ref-AVSBench dataset from here and organize it as follows:

    ./REFAVS/data
    - /media
    - /gt_mask
    - /metadata.csv
    ./models/segment_anything
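Before training, it can help to verify the layout above is in place. The sketch below is a minimal check (not part of this repo); the relative paths are taken from the tree above, and the function name is our own:

```python
from pathlib import Path

# Expected Ref-AVSBench entries under the dataset root (from the tree above).
EXPECTED = [
    "data/media",
    "data/gt_mask",
    "data/metadata.csv",
]

def check_refavs_layout(root):
    """Return the expected entries that are missing under `root` (e.g. ./REFAVS)."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty return value means the dataset root matches the expected structure.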
### Checkpoints
Download our pretrained SimToken checkpoint.
### Core Requirements
This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution.
- numpy, pandas, matplotlib, opencv
- einops, timm
- sentencepiece
- transformers, peft
Newer versions of transformers and peft may introduce API changes or naming/registration conflicts that can trigger runtime errors in this project (e.g., custom model/config registration). We therefore pin:
- transformers==4.30.2
- peft==0.2.0
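If you want to confirm your environment matches these pins before running anything, a small stdlib-only check like the following works; the helper name is ours, not part of the repo:

```python
import re
from importlib import metadata

def check_pins(pins):
    """Compare installed package versions against '==' pins.

    `pins` is an iterable of lines like 'transformers==4.30.2'.
    Returns {name: (pinned, installed_or_None)} for every mismatch;
    an empty dict means all pins are satisfied.
    """
    bad = {}
    for line in pins:
        m = re.match(r"([A-Za-z0-9_.-]+)==(\S+)", line.strip())
        if not m:
            continue  # skip comments / blank lines
        name, want = m.groups()
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package not installed at all
        if have != want:
            bad[name] = (want, have)
    return bad
```

For example, `check_pins(["transformers==4.30.2", "peft==0.2.0"])` returns an empty dict only when both exact versions are installed.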
We also provide a complete requirements.txt for reference and easier reproduction:

    pip install -r requirements.txt
### Feature Extraction
We recommend running the following commands to pre-extract audio features and visual features compatible with SAM:

    python save_audio_feats.py --data_dir 'path/to/data'
    python save_sam_feats.py --data_dir 'path/to/data'
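The point of pre-extraction is to run the expensive encoders once and reuse the results during training. The scripts' internals are not shown here, so the sketch below only illustrates the general caching pattern with stdlib tools; `extract_fn` is a stand-in for a real audio or SAM image encoder:

```python
import hashlib
import pickle
from pathlib import Path

def cached_feature(path, extract_fn, cache_dir="feat_cache"):
    """Compute extract_fn(path) once and reuse the pickled result.

    `extract_fn` is a placeholder for a real extractor (e.g. an audio
    encoder or the SAM image encoder). Features are keyed by the
    absolute input path so repeated calls hit the cache.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha1(str(Path(path).resolve()).encode()).hexdigest()
    out = cache / f"{key}.pkl"
    if out.exists():
        with out.open("rb") as f:
            return pickle.load(f)
    feats = extract_fn(path)  # the expensive step, run at most once per file
    with out.open("wb") as f:
        pickle.dump(feats, f)
    return feats
```

In the actual repo, the cached artifacts would be the audio embeddings and SAM image embeddings consumed by train.py.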
### Training
To train our model on Ref-AVS Bench:

    python -W ignore train.py --name 'xxx' \
        --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
        --vision_tower 'openai/clip-vit-large-patch14' \
        --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
        --data_dir 'path/to/data' \
        --log_root 'path/to/log_root' \
        --checkpoint_root 'path/to/checkpoints_root'
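If you want to drive these flags from your own wrapper script, an argparse parser mirroring them looks like the following. Only the flag names come from the command above; the `required` settings and defaults are illustrative assumptions, not the repo's actual defaults:

```python
import argparse

def build_train_parser():
    """Parser mirroring the train.py flags shown above (a sketch)."""
    p = argparse.ArgumentParser(description="SimToken training (sketch)")
    p.add_argument("--name", required=True, help="experiment name")
    p.add_argument("--vision_pretrained",
                   help="SAM checkpoint, e.g. sam_vit_h_4b8939.pth")
    p.add_argument("--vision_tower", default="openai/clip-vit-large-patch14")
    p.add_argument("--mllm", default="Chat-UniVi/Chat-UniVi-7B-v1.5")
    p.add_argument("--data_dir", required=True, help="Ref-AVSBench root")
    p.add_argument("--log_root", default="./logs")
    p.add_argument("--checkpoint_root", default="./checkpoints")
    return p
```

A wrapper can then call `build_train_parser().parse_args()` and forward the namespace to its own launch logic.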
### Testing
To test our pretrained SimToken:

    python -W ignore load_model.py --saved_model 'path/to/checkpoint.pth' \
        --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
        --vision_tower 'openai/clip-vit-large-patch14' \
        --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
        --data_dir 'path/to/data' \
        --visualization_root 'path/to/visualization_root'
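Segmentation quality on Ref-AVSBench is typically scored by comparing predicted masks against the ground-truth masks in `gt_mask`. The repo's metric code is not shown here, so the snippet below is only a minimal pure-Python intersection-over-union for two binary masks, as a reference point for sanity-checking outputs:

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks (nested 0/1 lists).

    A minimal stand-in for a segmentation metric; a real pipeline
    would operate on tensors/arrays of the same shape instead.
    """
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += p and g   # 1 only where both masks are 1
            union += p or g    # 1 where either mask is 1
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0
```

For example, `mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])` gives 0.5 (one overlapping pixel out of two in the union).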