# SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
Paper: arXiv 2509.17537
2026.1.18: Our paper got accepted to ICASSP 2026! Thanks to all co-authors and the anonymous reviewers!
### Dataset
Download the official Ref-AVSBench dataset from here and organize it as follows:

    ./REFAVS/data
    - /media
    - /gt_mask
    - /metadata.csv
    ./models/segment_anything
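Before training, it can help to verify the layout above is in place. The sketch below is a minimal check (not part of this repo); the relative paths are taken from the tree above, and the function name is our own:

```python
from pathlib import Path

# Expected Ref-AVSBench entries under the dataset root (from the tree above).
EXPECTED = [
    "data/media",
    "data/gt_mask",
    "data/metadata.csv",
]

def check_refavs_layout(root):
    """Return the expected entries that are missing under `root` (e.g. ./REFAVS)."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty return value means the dataset root matches the expected structure.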
### Checkpoints
Download our pretrained SimToken checkpoint.
### Core Requirements
This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution.
- numpy, pandas, matplotlib, opencv
- einops, timm
- sentencepiece
- transformers, peft
Newer versions of transformers and peft may introduce API changes or naming/registration conflicts that can trigger runtime errors in this project (e.g., custom model/config registration). We therefore pin:
- transformers==4.30.2
- peft==0.2.0
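If you want to confirm your environment matches these pins before running anything, a small stdlib-only check like the following works; the helper name is ours, not part of the repo:

```python
import re
from importlib import metadata

def check_pins(pins):
    """Compare installed package versions against '==' pins.

    `pins` is an iterable of lines like 'transformers==4.30.2'.
    Returns {name: (pinned, installed_or_None)} for every mismatch;
    an empty dict means all pins are satisfied.
    """
    bad = {}
    for line in pins:
        m = re.match(r"([A-Za-z0-9_.-]+)==(\S+)", line.strip())
        if not m:
            continue  # skip comments / blank lines
        name, want = m.groups()
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package not installed at all
        if have != want:
            bad[name] = (want, have)
    return bad
```

For example, `check_pins(["transformers==4.30.2", "peft==0.2.0"])` returns an empty dict only when both exact versions are installed.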
We also provide a complete requirements.txt for reference and easier reproduction:

    pip install -r requirements.txt
### Feature Extraction
We recommend running the following commands to pre-extract audio features and visual features compatible with SAM:

    python save_audio_feats.py --data_dir 'path/to/data'
    python save_sam_feats.py --data_dir 'path/to/data'
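The point of pre-extraction is to run the expensive encoders once and reuse the results during training. The scripts' internals are not shown here, so the sketch below only illustrates the general caching pattern with stdlib tools; `extract_fn` is a stand-in for a real audio or SAM image encoder:

```python
import hashlib
import pickle
from pathlib import Path

def cached_feature(path, extract_fn, cache_dir="feat_cache"):
    """Compute extract_fn(path) once and reuse the pickled result.

    `extract_fn` is a placeholder for a real extractor (e.g. an audio
    encoder or the SAM image encoder). Features are keyed by the
    absolute input path so repeated calls hit the cache.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha1(str(Path(path).resolve()).encode()).hexdigest()
    out = cache / f"{key}.pkl"
    if out.exists():
        with out.open("rb") as f:
            return pickle.load(f)
    feats = extract_fn(path)  # the expensive step, run at most once per file
    with out.open("wb") as f:
        pickle.dump(feats, f)
    return feats
```

In the actual repo, the cached artifacts would be the audio embeddings and SAM image embeddings consumed by train.py.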
### Training
To train our model on Ref-AVS Bench:

    python -W ignore train.py --name 'xxx' \
        --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
        --vision_tower 'openai/clip-vit-large-patch14' \
        --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
        --data_dir 'path/to/data' \
        --log_root 'path/to/log_root' \
        --checkpoint_root 'path/to/checkpoints_root'
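If you want to drive these flags from your own wrapper script, an argparse parser mirroring them looks like the following. Only the flag names come from the command above; the `required` settings and defaults are illustrative assumptions, not the repo's actual defaults:

```python
import argparse

def build_train_parser():
    """Parser mirroring the train.py flags shown above (a sketch)."""
    p = argparse.ArgumentParser(description="SimToken training (sketch)")
    p.add_argument("--name", required=True, help="experiment name")
    p.add_argument("--vision_pretrained",
                   help="SAM checkpoint, e.g. sam_vit_h_4b8939.pth")
    p.add_argument("--vision_tower", default="openai/clip-vit-large-patch14")
    p.add_argument("--mllm", default="Chat-UniVi/Chat-UniVi-7B-v1.5")
    p.add_argument("--data_dir", required=True, help="Ref-AVSBench root")
    p.add_argument("--log_root", default="./logs")
    p.add_argument("--checkpoint_root", default="./checkpoints")
    return p
```

A wrapper can then call `build_train_parser().parse_args()` and forward the namespace to its own launch logic.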
### Testing
To test our pretrained SimToken:

    python -W ignore load_model.py --saved_model 'path/to/checkpoint.pth' \
        --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
        --vision_tower 'openai/clip-vit-large-patch14' \
        --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
        --data_dir 'path/to/data' \
        --visualization_root 'path/to/visualization_root'
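Segmentation quality on Ref-AVSBench is typically scored by comparing predicted masks against the ground-truth masks in `gt_mask`. The repo's metric code is not shown here, so the snippet below is only a minimal pure-Python intersection-over-union for two binary masks, as a reference point for sanity-checking outputs:

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks (nested 0/1 lists).

    A minimal stand-in for a segmentation metric; a real pipeline
    would operate on tensors/arrays of the same shape instead.
    """
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += p and g   # 1 only where both masks are 1
            union += p or g    # 1 where either mask is 1
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0
```

For example, `mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])` gives 0.5 (one overlapping pixel out of two in the union).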