Official PyTorch implementation of our papers:
Can CLIP Help Sound Source Localization?
Sooyoung Park*, Arda Senocak*, Joon Son Chung (* Equal Contribution)
WACV 2024
Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization
Sooyoung Park*, Arda Senocak*, Joon Son Chung (* Equal Contribution)
IJCV 2026
This repo is the PyTorch implementation of Audio-Grounded Contrastive Learning (ACL). The code is kept simple and easy to follow.
Parts of this code are based on AudioToken, BEATs, and TCL.
- Python = 3.10.8
- PyTorch = 1.13.0
- transformers = 4.25.1
$ conda install -c nvidia cudatoolkit=11.7
$ conda install -c conda-forge cudnn
$ conda install python=3.10
$ pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
$ pip install tensorboard
$ pip install transformers==4.25.1
$ pip install opencv-python
$ pip install tqdm
$ pip install scikit-learn
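As a quick sanity check of the environment (this snippet is not part of the repo, just an illustrative check), you can print the installed versions and confirm your GPUs are visible:

```python
# env_check.py -- illustrative environment check, not part of this repo
import torch
import torchvision
import torchaudio
import transformers

print("torch:", torch.__version__)                 # expected 1.13.0+cu117
print("torchvision:", torchvision.__version__)     # expected 0.14.0+cu117
print("torchaudio:", torchaudio.__version__)       # expected 0.13.0
print("transformers:", transformers.__version__)   # expected 4.25.1
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
```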
Important Note: All audio samples must be converted to 16 kHz. For detailed instructions, refer to the README in each dataset-specific directory.
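If some of your files are not already at 16 kHz, a minimal resampling sketch with torchaudio could look like this (the paths are placeholders; the dataset-specific readmes remain the authoritative instructions):

```python
# resample_16k.py -- illustrative sketch for converting one clip to 16 kHz
import torchaudio
import torchaudio.functional as F

in_path = "example_input.wav"   # placeholder input path
out_path = "example_16k.wav"    # placeholder output path

waveform, sample_rate = torchaudio.load(in_path)
if sample_rate != 16000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)
torchaudio.save(out_path, waveform, 16000)
```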
- Dataset
Download the pretrained audio backbone model into the pretrain folder:
- BEATs: https://github.com/microsoft/unilm/tree/master/beats
- BEATs_iter3_plus_AS2M_finedtuned_on_AS2M_cpt2.pt
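After downloading, a quick check that the checkpoint is readable (the path below assumes the file was placed under ./pretrain/; official BEATs checkpoints are typically dictionaries with 'cfg' and 'model' entries, but verify against the BEATs repository):

```python
# check_beats_ckpt.py -- illustrative check; the path is an assumption
import torch

ckpt_path = "./pretrain/BEATs_iter3_plus_AS2M_finedtuned_on_AS2M_cpt2.pt"
ckpt = torch.load(ckpt_path, map_location="cpu")

print(type(ckpt))                 # usually a dict
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))      # e.g. ['cfg', 'model'] for official BEATs checkpoints
```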
- Ensure that you check the .sh files and set `export CUDA_VISIBLE_DEVICES="**"` according to your hardware setup.
- Make sure that `--model_name` corresponds to the configuration file located at `./config/model/{--model_name}.yaml`.
- Model files (.pth) will be saved in the directory `{--save_path}/Train_record/{--model_name}_{--exp_name}/`.
- Review the configuration settings in `./config/train/{--train_config}.yaml` to ensure they match your training requirements (a small sketch for inspecting these YAML files follows the training commands below).
- Choose one of the following methods to initiate training:

$ sh SingleGPU_Experiment.sh    # For single GPU setup
$ sh Distributed_Experiment.sh  # For multi-GPU setup (DDP)
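The YAML files referenced above can be inspected before launching a run. Below is a minimal, illustrative sketch (the file names are placeholders, not names shipped with this repo; it also assumes PyYAML is available, e.g. via `pip install pyyaml`):

```python
# show_config.py -- illustrative sketch for inspecting the YAML configs
import pprint
import yaml

# Placeholder file names; substitute your actual {--model_name} and {--train_config}.
model_cfg_path = "./config/model/my_model.yaml"
train_cfg_path = "./config/train/my_train_config.yaml"

with open(model_cfg_path) as f:
    pprint.pprint(yaml.safe_load(f))   # model architecture settings
with open(train_cfg_path) as f:
    pprint.pprint(yaml.safe_load(f))   # training hyperparameters
```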
- Before testing, please review the .sh file and set the `export CUDA_VISIBLE_DEVICES="**"` environment variable according to your hardware configuration.
- Ensure that the `--model_name` parameter corresponds to the configuration file located at `./config/model/{--model_name}.yaml`.
- Model files (.pth) located at `{--save_path}/{--model_name}_{--exp_name}/Param_{--epochs}.pth` will be used for testing.
- The `--epochs` parameter can accept either an integer or a list of integers (e.g., 1, 2, 3).
- If `--epochs` is left unspecified (null), the default model file `{--save_path}/Train_record/{--model_name}_{--exp_name}/Param_best.pth` will be used for testing (see the path-resolution sketch after this list).
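To make the checkpoint-selection rules above concrete, here is an illustrative sketch that builds the paths the test step looks for (all values are placeholders for the corresponding command-line arguments):

```python
# resolve_ckpts.py -- illustrative sketch of the checkpoint paths described above
import os

save_path = "./checkpoints"   # placeholder for --save_path
model_name = "my_model"       # placeholder for --model_name
exp_name = "my_exp"           # placeholder for --exp_name
epochs = [1, 2, 3]            # --epochs: int, list of ints, or None

if epochs is None:
    # Default: the best checkpoint recorded during training
    paths = [os.path.join(save_path, "Train_record",
                          f"{model_name}_{exp_name}", "Param_best.pth")]
else:
    if isinstance(epochs, int):
        epochs = [epochs]
    paths = [os.path.join(save_path, f"{model_name}_{exp_name}", f"Param_{e}.pth")
             for e in epochs]

for p in paths:
    print(p, "exists" if os.path.exists(p) else "missing")
```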
$ sh Test_PTModels

Important Note: After downloading the Param_best.pth file, move it to the directory `{--save_path}/{--model_name}_{--exp_name}/` before use.
- VGG-Sound 144k trained model: [Link]
- This model was trained using a 2-GPU setup.
- The reported numbers are the best over multiple runs; performance varies across random seeds, and the provided .pth link corresponds to the checkpoint that achieved the best result.
- Model trained with AV Caption (IJCV version): [Link]
If you use this project, please cite it as:
@inproceedings{park2024can,
title={Can clip help sound source localization?},
author={Park, Sooyoung and Senocak, Arda and Chung, Joon Son},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
year={2024}
}
@article{park2026hearing,
title={Hearing and seeing through clip: A framework for self-supervised sound source localization},
author={Park, Sooyoung and Senocak, Arda and Chung, Joon Son},
journal={International Journal of Computer Vision},
volume={134},
number={4},
pages={179},
year={2026},
publisher={Springer}
}