Official PyTorch implementation of our papers:
Can CLIP Help Sound Source Localization?
Sooyoung Park*, Arda Senocak*, Joon Son Chung (* Equal Contribution)
WACV 2024
Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization
Sooyoung Park*, Arda Senocak*, Joon Son Chung (* Equal Contribution)
IJCV 2026
This repo is the PyTorch implementation of Audio-Grounded Contrastive Learning (ACL). The code is kept simple and easy to follow.
Parts of this code are based on AudioToken, BEATs, and TCL.
- Python = 3.10.8
- PyTorch = 1.13.0
- transformers = 4.25.1
$ conda install -c nvidia cudatoolkit=11.7
$ conda install -c conda-forge cudnn
$ conda install python=3.10
$ pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
$ pip install tensorboard
$ pip install transformers==4.25.1
$ pip install opencv-python
$ pip install tqdm
$ pip install scikit-learn
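As a quick sanity check of the environment (this snippet is not part of the repo, just an illustrative check), you can print the installed versions and confirm your GPUs are visible:

```python
# env_check.py -- illustrative environment check, not part of this repo
import torch
import torchvision
import torchaudio
import transformers

print("torch:", torch.__version__)                 # expected 1.13.0+cu117
print("torchvision:", torchvision.__version__)     # expected 0.14.0+cu117
print("torchaudio:", torchaudio.__version__)       # expected 0.13.0
print("transformers:", transformers.__version__)   # expected 4.25.1
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
```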
Important Note: All audio samples must be converted to 16 kHz. For detailed instructions, refer to the README in each dataset-specific directory.
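If some of your files are not already at 16 kHz, a minimal resampling sketch with torchaudio could look like this (the paths are placeholders; the dataset-specific readmes remain the authoritative instructions):

```python
# resample_16k.py -- illustrative sketch for converting one clip to 16 kHz
import torchaudio
import torchaudio.functional as F

in_path = "example_input.wav"   # placeholder input path
out_path = "example_16k.wav"    # placeholder output path

waveform, sample_rate = torchaudio.load(in_path)
if sample_rate != 16000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)
torchaudio.save(out_path, waveform, 16000)
```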
- Dataset
Download the pretrained audio backbone model into the pretrain folder:
- BEATs: https://github.com/microsoft/unilm/tree/master/beats
- BEATs_iter3_plus_AS2M_finedtuned_on_AS2M_cpt2.pt
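After downloading, a quick check that the checkpoint is readable (the path below assumes the file was placed under ./pretrain/; official BEATs checkpoints are typically dictionaries with 'cfg' and 'model' entries, but verify against the BEATs repository):

```python
# check_beats_ckpt.py -- illustrative check; the path is an assumption
import torch

ckpt_path = "./pretrain/BEATs_iter3_plus_AS2M_finedtuned_on_AS2M_cpt2.pt"
ckpt = torch.load(ckpt_path, map_location="cpu")

print(type(ckpt))                 # usually a dict
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))      # e.g. ['cfg', 'model'] for official BEATs checkpoints
```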
- Ensure that you check the .sh files and set `export CUDA_VISIBLE_DEVICES="**"` according to your hardware setup.
- Make sure that `--model_name` corresponds to the configuration file located at `./config/model/{--model_name}.yaml`.
- Model files (.pth) will be saved in the directory `{--save_path}/Train_record/{--model_name}_{--exp_name}/`.
- Review the configuration settings in `./config/train/{--train_config}.yaml` to ensure they match your training requirements (a small sketch for inspecting these YAML files follows the training commands below).
- Choose one of the following methods to initiate training:

$ sh SingleGPU_Experiment.sh    # For single GPU setup
$ sh Distributed_Experiment.sh  # For multi-GPU setup (DDP)
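The YAML files referenced above can be inspected before launching a run. Below is a minimal, illustrative sketch (the file names are placeholders, not names shipped with this repo; it also assumes PyYAML is available, e.g. via `pip install pyyaml`):

```python
# show_config.py -- illustrative sketch for inspecting the YAML configs
import pprint
import yaml

# Placeholder file names; substitute your actual {--model_name} and {--train_config}.
model_cfg_path = "./config/model/my_model.yaml"
train_cfg_path = "./config/train/my_train_config.yaml"

with open(model_cfg_path) as f:
    pprint.pprint(yaml.safe_load(f))   # model architecture settings
with open(train_cfg_path) as f:
    pprint.pprint(yaml.safe_load(f))   # training hyperparameters
```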
- Before testing, please review the .sh file and set the `export CUDA_VISIBLE_DEVICES="**"` environment variable according to your hardware configuration.
- Ensure that the `--model_name` parameter corresponds to the configuration file located at `./config/model/{--model_name}.yaml`.
- Model files (.pth) located at `{--save_path}/{--model_name}_{--exp_name}/Param_{--epochs}.pth` will be used for testing.
- The `--epochs` parameter can accept either an integer or a list of integers (e.g., 1, 2, 3).
- If `--epochs` is left unspecified (null), the default model file `{--save_path}/Train_record/{--model_name}_{--exp_name}/Param_best.pth` will be used for testing (see the path-resolution sketch after this list).
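To make the checkpoint-selection rules above concrete, here is an illustrative sketch that builds the paths the test step looks for (all values are placeholders for the corresponding command-line arguments):

```python
# resolve_ckpts.py -- illustrative sketch of the checkpoint paths described above
import os

save_path = "./checkpoints"   # placeholder for --save_path
model_name = "my_model"       # placeholder for --model_name
exp_name = "my_exp"           # placeholder for --exp_name
epochs = [1, 2, 3]            # --epochs: int, list of ints, or None

if epochs is None:
    # Default: the best checkpoint recorded during training
    paths = [os.path.join(save_path, "Train_record",
                          f"{model_name}_{exp_name}", "Param_best.pth")]
else:
    if isinstance(epochs, int):
        epochs = [epochs]
    paths = [os.path.join(save_path, f"{model_name}_{exp_name}", f"Param_{e}.pth")
             for e in epochs]

for p in paths:
    print(p, "exists" if os.path.exists(p) else "missing")
```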
$ sh Test_PTModels

Important Note: After downloading the Param_best.pth file, move it to the directory `{--save_path}/{--model_name}_{--exp_name}/` before use.
- VGG-Sound 144k trained model: [Link]
- This model was trained using a 2-GPU setup.
- The reported numbers are the best over multiple runs; performance varies across random seeds, and the provided .pth link corresponds to the checkpoint that achieved the best result.
- Model trained with AV Caption (IJCV version): [Link]
If you use this project, please cite it as:
@inproceedings{park2024can,
title={Can clip help sound source localization?},
author={Park, Sooyoung and Senocak, Arda and Chung, Joon Son},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
year={2024}
}
@article{park2026hearing,
title={Hearing and seeing through clip: A framework for self-supervised sound source localization},
author={Park, Sooyoung and Senocak, Arda and Chung, Joon Son},
journal={International Journal of Computer Vision},
volume={134},
number={4},
pages={179},
year={2026},
publisher={Springer}
}