YJX-Research/MultiSoundGen

MultiSoundGen

MultiSoundGen: a novel V2A framework for multi-event scenarios


MultiSoundGen introduces direct preference optimization (DPO) into the Video-to-Audio (V2A) domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios.

Key Contributions:
1. SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns both the core semantic representations and the rapid dynamic features of audio-visual data to handle multi-event complexity.
2. AVP-Ranked Preference Optimization (AVP-RPO), which integrates DPO into the V2A task. AVP-RPO uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality.
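
As a rough illustration of the first contribution, the sketch below applies a symmetric contrastive (InfoNCE) loss to two streams: a "slow" stream for core semantics and a "fast" stream for rapid dynamics. This is not the released SF-CAVP code (the source is not yet public). The function names, the CLIP-style loss, the `temperature` value, and the `fast_weight` parameter are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(video_embs, audio_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, audio) embeddings.

    Pair i is the positive; every other item in the batch is a negative.
    """
    v = [l2_normalize(e) for e in video_embs]
    a = [l2_normalize(e) for e in audio_embs]
    n = len(v)
    # logits[i][j] is the scaled cosine similarity of video i and audio j.
    logits = [
        [sum(x * y for x, y in zip(v[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


def slowfast_contrastive_loss(slow_v, slow_a, fast_v, fast_a, fast_weight=1.0):
    """Hypothetical combination: one contrastive term per stream."""
    return info_nce(slow_v, slow_a) + fast_weight * info_nce(fast_v, fast_a)


# Toy batch: matched pairs give a near-zero loss, swapped pairs a large one.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(video, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(video, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss only decreases when each video embedding is closer to its own audio embedding than to the others in the batch, which is the alignment property the contribution describes.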

Teaser


📰 News

  • Coming Soon   📝 All source code, models, and datasets will be released upon publication of the paper.
  • 2025.10   🔥Demo page covering various scenarios is now online!
  • 2025.09   🔥MultiSoundGen paper released on arXiv!

🚀 Features

  • AVP Model: SF-CAVP, a pioneering AVP model with a unified dual-stream architecture.
  • Model-Based DPO: Uses the AVP model (SF-CAVP) as a reward model to directly optimize V2A generation.
  • V2A SOTA: Achieves state-of-the-art V2A results in complex multi-event scenarios.
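
To make the model-based DPO idea concrete, here is a minimal sketch of the two pieces it implies: ranking candidate audio clips with a reward function to form a (preferred, rejected) pair, and scoring that pair with the standard DPO loss. The helper names `rank_pair` and `dpo_loss`, the `beta` value, and the use of plain log-likelihoods are assumptions; for a flow-matching backbone the log-likelihood terms would likely be replaced by a surrogate, and the README does not specify which.

```python
import math


def rank_pair(candidates, reward_fn):
    """Return (preferred, rejected): the highest- and lowest-reward candidates.

    reward_fn stands in for an AVP reward model that scores how well a
    generated audio clip matches its video (hypothetical interface).
    """
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    return ranked[0], ranked[-1]


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-likelihoods of the preferred / rejected sample.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy usage: rewards are looked up from a table instead of a real model.
toy_rewards = {"clip_a": 0.2, "clip_b": 0.9, "clip_c": 0.5}
preferred, rejected = rank_pair(list(toy_rewards), toy_rewards.get)

neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)       # policy equals reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)  # policy favors the preferred clip
```

The loss falls below its neutral value only when the policy raises the likelihood of the preferred sample relative to the reference more than it raises the rejected one.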

✨ Method Overview

The backbone of MultiSoundGen is MM-DiT (a multimodal diffusion transformer) trained with a conditional flow matching (CFM) objective. Two key innovations underpin MultiSoundGen: SF-CAVP and AVP-RPO. AVP-RPO uses SF-CAVP as a reward model to iteratively optimize the base model, improving both audio-video alignment and audio quality.
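
For readers unfamiliar with the CFM objective, the following is a minimal sketch of one training-loss evaluation, assuming the common linear interpolation path between Gaussian noise and the data sample. The path choice, the `velocity_fn(x_t, t, cond)` interface, and the flat list-of-floats representation of the audio latent are assumptions for illustration; the README does not state which CFM variant is used.

```python
import random


def cfm_loss(velocity_fn, x1, cond, rng):
    """One-sample conditional flow matching loss with a linear path.

    x1:   target data sample (e.g. an audio latent), as a list of floats.
    cond: conditioning passed through to the model (e.g. video features).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]               # noise sample
    t = rng.random()                                     # time in [0, 1)
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the path
    target = [b - a for a, b in zip(x0, x1)]             # velocity of the path
    pred = velocity_fn(x_t, t, cond)
    # Mean squared error between predicted and target velocity.
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(x1)


# Toy check: an oracle that knows x1 recovers the target velocity exactly,
# because x1 - x_t = (1 - t) * (x1 - x0) on the linear path.
data = [0.5, -1.0, 2.0]


def oracle(x_t, t, cond):
    return [(b - xt) / (1 - t) for b, xt in zip(data, x_t)]


def zero_velocity(x_t, t, cond):
    return [0.0 for _ in x_t]


oracle_loss = cfm_loss(oracle, data, None, random.Random(0))
zero_loss = cfm_loss(zero_velocity, data, None, random.Random(0))
```

In a real training loop `velocity_fn` would be the MM-DiT network and the loss would be averaged over a batch; the toy oracle only confirms that the regression target is consistent with the interpolation path.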

MultiSoundGen Overview


📝 TODO

  • Release complete source code and training scripts
  • Release models and datasets

📖 Citation

If you find MultiSoundGen useful in your research or work, please cite our paper:

Yang, J.; Yang, X.; Zhang, L.; Guo, X.; Wang, Z.; and Huang, G. 2025. MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization. arXiv preprint arXiv:2509.19999.

🙏 Acknowledgements

We would like to express our gratitude to all contributors and supporters of this project. Special thanks to the open-source community for providing valuable tools and frameworks that facilitated the development of MultiSoundGen. We also appreciate the feedback from reviewers and colleagues which helped improve this work.


⭐ Support Us

If you find our research interesting or useful, please consider starring ⭐ this repository. Your support is our greatest motivation!


📬 Contact

If you have any questions, suggestions, or would like to collaborate, please feel free to reach out to us:

