MultiSoundGen

MultiSoundGen: a novel V2A framework for multi-event scenarios

MultiSoundGen introduces direct preference optimization (DPO) into the Video-to-Audio (V2A) domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios.

Key Contributions:
1. SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns the core semantic representations and the rapid dynamic features of audio-visual data to handle multi-event complexity.
2. AVP-Ranked Preference Optimization (AVP-RPO), which brings the DPO method to the V2A task. AVP-RPO uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality.
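To make the idea of contrastive audio-visual alignment concrete, here is a minimal sketch of a symmetric InfoNCE loss applied to two embedding streams. It is not the SF-CAVP implementation: the function names (`info_nce`, `dual_stream_loss`), the temperature value, and the choice to simply sum a "slow" (semantic) and a "fast" (dynamic) loss are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(video_embs, audio_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th video and i-th audio form the positive pair."""
    n = len(video_embs)
    logits = [[cosine(v, a) / temperature for a in audio_embs] for v in video_embs]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))


def dual_stream_loss(slow_video, slow_audio, fast_video, fast_audio):
    """Hypothetical combination: sum of the slow-stream and fast-stream losses."""
    return info_nce(slow_video, slow_audio) + info_nce(fast_video, fast_audio)
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the signal a contrastive pretraining objective relies on.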

📰 News

Coming Soon 📝 All source code, models, and datasets will be released upon publication of the paper.
2025.10 🔥 The demo page covering various scenarios is now online!
2025.09 🔥 The MultiSoundGen paper is released on arXiv!

🚀 Features

AVP Model: Pioneering AVP model with a unified dual-stream architecture.
Model-Based DPO: Uses AVP model as a reward model to directly optimize V2A generation.
V2A SOTA: Achieves state-of-the-art V2A results in complex multi-event scenarios.
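The "Model-Based DPO" feature can be pictured as follows: a reward model scores several candidate audio clips generated for the same video, and the scores are turned into a preferred/rejected pair. The sketch below is only an illustration. `build_preference_pair` and `toy_reward` are hypothetical names, the reward is a stand-in for an AVP similarity score, and picking the best and worst candidates is just one possible pairing strategy.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Video = TypeVar("Video")
Audio = TypeVar("Audio")


def build_preference_pair(
    video: Video,
    candidates: Sequence[Audio],
    reward_fn: Callable[[Video, Audio], float],
) -> Tuple[Audio, Audio]:
    """Return (chosen, rejected): the highest- and lowest-scoring candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a preference pair")
    ranked = sorted(candidates, key=lambda audio: reward_fn(video, audio), reverse=True)
    return ranked[0], ranked[-1]


def toy_reward(video: float, audio: float) -> float:
    """Stand-in reward (assumption): closer numbers mean better alignment."""
    return -abs(video - audio)


# Usage: three fake "audio" candidates for one fake "video" value.
chosen, rejected = build_preference_pair(1.0, [0.2, 0.9, 3.0], toy_reward)
```

In a real pipeline the reward function would wrap the pretrained AVP model, and the resulting pairs would feed a preference-optimization loss.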

✨ Method Overview

The backbone of MultiSoundGen is a multimodal diffusion transformer (MM-DiT) trained with a conditional flow matching (CFM) objective. Two key innovations underpin MultiSoundGen: SF-CAVP and AVP-RPO. AVP-RPO uses SF-CAVP as a reward model to iteratively optimize the base model, boosting audio-video alignment and audio quality.
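For readers unfamiliar with DPO, the following sketch shows the generic DPO loss on a single preference pair. It is an assumption-laden illustration, not the AVP-RPO objective: this README does not specify how AVP-RPO defines the per-sample scores for a flow-based generator, so the inputs here are plain scalar log-probabilities, and `dpo_loss` and `beta=0.1` are illustrative choices.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) pair of log-probabilities.

    loss = -log(sigmoid(beta * margin)), where the margin measures how much more
    the policy prefers the chosen sample than the frozen reference model does.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = -beta * margin
    # -log(sigmoid(-z)) equals softplus(z); this form avoids overflow for large |z|.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Usage: a policy that raises the chosen sample's log-probability relative to
# the reference gets a lower loss than one that is identical to the reference.
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-3.0, -7.0, -5.0, -5.0)
```

When the policy matches the reference, the loss equals log 2; it decreases as the policy shifts probability toward the chosen sample.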

📝 TODO

Release complete source code and training scripts
Release models and datasets

📖 Citation

If you find MultiSoundGen useful in your research or work, please cite our paper:

Yang, J.; Yang, X.; Zhang, L.; Guo, X.; Wang, Z.; and Huang, G. 2025. MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization. arXiv preprint arXiv:2509.19999.

🙏 Acknowledgements

We would like to express our gratitude to all contributors and supporters of this project. Special thanks to the open-source community for providing valuable tools and frameworks that facilitated the development of MultiSoundGen. We also appreciate the feedback from reviewers and colleagues which helped improve this work.

⭐ Support Us

If you find our research interesting or useful, please consider giving this repository a ⭐. Your support is our greatest motivation!

📬 Contact

If you have any questions, suggestions, or would like to collaborate, please feel free to reach out to us:

Email: yangjianxuan@xiaomi.com
GitHub Issues: MultiSoundGen GitHub Issues
