MultiSoundGen: a novel V2A framework for multi-event scenarios
MultiSoundGen introduces direct preference optimization (DPO) into the Video-to-Audio (V2A) domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios.
Key Contributions:
1. We propose SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns both the core semantic representations and the rapid dynamic features of audio-visual data to handle multi-event complexity.
2. We integrate DPO into the V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO). AVP-RPO uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality.
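The idea behind the second contribution, ranking generated candidates with a reward model and then optimizing on the resulting preference pairs, can be illustrated with a minimal sketch. This is not the released AVP-RPO implementation: `rank_pair` and its `score` callable are hypothetical stand-ins for scoring with SF-CAVP, the loss is the standard DPO objective, and the scalar log-probabilities are placeholders for whatever likelihood surrogate the generator provides.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def rank_pair(candidates: Sequence[T], score: Callable[[T], float]) -> Tuple[T, T]:
    """Pick a (preferred, rejected) pair from candidates.

    `score` is a hypothetical stand-in for a reward model such as SF-CAVP:
    higher means better audio-visual alignment.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0], ranked[-1]


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,  # assumed value; not taken from the paper
) -> float:
    """Standard DPO loss for one (preferred w, rejected l) pair.

    loss = -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals softplus(-m)
    return _softplus(-margin)
```

The loss shrinks as the policy assigns relatively more probability to the preferred sample than the frozen reference model does, which is what pushes generation toward outputs the reward model ranks highly.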
- Coming Soon 📝 All source code, models, and datasets will be released upon publication of the paper.
- 2025.10 🔥 A demo page covering various scenarios is now online!
- 2025.09 🔥 MultiSoundGen paper released on arXiv!
- AVP Model: Pioneering AVP model with a unified dual-stream architecture.
- Model-Based DPO: Uses AVP model as a reward model to directly optimize V2A generation.
- V2A SOTA: Achieves state-of-the-art V2A results in complex multi-event scenarios.
The backbone of MultiSoundGen is MM-DiT trained with a conditional flow matching (CFM) objective. Two key innovations underpin MultiSoundGen: SF-CAVP and AVP-RPO. AVP-RPO uses SF-CAVP as a reward model to iteratively optimize the base model, improving both audio-video alignment and audio quality.
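For readers unfamiliar with the CFM objective mentioned above, the following is a minimal sketch of one training step on plain Python lists. It assumes the commonly used straight-line probability path between Gaussian noise and data; this README does not state which path MultiSoundGen uses, and `velocity_fn` is a hypothetical placeholder for the MM-DiT backbone (conditioning on video features is omitted).

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def cfm_training_pair(
    x0: Sequence[float], x1: Sequence[float], t: float
) -> Tuple[Vector, Vector]:
    """Build the network input and regression target for one CFM sample.

    Assumes a straight-line path: x_t = (1 - t) * x0 + t * x1,
    whose target velocity is the constant x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def cfm_loss(
    velocity_fn: Callable[[Vector, float], Vector],
    x1: Sequence[float],
    rng: random.Random,
) -> float:
    """Mean squared error between predicted and target velocity."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t, target = cfm_training_pair(x0, x1, t)
    pred = velocity_fn(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x1)
```

In a real model, `x1` would be an audio latent and the loss would be averaged over a batch and minimized with respect to the network parameters.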
- Release complete source code and training scripts
- Release models and datasets
If you find MultiSoundGen useful in your research or work, please cite our paper:
Yang, J.; Yang, X.; Zhang, L.; Guo, X.; Wang, Z.; and Huang, G. 2025. MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization. arXiv preprint arXiv:2509.19999.
We thank all contributors and supporters of this project. Special thanks go to the open-source community for the tools and frameworks that facilitated the development of MultiSoundGen. We also appreciate the feedback from reviewers and colleagues, which helped improve this work.
If you find our research interesting or useful, please consider ⭐ this repository. Your support is our greatest motivation!
If you have any questions, suggestions, or would like to collaborate, please feel free to reach out to us:
- Email: yangjianxuan@xiaomi.com
- GitHub Issues: MultiSoundGen GitHub Issues

