ThinkSound

๐ŸŒ English | Espaรฑol | Franรงais | Japanese

NeurIPS 2025 | arXiv | Online Demo | Hugging Face | ModelScope

If you find this project useful, a star ⭐ on GitHub would be greatly appreciated!


Repository layout

This ThinkSound GitHub repository hosts two related projects on separate branches:

| Branch | Project | Documentation |
|---|---|---|
| master | ThinkSound (NeurIPS 2025): unified Any2Audio generation with CoT-guided flow matching | This file: README.md |
| prismaudio | PrismAudio: follow-up work (ICLR 2026) on video-to-audio with multi-dimensional CoT-RL | README.md on the prismaudio branch |

For ThinkSound, use branch master (this README). For PrismAudio, check out prismaudio and follow README.md there.


ThinkSound is a unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning.

PyTorch implementation for multimodal audio generation and editing: generate or edit audio from video, text, and audio, powered by step-by-step reasoning from Multimodal Large Language Models (MLLMs).

Teaser

📰 News

  • 2026.03.24 🔥 PrismAudio is released in the same repo on branch prismaudio; see README.md there for setup and models.
  • 2026.01.26 🎉 PrismAudio accepted to the ICLR 2026 Main Conference (code/docs on prismaudio).
  • 2025.11.25 🔥 Online PrismAudio Demo is live.
  • 2025.11.25 🔥 PrismAudio paper on arXiv: multi-dimensional CoT-RL for video-to-audio.
  • 2025.09.19 🎉 ThinkSound accepted to the NeurIPS 2025 Main Conference!
  • 2025.09.01 Our AudioCoT dataset is now open-sourced and available on Hugging Face!
  • 2025.07.17 🧠 Finetuning enabled: training and finetuning code is now publicly available, along with usage instructions to help you customize and extend ThinkSound with your own data.
  • 2025.07.15 📦 Simplified installation and usability: dependencies are on PyPI for easy cross-platform setup; Windows .bat scripts automate environment creation and script running.
  • 2025.07.08 🔧 Major update: the model is now lighter, with optimized memory and GPU usage, and supports high-throughput audio generation at scale!
  • 2025.07.01 Online demo on Hugging Face Spaces and ModelScope for interactive experience!
  • 2025.07.01 Released inference scripts and web interface.
  • 2025.06 ThinkSound paper released on arXiv!
  • 2025.06 Online Demo is live; try it now!

Follow-up: PrismAudio (same repo, prismaudio branch)

PrismAudio (ICLR 2026) is the successor to ThinkSound, developed under a new name but kept in this repository on branch prismaudio. Installation, checkpoints, and citation are in README.md on that branch.

👉 Run git checkout prismaudio, or open the branch on GitHub.


🚀 Features

  • Any2Audio: Generate audio from arbitrary modalities (video, text, audio, or their combinations).
  • Video-to-Audio SOTA: Achieves state-of-the-art results on multiple V2A benchmarks.
  • CoT-Driven Reasoning: Chain-of-Thought reasoning for compositional and controllable audio generation via MLLMs.
  • Interactive Object-centric Editing: Refine or edit specific sound events by clicking on visual objects or using text instructions.
  • Unified Framework: One foundation model supports generation, editing, and interactive workflow.

✨ Method Overview

ThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning:

  1. Foley Generation: Generate foundational, semantically and temporally aligned soundscapes from video.
  2. Object-Centric Refinement: Refine or add sounds for user-specified objects via clicks or regions in the video.
  3. Targeted Audio Editing: Modify generated audio using high-level natural language instructions.

ThinkSound Overview


⚡ Quick Start

Environment Preparation:

# ThinkSound code: branch master. PrismAudio: clone with -b prismaudio (see README.md on that branch).
git clone -b master https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound
conda create -n thinksound python=3.10
conda activate thinksound
pip install thinksound
conda install -y -c conda-forge 'ffmpeg<7'
# Download the pretrained weights from https://huggingface.co/liuhuadai/ThinkSound into the ckpts/ directory.
# The model weights can also be downloaded from https://www.modelscope.cn/models/iic/ThinkSound
git lfs install
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
# To improve inference and training speed, you may optionally install a FlashAttention backend compatible with your system and PyTorch version.

✅ Windows Tip:
Windows users can run setup_windows.bat (or double-click it) to create the conda environment, install all dependencies (including FFmpeg), and download the pretrained model automatically, with no manual setup required.
Make sure conda and git are installed and available in your system PATH before running the script.
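
After the setup steps above, a quick sanity check can catch the two most common problems before running inference: an FFmpeg build outside the `ffmpeg<7` constraint, and a missing `ckpts` directory. The sketch below is not part of the repository; the helper names `parse_ffmpeg_major` and `check_environment` are hypothetical, and it assumes the weights were cloned into `ckpts` as shown in the commands above.

```python
import re
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def parse_ffmpeg_major(version_output: str) -> Optional[int]:
    """Extract the major version from the output of `ffmpeg -version`.

    Returns None when the version string is not a plain release number
    (for example, some git snapshot builds).
    """
    match = re.search(r"ffmpeg version n?(\d+)", version_output)
    return int(match.group(1)) if match else None


def check_environment(ckpt_dir: str = "ckpts") -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # The Quick Start installs 'ffmpeg<7', so flag anything newer.
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        problems.append("ffmpeg not found on PATH")
    else:
        result = subprocess.run([ffmpeg, "-version"], capture_output=True, text=True)
        major = parse_ffmpeg_major(result.stdout)
        if major is None:
            problems.append("could not parse the ffmpeg version")
        elif major >= 7:
            problems.append(f"ffmpeg major version is {major}, expected < 7")

    # Assumption: pretrained weights live in ckpts/ (the git clone target above).
    ckpts = Path(ckpt_dir)
    if not ckpts.is_dir() or not any(ckpts.iterdir()):
        problems.append(f"checkpoint directory '{ckpt_dir}' is missing or empty")

    return problems
```

Calling `check_environment()` from the repository root and printing the returned list is enough for a manual check. It does not verify that the downloaded files are complete.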

โ–ถ๏ธ Run the Demo

Linux/macOS

chmod +x scripts/demo.sh
./scripts/demo.sh <path-to-your-demo-video> <title> <CoT description> [use-half]

Windows

You can use the provided .bat script instead:

.\scripts\demo.bat <path-to-your-demo-video> <title> <CoT description> [use-half]

Note:

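  • <path-to-your-demo-video>: The path to a single video.
  • <title> and <CoT description>: Quote each one so that text containing spaces is passed as a single argument.
  • [use-half] (optional): Add use-half at the end to enable half-precision feature extraction.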
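
When the demo is driven from Python rather than typed into a shell, passing the arguments as a list avoids quoting problems with titles and CoT descriptions that contain spaces. The following is a minimal sketch that only wraps the Linux/macOS script shown above; the function names `build_demo_command` and `run_demo` are hypothetical and not part of the repository.

```python
import subprocess
from typing import List


def build_demo_command(
    video_path: str,
    title: str,
    cot_description: str,
    use_half: bool = False,
    script: str = "./scripts/demo.sh",
) -> List[str]:
    """Build the argument list in the order the demo script expects."""
    command = [script, video_path, title, cot_description]
    if use_half:
        # The optional flag goes last, as in the usage line above.
        command.append("use-half")
    return command


def run_demo(video_path: str, title: str, cot_description: str, use_half: bool = False) -> None:
    """Run the demo script; raises CalledProcessError if it exits with an error."""
    # A list (rather than a single string) means no shell splitting happens,
    # so each element reaches the script as exactly one argument.
    subprocess.run(build_demo_command(video_path, title, cot_description, use_half), check=True)
```

On Windows, the same list can be used with `script` set to the path of the `.bat` file, although how the batch script handles arguments with embedded quotes has not been verified here.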

📦 Batch Inference

Linux/macOS

chmod +x scripts/eval_batch.sh
./scripts/eval_batch.sh <video_path> <csv_path> <save_path (optional)> [use-half]

Windows

Use the equivalent .bat script:

.\scripts\eval_batch.bat <video_path> <csv_path> <save_path (optional)> [use-half]

Note:

  • <video_path>: Path to the root directory containing all .mp4 videos to be processed (all videos must be of equal duration).
  • <csv_path>: A CSV file with text prompts for each video (see demo_test.csv for format).
  • <save_path> (optional): Where to save generated audio. Defaults to results/features.
  • [use-half] (optional): Add use-half at the end to enable half precision feature extraction.
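
Because batch inference requires all videos to have equal duration, it can help to check the input directory before launching a long run. The sketch below is an illustration only and not part of the repository: it assumes `ffprobe` (shipped with FFmpeg) is on PATH, the helper names are hypothetical, and the 0.05-second tolerance is an arbitrary choice, since the README does not say how strict the requirement is.

```python
import statistics
import subprocess
from pathlib import Path
from typing import Dict


def probe_duration(path: Path) -> float:
    """Ask ffprobe for the container duration of one file, in seconds."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            str(path),
        ],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def find_outliers(durations: Dict[str, float], tolerance: float = 0.05) -> Dict[str, float]:
    """Return the files whose duration differs from the median by more than `tolerance` seconds."""
    if not durations:
        return {}
    reference = statistics.median(durations.values())
    return {name: d for name, d in durations.items() if abs(d - reference) > tolerance}


def check_video_directory(video_root: str, tolerance: float = 0.05) -> Dict[str, float]:
    """Probe every .mp4 directly under `video_root` and report duration outliers."""
    durations = {p.name: probe_duration(p) for p in sorted(Path(video_root).glob("*.mp4"))}
    return find_outliers(durations, tolerance)
```

An empty result from `check_video_directory("<video_path>")` means every clip is within the tolerance of the median duration; any reported files would need trimming or padding before being passed to the batch script.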

Web Interface Usage

For an interactive experience, launch the Gradio web interface:

python app.py

๐Ÿ‹๏ธ Train the Model

See Training.md


📄 License

This project is released under the Apache 2.0 License.

Note: The code, models, and dataset are for research and educational purposes only. Commercial use is NOT permitted. For commercial licensing, please contact the authors.

📦 Third-Party Components

  • Stable Audio Open VAE (by Stability AI): This repository includes a fine-tuned VAE from Stable Audio Open, licensed under the Stability AI Community License. Commercial use and redistribution require prior permission from Stability AI.

  • 📘 All other code and models are released under the Apache License 2.0.


Acknowledgements

Many thanks to:

  • stable-audio-tools (by Stability AI): For providing an easy-to-use framework for audio generation, as well as the VAE module and weights.
  • MMAudio: For the implementation of the MM-DiT backbone in the audio domain.

📖 Citation

If you find our project useful in your research or work, please cite our papers:

@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
      title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing}, 
      author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
      year={2025},
      eprint={2506.21448},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2506.21448}, 
}
@misc{liu2025prismaudiodecomposedchainofthoughtsmultidimensional,
      title={PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation},
      author={Huadai Liu and Kaicheng Luo and Wen Wang and Qian Chen and Peiwen Sun and Rongjie Huang and Xiangang Li and Jieping Ye and Wei Xue},
      year={2025},
      eprint={2511.18833},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2511.18833},
}
