Official repository for the Multi-human Interactive Talking Dataset
- We will release the code and dataset within 3 months.
- [2025.5.11] We initialized the repo.
We present a high-quality dataset for multi-human interactive talking video generation, comprising over
To showcase the potential of our data collection pipeline and further increase dataset diversity, we expand our dataset with an additional 3 hours of video from the YouTube short film channel Omeleto, which features rich, natural interactions and diverse character dynamics. While these additional videos do not contain shot transitions, they may include camera motion, occlusions, and other real-world artifacts. We believe this subset serves as a challenging and complementary test set that augments the cleaner, studio-style data in the main dataset. Some samples are shown below:
If you want to construct your own dataset, please refer to the following folder structure:
```
├── data_collection_pipeline/
│   ├── 4_spaiens/
│   │   ├── spaiens.txt
│   ├── whisperV/
│   │   ├── whisperV.txt
│   ├── 1_whisperV_inference.py
│   ├── 2_select_valid_clips.py
│   ├── 3_speaking_annotate.py
│   ├── requirement.txt
```
Follow the instructions below to prepare the environment:
```bash
cd data_collection_pipeline
conda create -n mit_data python=3.9
conda activate mit_data
pip3 install -r requirement.txt
```

Put your raw videos in a folder, e.g., ./videos_input. Then run WhisperV inference (multi-threading based on the number of available GPUs is supported):
```bash
python 1_whisperV_inference.py --raw_data_path ./videos_input --save_root_path seg_output_path
```
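Before launching on a large batch of videos, it can help to confirm how many GPUs the environment actually sees. A minimal check, assuming PyTorch is installed via requirement.txt:

```python
# Quick sanity check of the GPU setup (assumes PyTorch is pulled in by
# requirement.txt); the WhisperV inference step parallelizes over visible GPUs.
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Visible GPUs:   {torch.cuda.device_count()}")
```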
Then select and crop the valid clips. You can specify the number of speakers in each video with the num_people configuration parameter in 2_select_valid_clips.py:

```bash
python 2_select_valid_clips.py --seg_result_root_path seg_output_path --save_root_path select_video_clip_save_path
```
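At this point it can be useful to verify how many valid clips were produced. A minimal sketch, assuming the selected clips are stored as video files under the directory passed to --save_root_path (the actual layout and file extension may differ in your run):

```python
from pathlib import Path

# Hypothetical check on the output of 2_select_valid_clips.py; adjust the
# path and extension to match your actual --save_root_path layout.
clip_root = Path("select_video_clip_save_path")
clips = sorted(clip_root.rglob("*.mp4"))

print(f"Found {len(clips)} selected clips under {clip_root}")
for clip in clips[:5]:
    print("  ", clip)
```

Once the clips look right, run the speaking annotation: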
```bash
python 3_speaking_annotate.py --seg_result_root seg_output_path --vaild_video_root select_video_clip_save_path output_path --datasets_root your_dataset_save_root
```

Then please refer to Sapiens for pose estimation.
Multi-human talking video generation is an exciting yet challenging task, and we look forward to seeing your data contributions! For any requests, please email me at: zeyuzhu2077@outlook.com.
We also release the training code of our baseline model, CovOG. Follow the instructions below to prepare the environment:
```bash
cd baseline_train
conda create -n mit_train python=3.10
conda activate mit_train
pip3 install -r requirement.txt
```

Download the weights to the ./pretrained_weights directory:
```bash
python tools/download_weights.py
```

First, construct the data config under ./data_config as follows, filling in your own paths:
```json
[
    {
        "image_dir": "",
        "pose_dir": "",
        "speak_info_path": "",
        "audio_embedding_path": ""
    }
]
```
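The field names come from the template above; the snippet below is only an illustration of how one entry might be filled in and written out. The directory layout, file names, and extensions are hypothetical and depend entirely on how you prepared your data:

```python
import json
from pathlib import Path

# Hypothetical example entry; replace every path with the actual outputs of
# your own data preparation run (layout and extensions are illustrative).
entries = [
    {
        "image_dir": "/data/mit/clip_0001/frames",
        "pose_dir": "/data/mit/clip_0001/poses",
        "speak_info_path": "/data/mit/clip_0001/speak_info.json",
        "audio_embedding_path": "/data/mit/clip_0001/audio_emb.pt",
    },
]

out_path = Path("data_config/train_config.json")
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(entries, indent=4))
print(f"Wrote {len(entries)} entries to {out_path}")
```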
Then run the following commands to train the model for stage 1 and stage 2:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch train_stage_1.py --config configs/train/stage1_finetune.yaml
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch train_stage_2_CovOG.py --config configs/train/stage2_finetune_audio.yaml
```

We refer to the following codebases when building our pipeline:
- Sapiens: For body pose estimation annotation.
- WhisperV: For audio-visual speaker activity annotation.
- Hallo2: For the audio control module.
- Moore-AnimateAnyone: For baseline model training.
We sincerely thank the authors of these works for their valuable open-source contributions, which greatly facilitated our research and development.











