ComfyUI custom node implementation of VideoMaMa for video matting with mask conditioning.
Original Research: VideoMaMa: Mask-Guided Video Matting via Generative Prior
Original Repository: cvlab-kaist/VideoMaMa
This is a ComfyUI custom node implementation. All credit goes to the original authors for their excellent research and open-source contribution.
cd /path/to/ComfyUI/custom_nodes/
git clone https://github.com/okdalto/ComfyUI-VideoMaMa
cd ComfyUI-VideoMaMa
pip install -r requirements.txt
The Stable Video Diffusion base model will be automatically downloaded on first use if not present.
To download manually:
huggingface-cli download stabilityai/stable-video-diffusion-img2vid-xt \
--local-dir checkpoints/stabilityai/stable-video-diffusion-img2vid-xt
The VideoMaMa UNet checkpoint will be automatically downloaded on first use if not present.
To download manually:
huggingface-cli download SammyLim/VideoMaMa \
--local-dir checkpoints/VideoMaMa
# Install SAM2
git clone https://github.com/facebookresearch/sam2
cd sam2 && pip install -e .
# Download checkpoint
mkdir -p ../checkpoints/sam2
cd ../checkpoints/sam2
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt
# Download config
mkdir -p ../../configs/sam2.1
cd ../../configs/sam2.1
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_l.yaml
The nodes will appear under the VideoMaMa category.
Loads the inference pipeline with base SVD model and fine-tuned UNet.
Inputs:
- base_model_path: Path to base SVD model (default: checkpoints/stabilityai/stable-video-diffusion-img2vid-xt)
- unet_checkpoint_path: Path to fine-tuned UNet (default: checkpoints/VideoMaMa)
- precision: fp16 or bf16 (default: fp16)
Outputs:
VIDEOMAMA_PIPELINE: Pipeline object
Runs video matting inference with mask conditioning.
Inputs:
- pipeline: Pipeline from loader
- images: Input video frames [N, H, W, C]
- masks: Mask frames [N, H, W, C]
- seed: Random seed (default: 42)
- max_resolution: Longest axis resolution for processing (default: 1024, range: 256-2048). Aspect ratio is preserved and dimensions are aligned to multiples of 8.
- fps: Frames per second (default: 7)
- motion_bucket_id: Motion intensity (default: 127)
- noise_aug_strength: Noise augmentation (default: 0.0)
Outputs:
MASK: Generated mask frames [N, H, W] (at original input resolution)
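The documented defaults and the `max_resolution` range can be collected in one place, for example when scripting batch runs. The sketch below is illustrative only: `RunSettings` is a hypothetical name and is not part of this node's API. It validates only the range stated above, since no ranges are documented for the other inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSettings:
    """Hypothetical container mirroring the documented inputs and defaults."""

    seed: int = 42
    max_resolution: int = 1024
    fps: int = 7
    motion_bucket_id: int = 127
    noise_aug_strength: float = 0.0

    def __post_init__(self):
        # The documented range for max_resolution is 256-2048.
        if not 256 <= self.max_resolution <= 2048:
            raise ValueError(
                f"max_resolution must be in [256, 2048], got {self.max_resolution}"
            )


settings = RunSettings(max_resolution=768)
print(settings)
```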
Generates masks using SAM2 video tracking (requires SAM2 installation).
Inputs:
- images: Input video frames
- checkpoint_path: SAM2 checkpoint path
- config_file: SAM2 config path
- user_input: Point coordinates from SAM2 Point Selector UI
Outputs:
IMAGE: Generated mask frames
Click the user_input field to open the interactive point selector:
Controls:
- Left click: Add positive point (+) - marks foreground/object to segment
- Right click: Add negative point (-) - marks background to exclude
- Middle click / Ctrl+click: Remove existing point
- + / - keys: Switch between positive and negative mode
Usage Tips:
- Place positive points (green) on the object you want to extract
- Place negative points (red) on background areas to exclude
- Adding more points generally improves segmentation accuracy
- Click Save to confirm, Cancel to discard, Clear All to reset
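Point prompts for SAM-style models are usually passed as coordinates plus labels, where 1 marks foreground and 0 marks background. The sketch below shows that split. The `(x, y, is_positive)` click record is a hypothetical format chosen for illustration; how this node actually serializes `user_input` is not documented here.

```python
from typing import List, Tuple

# Hypothetical click record: (x, y, is_positive). Not the node's actual format.
Click = Tuple[float, float, bool]


def clicks_to_prompts(clicks: List[Click]) -> Tuple[List[List[float]], List[int]]:
    """Split clicks into point coordinates and labels (1 = foreground, 0 = background)."""
    coords = [[x, y] for x, y, _ in clicks]
    labels = [1 if positive else 0 for _, _, positive in clicks]
    return coords, labels


# One positive point on the subject, one negative point on the background.
clicks = [(320.0, 240.0, True), (50.0, 40.0, False)]
coords, labels = clicks_to_prompts(clicks)
print(coords, labels)
```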
Example workflow files are available in the examples/ folder. Import these directly into ComfyUI to get started quickly.
Basic Steps:
- Load video → Use VHS Video Loader or similar
- Generate masks → Use SAM2 node or load existing masks
- Load pipeline → VideoMaMa Pipeline Loader
- Run inference → VideoMaMa Run (connect pipeline, images, masks)
- Save output → VHS Video Combine or Preview Image
- Resolution: max_resolution controls the longest axis. Aspect ratio is preserved and output is resized back to the original input resolution. For example, a 1920x1080 input with max_resolution=1024 is processed at 1024x576.
- Motion Bucket: Lower (50-100) = subtle, Higher (150-200) = dynamic
- VRAM: Higher max_resolution requires more VRAM
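The processing resolution in the example above can be reproduced with a few lines of arithmetic. This is a sketch under assumptions, not the node's actual code: `processing_size` is a hypothetical helper, it rounds each side to the nearest multiple of 8, and it always scales the longest axis to `max_resolution`. The real implementation may round differently or skip upscaling.

```python
def processing_size(width, height, max_resolution=1024, multiple=8):
    """Scale so the longest axis equals max_resolution, keeping aspect ratio.

    Each side is rounded to the nearest multiple of `multiple` (an assumption).
    """
    longest = max(width, height)

    def align(side):
        scaled = side * max_resolution / longest
        return max(multiple, int(round(scaled / multiple)) * multiple)

    return align(width), align(height)


print(processing_size(1920, 1080, max_resolution=1024))
```

Because VRAM use grows with the processed frame size, running this helper for a few candidate values of `max_resolution` gives a quick sense of how much each setting shrinks the input.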
"SAM2 is not available"
git clone https://github.com/facebookresearch/sam2
cd sam2 && pip install -e .
"Failed to load pipeline"
- Check model paths are correct
- Ensure all model files downloaded
- Check VRAM availability
"Frame count mismatch"
- Ensure image and mask sequences have same number of frames
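A quick way to catch this before running inference is to compare the leading dimension of both sequences. The helper below is a hypothetical sketch, not part of the node. It accepts any shape tuple whose first entry is the frame count N, such as the `.shape` of an [N, H, W, C] tensor.

```python
def check_frame_counts(images_shape, masks_shape):
    """Return the shared frame count, or raise if the two sequences differ."""
    n_images, n_masks = images_shape[0], masks_shape[0]
    if n_images != n_masks:
        raise ValueError(
            f"Frame count mismatch: {n_images} images vs {n_masks} masks"
        )
    return n_images


# Shapes follow the documented [N, H, W, C] layout.
print(check_frame_counts((16, 576, 1024, 3), (16, 576, 1024, 3)))
```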
- Python 3.10+
- PyTorch 2.0+ with CUDA
- GPU with sufficient VRAM
- ComfyUI
See requirements.txt for full dependencies.
VideoMaMa/
├── __init__.py
├── nodes.py
├── pipeline_svd_mask.py
├── requirements.txt
├── checkpoints/
│ ├── stabilityai/stable-video-diffusion-img2vid-xt/
│ ├── VideoMaMa/unet/
│ └── sam2/ (optional)
└── configs/sam2.1/ (optional)
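When the pipeline fails to load, it can help to check which parts of this layout exist on disk. The sketch below lists missing entries relative to a root folder. `missing_paths` and the two path lists are hypothetical helpers taken from the tree above. Checkpoints are downloaded on first use, so they may legitimately be absent before the first run.

```python
from pathlib import Path

# Entries copied from the directory layout above.
EXPECTED = [
    "__init__.py",
    "nodes.py",
    "pipeline_svd_mask.py",
    "requirements.txt",
    "checkpoints/stabilityai/stable-video-diffusion-img2vid-xt",
    "checkpoints/VideoMaMa/unet",
]
OPTIONAL = ["checkpoints/sam2", "configs/sam2.1"]


def missing_paths(root, relative_paths):
    """Return the entries of relative_paths that do not exist under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


# Run from the custom node folder; prints whatever is not present yet.
print("missing:", missing_paths(".", EXPECTED))
print("missing (optional):", missing_paths(".", OPTIONAL))
```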
This ComfyUI implementation is based on the excellent work by the KAIST CVLab team:
VideoMaMa: Mask-Guided Video Matting via Generative Prior
- Paper: https://arxiv.org/abs/2601.14255
- Original Repository: https://github.com/cvlab-kaist/VideoMaMa
- Authors: KAIST Computer Vision Lab
We are grateful to the authors for:
- Their groundbreaking research in video matting
- Making their code and models publicly available
- Advancing the field of generative video processing
This custom node is simply a wrapper to make VideoMaMa accessible in ComfyUI. All model weights, training methods, and core algorithms are from the original research.
If you use VideoMaMa in your work, please cite the original paper:
@article{videomama2025,
title={VideoMaMa: Mask-Guided Video Matting via Generative Prior},
author={[Authors from KAIST CVLab]},
journal={arXiv preprint arXiv:2601.14255},
year={2026}
}
This project follows the original VideoMaMa license terms. Please refer to the original repository for licensing details.

