💜 Project Page | 📑 Paper | 🤗 Hugging Face | 📺 YouTube
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Welcome to the official repository for VINO.
VINO is a unified visual generation framework that breaks down the barriers between image generation, video generation, and editing. Powered by a robust Vision-Language Model (VLM) and Multi-Modal Diffusion Transformer (MMDiT) architecture, VINO seamlessly interprets interleaved multi-modal inputs to achieve superior consistency and controllability.
Key Features:
- 👍 All-in-One Unified Model: A single model weight supports all tasks including Text-to-Image, Text-to-Video, Image-to-Video, and extensive Image/Video Editing.
- 👍 OmniModal Context: Deeply integrated with the VLM to handle multi-image references, long-context instructions, and mixed-modal inputs for precise Instruction Following.
- 👍 Advanced Control: Supports sophisticated control and transfer capabilities, such as cloning motion, camera trajectories, or expressions from a reference.
Demo video: vino_demo.mp4
- [2026.02.09] 🚀 We have officially released the VINO inference code and full model weights!
- [2026.01.06] 📑 The VINO paper is now available on arXiv.
- [2025.12.09] 🌐 The project page is live.
We recommend using Anaconda to create an isolated Python environment:
# Clone the repository
git clone https://github.com/SOTAMak1r/VINO-code.git
cd VINO-code
# Create environment
conda create -n vino python=3.10
conda activate vino
# Install dependencies
pip install -r requirements.txt
pip install "flash_attn==2.7.4.post1" --no-build-isolation
# [Optional] faster inference 🧨
git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention
python setup.py install

VINO uses a unified weight design, so you do not need to download different models for different tasks.
| Models | Download Link | Description |
|---|---|---|
| VINO | 🤗 Huggingface | Contains MMDiT weights and learnable tokens |
| HunyuanVideo | 🤗 Huggingface | Contains VAE weights |
| Qwen3VL | 🤗 Huggingface | Contains VLM weights |
We recommend using the script for automatic downloading:
python download.py --ak your_own_huggingface_ak
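If the script does not work in your environment, you can also fetch the checkpoints manually with huggingface-cli. This is only a sketch; the repo ID below is a placeholder, so substitute the actual repositories linked in the table above:

# Manual alternative (placeholder repo ID; see the download links in the table above)
huggingface-cli download <vino_repo_id> --local-dir ./ckpts/VINO --token your_own_huggingface_ak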
VINO supports various generation modes. We have integrated all functionalities into inference.py, allowing you to switch tasks easily by modifying parameters.

| Category | Command Flag | Task Name | Capability |
|---|---|---|---|
| Generation | t2i | Text → Image | Image synthesis from natural language |
| | t2v | Text → Video | Text-to-video generation |
| | i2v | Image → Video | Animate a single image with motion & dynamics |
| | ti2v | Multi-Image → Video | Video generation conditioned on multiple reference images |
| Editing | ti2i | Text-Instructed Image Edit | Instruction-based image editing |
| | ti2i_baseimg | Image-Guided Image Edit | Edit an image with an explicit reference image |
| | tv2v | Text-Instructed Video Edit | Instruction-based video editing |
| | tiv2v | Image-Guided Video Edit | Edit a video using an additional image reference |
| Control / Transfer | tiv2v_clone | Element Cloning | Clone motion, camera, or expression from a reference |
- One unified interface for generation, editing, and control
- Supports single-image, multi-image, and video conditioning
- Instruction-driven, reference-driven, or hybrid control
- Easily extensible to new tasks
In short: if it’s a visual generation or editing task, VINO can handle it.
Generate high-fidelity single-frame images.
torchrun --nproc_per_node=1 inference.py \
--json_path ./assets/test_data/tasks/t2i.json \
--output_height 640 --output_width 640 \
--output_path output/t2i --seed 666

Generate coherent video clips. We utilize Sequence Parallelism to support high-resolution generation.
torchrun --nproc_per_node=8 inference.py \
--json_path ./assets/test_data/tasks/t2v.json \
--output_height 480 --output_width 848 --output_num_frames 85 \
--output_path output/t2v --seed 666
Animate static images. Supports both caption-style descriptions and instruction-based control.
torchrun --nproc_per_node=8 inference.py \
--json_path ./assets/test_data/tasks/i2v.json \
--output_height 480 --output_width 848 --output_num_frames 85 \
--output_path output/i2v --seed 666 \
--negative_prompt_video ''

Perform image editing tasks using two distinct conditioning methods: a pure text instruction, or an instruction combined with a reference image.
Use --guidance_scale_image to control the strength of the visual reference.
# instruction
torchrun --nproc_per_node=8 inference.py \
--json_path ./assets/test_data/tasks/ti2i.json \
--output_path output/ti2i --seed 666
# instruction + reference image
torchrun --nproc_per_node=8 inference.py \
--json_path ./assets/test_data/tasks/ti2i_baseimg.json \
--guidance_scale_image 3.0 \
--output_path output/ti2i_baseimg --seed 666

You can input multiple reference images, and the model will understand the relationships between them to generate a video.
torchrun --nproc_per_node=8 inference.py \
--json_path ./assets/test_data/tasks/ti2v.json \
--guidance_scale_image 3.0 \
--output_height 480 --output_width 848 --output_num_frames 81 \
--output_path output/ti2v --seed 666 \
--negative_prompt_video ''

Supports video editing using only text instructions (tv2v) or combined with reference images (tiv2v).
The script automatically detects the aspect ratio of the input video.
# Instruction-based Editing
torchrun --nproc_per_node=8 inference.py \
--json_path ./assets/test_data/tasks/tv2v.json \
--output_num_frames 81 \
--output_path output/tv2v --seed 666 \
--negative_prompt_video ''
# Image-based Editing
torchrun --nproc_per_node=8 inference.py \
--json_path ./assets/test_data/tasks/tiv2v.json \
--output_num_frames 81 \
--guidance_scale_image 3.0 \
--output_path output/tiv2v --seed 666

VINO allows for unique "Cloning" effects, such as keeping the scene static while only moving the camera (Bullet Time effect).
💡 Tip: To achieve the Bullet Time effect (freezing the scene while moving the camera), we recommend using a specific negative prompt to suppress object dynamics:
negative_prompt: "moving objects, dynamic elements, animation, character movement, walking, running, talking, ... static camera, still image"
torchrun --nproc_per_node=8 inference.py \
--json_path ./assets/test_data/tasks/tiv2v_clone.json \
--output_num_frames 81 \
--output_path output/tiv2v_clone --seed 666 \
--negative_prompt_video ''
torchrun --nproc_per_node=8 inference.py \
--json_path ./assets/test_data/tasks/tiv2v_camclone.json \
--output_num_frames 81 \
--output_path output/tiv2v_camclone_neg --seed 666 \
--negative_prompt_video 'moving objects, dynamic elements, animation, character movement, walking, running, talking, blinking, living, breathing, flowing water, wind, distortion, morphing, mutating, shifting, jittery, flickering, frame interpolation artifacts, static camera, still image, frozen camera, low resolution, blurry, watermark, text, bad composition'
Leveraging the VLM's powerful comprehension capabilities, you can first caption the input and then generate, creating an "Understanding-then-Generation" pipeline.
# Generate refined captions using VLM
python understand.py \
--json_path ./assets/test_data/tasks/und_before.json \
--output_path ./assets/test_data/tasks/und_after.json

💡 Tip: You can customize the understanding task by modifying the prompt templates inside understand.py.
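For example, a full understanding-then-generation pass could chain the two scripts. Whether und_after.json can be fed straight to inference.py depends on the task schema of your input file, so treat this as a sketch rather than a confirmed recipe:

# Sketch: caption first with the VLM, then generate from the refined captions
python understand.py \
    --json_path ./assets/test_data/tasks/und_before.json \
    --output_path ./assets/test_data/tasks/und_after.json
torchrun --nproc_per_node=8 inference.py \
    --json_path ./assets/test_data/tasks/und_after.json \
    --output_path output/understand_then_generate --seed 666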
Our inference.py provides extensive parameter configurations. You can check the defaults in VINOInferenceConfig:
- --guidance_scale: Text guidance scale (CFG).
- --guidance_scale_image: Image guidance scale for image-conditioned tasks. Recommended range: 1.0-3.0.
- --timestep_shift: Timestep shift adjustment for the scheduler.
- --output_num_frames: Number of frames to generate.
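As an illustration, these knobs can be combined with any of the task commands above. The values below are only suggestions within the documented ranges, not tuned defaults; check VINOInferenceConfig for the actual defaults:

# Illustrative only: an image-guided video edit with explicit guidance and scheduler settings
torchrun --nproc_per_node=8 inference.py \
    --json_path ./assets/test_data/tasks/tiv2v.json \
    --guidance_scale 5.0 \
    --guidance_scale_image 2.0 \
    --timestep_shift 5.0 \
    --output_num_frames 81 \
    --output_path output/tiv2v_tuned --seed 666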
- Codebase: Apache License 2.0
- Model Weights: CC BY-NC 4.0 (Non-Commercial Use Only)
If you find VINO useful for your research, please consider citing our paper:
@article{chen2026vino,
title={VINO: A Unified Visual Generator with Interleaved OmniModal Context},
author={Chen, Junyi and He, Tong and Fu, Zhoujie and Wan, Pengfei and Gai, Kun and Ye, Weicai},
journal={arXiv preprint arXiv:2601.02358},
year={2026}
}