RecurVid - Fast recursive image-to-video generation for short videos

Custom implementation of "Packing Input Frame Context in Next-Frame Prediction Models for Video Generation".

Links: Paper.

RecurVid is a next-frame (or next-frame-section) prediction neural network designed for efficient and scalable image-to-video generation. It generates videos progressively and recursively, enabling the synthesis of long video sequences from a single image with minimal compute.

Key design highlights include:

Progressive Generation: Videos are constructed in sequential chunks (e.g., 1-second segments), allowing for dynamic control over video length and consistent quality throughout.

Context Compression: Input context is compressed into a fixed-length representation, making the computational workload invariant to the final video length.

Scalability on Modest Hardware: The model architecture supports efficient inference even with 13B parameter models on laptop GPUs, enabling large-scale generation without requiring high-end infrastructure.

Training Efficiency: RecurVid supports large batch sizes, similar to those used in image diffusion model training, improving throughput and enabling better generalization during training.
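The first two highlights above can be sketched as a toy generation loop: each new chunk is conditioned on a fixed-length compressed context, so per-chunk work does not grow with video length. The chunk size, context length, and `compress_context` scheme below are illustrative stand-ins, not the model's actual frame-packing.

```python
CONTEXT_LEN = 4    # fixed number of context frames fed to the model (illustrative)
CHUNK_FRAMES = 30  # one 1-second chunk at 30 fps

def compress_context(history):
    # Pack the frame history into a fixed-length representation.
    # The real model packs frames at varying rates by importance; plain
    # truncation to the most recent frames is only an illustration.
    return history[-CONTEXT_LEN:]

def generate_chunk(context, chunk_id):
    # Stand-in for the diffusion model: emits placeholder frame labels.
    return [f"chunk{chunk_id}_frame{i}" for i in range(CHUNK_FRAMES)]

def generate_video(first_frame, num_chunks):
    frames = [first_frame]
    for chunk_id in range(num_chunks):
        context = compress_context(frames)  # constant size at any video length
        frames.extend(generate_chunk(context, chunk_id))
    return frames

video = generate_video("input_image", num_chunks=5)
print(len(video))  # 1 input frame + 5 chunks of 30 frames = 151
```

Because `compress_context` always returns the same number of frames, the conditioning cost of chunk 100 is the same as that of chunk 1 — this is the length-invariance claimed above.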

Note that the ending actions will be generated before the starting actions due to the inverted sampling. If the starting action is not in the video, you just need to wait, and it will be generated later.
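A minimal sketch of this inverted order, assuming the video is split into numbered sections with section 0 at the start: sections are sampled ending-first and prepended, so the starting action appears only in the last sampling steps, yet playback order is correct in the end.

```python
SECTION_FRAMES = 3  # frames per section (illustrative)

def sample_section(section_id):
    # Stand-in for sampling one temporal section of the video.
    return [f"section{section_id}_frame{i}" for i in range(SECTION_FRAMES)]

def inverted_sampling(num_sections):
    video = []
    # The ending section (highest id) is sampled first; each newly
    # sampled section is prepended, restoring playback order.
    for section_id in reversed(range(num_sections)):
        video = sample_section(section_id) + video
    return video

video = inverted_sampling(3)
print(video[0], video[-1])  # section0_frame0 section2_frame2
```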

Video diffusion, but feels like image diffusion.

Requirements

  • Nvidia RTX 30XX, 40XX, or 50XX series GPU with fp16 and bf16 support. GTX 10XX/20XX cards are untested.
  • Linux or Windows operating system.
  • At least 6GB GPU memory.

To generate a 1-minute video (60 seconds, 1800 frames at 30fps) with the 13B model, the minimum required GPU memory is 6GB.

Regarding speed: my RTX 4090 desktop generates at about 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (with teacache). On laptop GPUs like a 3070 Ti or 3060 laptop, generation is roughly 4x to 8x slower.
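At those per-frame rates, end-to-end time scales linearly with frame count. A quick back-of-the-envelope helper, using the figures above:

```python
def generation_minutes(num_frames, seconds_per_frame):
    # Total wall-clock time in minutes at a fixed per-frame cost.
    return num_frames * seconds_per_frame / 60

# 1-minute, 30 fps video (1800 frames) on an RTX 4090:
print(generation_minutes(1800, 2.5))  # 75.0 minutes, unoptimized
print(generation_minutes(1800, 1.5))  # 45.0 minutes, with teacache
```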

Installation

Linux:

We recommend an independent Python 3.10 environment.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt

To start the GUI, run:

python demo_gradio.py

Note that the script supports flags such as --share, --port, --server, and so on.

The software supports PyTorch attention, xformers, flash-attn, and sage-attention. By default, it uses PyTorch attention. You can install the other attention kernels if you know how.

For example, to install sage-attention (linux):

pip install sageattention==1.0.6

However, we highly recommend first trying without sage-attention, since it influences the results, though the influence is minimal.
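If you want to check which optional kernels are present before launching, a small availability probe might look like the sketch below. The module names are assumptions based on the usual package names, and the preference order is illustrative.

```python
import importlib.util

def pick_attention_backend():
    # Probe optional attention kernels in a rough preference order and
    # fall back to PyTorch's built-in attention when none are installed.
    for backend, module in [("sage-attention", "sageattention"),
                            ("flash-attn", "flash_attn"),
                            ("xformers", "xformers")]:
        if importlib.util.find_spec(module) is not None:
            return backend
    return "pytorch"

print(pick_attention_backend())
```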

Image-to-5-seconds

Download this image:

Copy this prompt:

The man dances energetically, leaping mid-air with fluid arm swings and quick footwork.

Set like this:

(all default parameters, with teacache turned off) image

The result will be:

0.mp4
Video may be compressed by GitHub

Know the influence of TeaCache and Quantization

Download this image:

Copy this prompt:

The girl dances gracefully, with clear movements, full of charm.

Set like this:

image

Turn off teacache:

image

You will get this:

2.mp4
Video may be compressed by GitHub

Now turn on teacache:

image

About 30% of users will get this (the other 70% will get different random results depending on their hardware):

2teacache.mp4
A typical worse result.

So you can see that teacache is not truly lossless and can sometimes influence the result significantly.

We recommend using teacache to try ideas and then using the full diffusion process to get high-quality results.

This recommendation also applies to sage-attention, bnb quantization, gguf, and so on.

Prompting Guideline

Many people ask how to write better prompts.

Below is a ChatGPT template that I often use to generate prompts:

You are an assistant that writes short, motion-focused prompts for animating images.

When the user sends an image, respond with a single, concise prompt describing visual motion (such as human activity, moving objects, or camera movements). Focus only on how the scene could come alive and become dynamic using brief phrases.

Larger and more dynamic motions (like dancing, jumping, running, etc.) are preferred over smaller or more subtle ones (like standing still, sitting, etc.).

Describe subject, then motion, then other things. For example: "The girl dances gracefully, with clear movements, full of charm."

If there is something that can dance (like a man, girl, robot, etc.), then prefer to describe it as dancing.

Stay in a loop: one image in, one motion prompt out. Do not explain, ask questions, or generate multiple options.

Paste this instruction into ChatGPT, then feed it an image to get a prompt like this:

The man dances powerfully, striking sharp poses and gliding smoothly across the reflective floor.

Usually this will give you a prompt that works well.

You can also write prompts yourself. Concise prompts are usually preferred, for example:

The girl dances gracefully, with clear movements, full of charm.

The man dances powerfully, with clear movements, full of energy.

and so on.

About

RecurVid is a fast, recursive image-to-video generation model that creates high-quality short videos through next-frame prediction and progressive synthesis.
