Custom implementation of "Packing Input Frame Context in Next-Frame Prediction Models for Video Generation".
Links: Paper.
RecurVid is a next-frame (or next-frame-section) prediction neural network designed for efficient and scalable image-to-video generation. It generates videos progressively and recursively, enabling the synthesis of long video sequences from a single image with minimal compute.
Key design highlights include:
Progressive Generation: Videos are constructed in sequential chunks (e.g., 1-second segments), allowing for dynamic control over video length and consistent quality throughout.
Context Compression: Input context is compressed into a fixed-length representation, making the computational workload invariant to the final video length.
Scalability on Modest Hardware: The model architecture supports efficient inference even with 13B parameter models on laptop GPUs, enabling large-scale generation without requiring high-end infrastructure.
Training Efficiency: RecurVid supports large batch sizes, similar to those used in image diffusion model training, improving throughput and enabling better generalization during training.
Note that, due to the inverted sampling, the ending actions will be generated before the starting actions. If the starting action is not yet in the video, just wait; it will be generated later (see the sketch below).
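To make the design above concrete, here is a toy Python sketch of the control flow. Everything in it (function names, sizes, the random "sampler") is a made-up placeholder rather than this repo's actual API; it only illustrates section-by-section generation, fixed-length context packing, and the inverted sampling order:

```python
import numpy as np

FRAMES_PER_SECTION, H, W = 30, 64, 64  # toy sizes

def pack_context(frames, max_frames=16):
    # Stand-in for frame-context packing: keep at most `max_frames`
    # conditioning frames so per-section compute stays constant.
    # The real model compresses older frames progressively instead
    # of simply dropping them.
    return np.stack(frames[-max_frames:])

def sample_section(context, prompt):
    # Stand-in for one diffusion sampling pass over one ~1-second
    # section; random noise plays the role of the denoised frames.
    return [np.random.rand(H, W, 3) for _ in range(FRAMES_PER_SECTION)]

def generate_video(start_image, prompt, num_sections):
    sections = [None] * num_sections
    conditioning = [start_image]  # frames available for conditioning
    # Inverted sampling: the section nearest the end of the video is
    # generated first, then we walk back toward the start image. This
    # is why ending actions appear before starting actions.
    for idx in reversed(range(num_sections)):
        sections[idx] = sample_section(pack_context(conditioning), prompt)
        conditioning.extend(sections[idx])
    return [frame for section in sections for frame in section]

video = generate_video(np.zeros((H, W, 3)), "The girl dances gracefully", 4)
print(len(video))  # 4 sections x 30 frames = 120 frames
```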
Video diffusion, but feels like image diffusion.
Requirements:
- An Nvidia GPU in the RTX 30XX, 40XX, or 50XX series that supports fp16 and bf16. GTX 10XX/20XX cards are not tested.
- Linux or Windows operating system.
- At least 6GB GPU memory.
To generate a 1-minute (60-second) video at 30 fps (1800 frames) with the 13B model, the minimum required GPU memory is 6GB.
As for speed: on my RTX 4090 desktop, it generates at 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (with teacache). On laptop GPUs like a 3070 Ti laptop or 3060 laptop, it is about 4x to 8x slower.
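For a rough end-to-end estimate from these numbers: the 60-second, 1800-frame video above takes about 1800 × 1.5 s ≈ 45 minutes on the 4090 with teacache, or 1800 × 2.5 s ≈ 75 minutes unoptimized.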
Linux:
We recommend having an independent Python 3.10 environment.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
To start the GUI, run:
python demo_gradio.py
Note that it supports command-line flags such as --share, --port, --server, and so on.
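For example, to bind a specific address and port and create a public share link (the flag values here are illustrative; check demo_gradio.py for the exact defaults):

python demo_gradio.py --server 0.0.0.0 --port 7860 --share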
The software supports PyTorch attention, xformers, flash-attn, and sage-attention. By default, it uses PyTorch attention. You can install the other attention kernels if you know how.
For example, to install sage-attention (linux):
pip install sageattention==1.0.6
However, we highly recommend trying without sage-attention first, since it influences results, though the influence is minimal.
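If you are unsure which of these optional kernels are present in your environment, a quick way to check is simply to try importing them (missing ones just mean the default PyTorch attention will be used):

```python
# Check which optional attention kernels are importable.
for module in ("xformers", "flash_attn", "sageattention"):
    try:
        __import__(module)
        print(f"{module}: available")
    except ImportError:
        print(f"{module}: not installed")
```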
Download this image:
Copy this prompt:
The man dances energetically, leaping mid-air with fluid arm swings and quick footwork.
Set like this:
(all default parameters, with teacache turned off)

The result will be:
0.mp4 (video may be compressed by GitHub)
Download this image:
Copy this prompt:
The girl dances gracefully, with clear movements, full of charm.
Set like this:
Turn off teacache:
You will get this:
2.mp4 (video may be compressed by GitHub)
Now turn on teacache:
About 30% of users will get this (the other 70% will get different random results depending on their hardware):
2teacache.mp4 (a typical worse result)
So you can see that teacache is not truly lossless and can sometimes influence the result a lot.
We recommend using teacache to try ideas and then using the full diffusion process to get high-quality results.
This recommendation also applies to sage-attention, bnb quantization, gguf, and so on.
Many people ask how to write better prompts.
Below is a ChatGPT template that I personally often use to get prompts:
You are an assistant that writes short, motion-focused prompts for animating images.
When the user sends an image, respond with a single, concise prompt describing visual motion (such as human activity, moving objects, or camera movements). Focus only on how the scene could come alive and become dynamic using brief phrases.
Larger and more dynamic motions (like dancing, jumping, running, etc.) are preferred over smaller or more subtle ones (like standing still, sitting, etc.).
Describe subject, then motion, then other things. For example: "The girl dances gracefully, with clear movements, full of charm."
If there is something that can dance (like a man, girl, robot, etc.), then prefer to describe it as dancing.
Stay in a loop: one image in, one motion prompt out. Do not explain, ask questions, or generate multiple options.
Paste these instructions into ChatGPT and then feed it an image to get a prompt like this:
The man dances powerfully, striking sharp poses and gliding smoothly across the reflective floor.
Usually this will give you a prompt that works well.
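If you want to automate this loop, below is a minimal sketch using the OpenAI Python SDK. The model name ("gpt-4o") and the use of the chat completions API with inline image input are my assumptions, not something this repo ships; adapt it to whatever assistant you actually use:

```python
import base64
from openai import OpenAI  # assumes the `openai` package is installed

SYSTEM_PROMPT = (
    "You are an assistant that writes short, motion-focused prompts for "
    "animating images. When the user sends an image, respond with a "
    "single, concise prompt describing visual motion."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def motion_prompt(image_path: str, model: str = "gpt-4o") -> str:
    # Encode the image as a data URL so it can be sent inline.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content.strip()

print(motion_prompt("input.png"))
```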
You can also write prompts yourself. Concise prompts are usually preferred, for example:
The girl dances gracefully, with clear movements, full of charm.
The man dances powerfully, with clear movements, full of energy.
and so on.