AvatarForcing is a one-step streaming diffusion framework for talking avatars. It generates video from a single reference image, speech audio, and an optional text prompt.

- Identity is anchored by the input image (used as the first frame).
- Audio conditioning uses a streaming speech encoder (Wav2Vec2) to produce per-frame embeddings.
- A fixed local-future sliding window with heterogeneous noise levels is jointly denoised, emitting one clean block per step at constant per-step cost.

Key features:

- **One-step sliding-window denoising**: local-future look-ahead with heterogeneous noise; one clean block emitted per step.
- **Dual-anchor temporal forcing**: a style anchor (RoPE re-indexing) plus a temporal anchor (reusing recent clean blocks), with anchor-audio zero-padding.
- **Two-stage streaming distillation**: offline ODE backfill followed by distribution matching.

Specifications:

- Video: 25 FPS, 832×480
- Blocks/windows: B = 4 frames per block, window length L = 4, N = 1 denoising pass per step
- Speed: the paper reports 34 ms/frame with a 1.3B student model (hardware-dependent)
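The windowed schedule above can be illustrated with a toy scheduler. This is a conceptual sketch only, not the repo's implementation: block ids and integer noise levels stand in for latent blocks and real noise sigmas.

```python
# Toy sketch of one-step local-future sliding-window denoising.
# The window holds L blocks at heterogeneous noise levels (the oldest
# block is closest to clean). Each step runs ONE joint pass over the
# window, emits the now-clean oldest block, and appends a fresh
# fully-noisy block -- so per-step cost is constant and one clean
# block streams out per step.

L = 4  # window length in blocks (matches L = 4 above)

def denoise_pass(window):
    """One joint forward pass: every block steps down one noise level."""
    return [(block_id, level - 1) for block_id, level in window]

def stream(num_blocks):
    """Return block ids in order, one clean block per denoising pass."""
    # Heterogeneous initialization: oldest block at level 1, newest at L.
    window = [(block_id, block_id + 1) for block_id in range(L)]
    next_id, emitted = L, []
    while len(emitted) < num_blocks:
        window = denoise_pass(window)   # constant cost: always L blocks
        head, *rest = window
        assert head[1] == 0             # oldest block is now fully clean
        emitted.append(head[0])         # emit one clean block per step
        rest.append((next_id, L))       # append a fresh fully-noisy block
        next_id += 1
        window = rest
    return emitted

print(stream(6))  # blocks come out in order: [0, 1, 2, 3, 4, 5]
```

Note how the window length never changes, which is what keeps the per-step cost constant regardless of how long the output video runs.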
## Installation

```bash
conda env create -f environment.yml
conda activate avatarforcing
```

FFmpeg is used to mux the input audio into the output mp4:

- Ubuntu/Debian:

  ```bash
  sudo apt-get update && sudo apt-get install -y ffmpeg
  ```
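For reference, the kind of mux FFmpeg performs can be reproduced by hand. This is a generic example, not the script's exact command; the filenames and helper names are placeholders:

```python
import subprocess

def build_mux_cmd(video_path, audio_path, out_path):
    """Build an ffmpeg command that copies the video stream and
    encodes the audio to AAC so both land in one mp4."""
    return ["ffmpeg", "-y",
            "-i", video_path,   # silent video from inference
            "-i", audio_path,   # driving speech audio
            "-c:v", "copy",     # keep video as-is (no re-encode)
            "-c:a", "aac",      # mp4-friendly audio codec
            "-shortest",        # stop at the shorter stream
            out_path]

def mux_audio(video_path, audio_path, out_path):
    """Run the mux; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mux_cmd(video_path, audio_path, out_path),
                   check=True)
```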
## Model Download

| Model | Download Link | Notes |
|---|---|---|
| Wan2.1-T2V-1.3B | [🤗 Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) | Base model (student) |
| AvatarForcing | [🤗 Huggingface](https://huggingface.co/lycui/AvatarForcing) | ODE init + DMD weights |
| Wav2Vec2 | [🤗 Huggingface](https://huggingface.co/facebook/wav2vec2-base-960h) | Audio encoder |
Download the models using `huggingface-cli`:

```bash
pip install "huggingface_hub[cli]"

mkdir -p wan_models checkpoints

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
  --local-dir-use-symlinks False \
  --local-dir ./wan_models/Wan-T2V-1.3

huggingface-cli download facebook/wav2vec2-base-960h \
  --local-dir-use-symlinks False \
  --local-dir ./wan_models/wav2vec2-base-960h

huggingface-cli download lycui/AvatarForcing \
  --local-dir-use-symlinks False \
  --local-dir ./checkpoints
```

## Inference

Run inference:
```bash
python3 inference.py \
  --config_path configs/avatarforcing.yaml \
  --output_folder outputs \
  --checkpoint_path checkpoints/model.pt \
  --data_path <your_data_path> \
  --num_output_frames 225 \
  --i2v
```

Notes:

- The script currently supports I2V only (requires `--i2v`).
- Choose `--num_output_frames` so that `(--num_output_frames - 1)` is divisible by `B = 4` (e.g., `225 = 4 * 56 + 1`).
- If you downloaded weights to `checkpoints/`, set `--checkpoint_path` to the actual `.pt` file you want to load (e.g., `checkpoints/model.pt`).
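The frame-count rule can be checked with a few lines of Python. This is a convenience sketch; `B` comes from the block size above, and the helper name is ours, not the repo's:

```python
B = 4  # frames per block, as in the specs above

def valid_num_output_frames(n: int) -> bool:
    """The first frame is the reference image; the remaining n - 1
    frames must fill whole B-frame blocks."""
    return n >= 1 and (n - 1) % B == 0

print(valid_num_output_frames(225))  # True: 225 = 4 * 56 + 1
print(valid_num_output_frames(224))  # False: 223 is not a multiple of 4
```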
## Citation

```bibtex
@misc{cui2026avatarforcingonestepstreamingtalking,
  title={AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising},
  author={Liyuan Cui and Wentao Hu and Wenyuan Zhang and Zesong Yang and Fan Shi and Xiaoqiang Liu},
  year={2026},
  eprint={2603.14331},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.14331},
}
```

## Acknowledgements

We thank the authors of Wan2.1, OmniAvatar, Wan-S2V, and Self-Forcing for releasing their models and code, which provided valuable references and support for this work. We appreciate their contributions to the open-source community.

## License

Apache-2.0