HKU-MMLab/EVATok

EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

arXiv | Project Page | Hugging Face

Tianwei Xiong1* · Jun Hao Liew2 · Zilong Huang2 · Zhijie Lin2 · Jiashi Feng2 · Xihui Liu1✉
1The University of Hong Kong   2ByteDance Seed
*Work partly done as an intern at ByteDance Seed. ✉ Corresponding author
CVPR 2026

🔈News

  • [2026/03/17] The research paper, code, and models of EVATok are released!

Introduction

We introduce EVATok, a framework that adaptively tokenizes videos into quality-cost-optimal token sequences. We show that content-adaptive video tokenization can surpass fixed-length baselines, achieving superior overall performance in reconstruction and downstream AR generation with fewer tokens.

🚀 In this codebase, we release

  • Routers that predict the optimal token assignment for each video clip.
  • Adaptive-length video tokenizers and AR generative models.
  • Implementations of advanced tokenizer training improvements that leverage video semantic encoders.
  • Comprehensive scripts for convenient exploration of adaptive tokenizer training and evaluation.

Environment setup

conda create --name evatok python=3.10
conda activate evatok
# works for CUDA 12.2  
bash scripts/env_install.sh

Download Checkpoints

Tokenizers and Routers

All the video tokenizers and routers below operate on 16x128x128 video clips.

| Tokenizer | Train Set | Config | Param. (Tokenizer) | Router Config | Router Ckpt (link) | #rTokens | rFVD | LPIPS | Tokenizer Ckpt (link) |
|---|---|---|---|---|---|---|---|---|---|
| S-B | WebVid-10M | VQ_SB_final_with_router_w_lpips_1.2_3fps_webvid.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 721 | 7.3 | 0.1063 | VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt |
| S-B | UCF-101 & K600 | VQ_SB_final_with_router_w_lpips_1.2.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 774 | 9.7 | 0.1140 | VQ_SB_final_with_router_ucf_k600_1000k.pt |
| S-B (Proxy) | WebVid-10M | VQ_SB_proxy_3fps.yaml | 145M | - | - | - | - | - | VQ_SB_proxy_3fps_webvid_400k.pt |
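The scripts below assume the checkpoints sit under a local ckpts/ directory. As a quick pre-flight check (stdlib only; file names are taken from the tables in this README, adjust the paths if you store checkpoints elsewhere):

```python
from pathlib import Path

# Pre-flight check: confirm the downloaded checkpoints are where the
# example scripts in this README expect them.
expected = [
    "ckpts/VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt",
    "ckpts/VQ_SB_final_with_router_ucf_k600_1000k.pt",
    "ckpts/router_w_lpips_1.2_50k.pt",
]
missing = [p for p in expected if not Path(p).is_file()]
print("missing checkpoints:", missing)
```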

Downloading AR Models

Note that AR model inference does not use routers.

For UCF-101 Class-to-video Generation

| AR Model | Param. (AR) | gFVD | #gTokens | AR Model Download Link | Tok. Ckpt | Tok. Config | Router Config | Router Ckpt |
|---|---|---|---|---|---|---|---|---|
| GPT-L-plus | 633M | 48 | 756 | GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
| GPT-L | 327M | 62 | 756 | GPT_L_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |

If you do not have access to V-JEPA2, you can use an alternative router that does not depend on it. Config: router_w_lpips_1.2_raw.yaml, ckpt: router_w_lpips_l1.2_raw_50k.pt. In our reconstruction tests, there is no obvious gap between this router and the V-JEPA2-based one.

For Kinetics-600 Frame Prediction

The 5 conditioning frames are encoded into 512+128=640 tokens, which serve as the conditioning tokens.
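As a sanity check, this 640-token conditioning budget matches the prefix split passed via --fixed-prefix in the frame-prediction script later in this README:

```python
# The --fixed-prefix value "512,128" is a comma-separated prefix split;
# its parts sum to the 640 conditioning tokens for the 5 condition frames.
prefix = "512,128"
parts = [int(p) for p in prefix.split(",")]
print(parts, "->", sum(parts))  # → [512, 128] -> 640
```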

| AR Model | Param. (AR) | gFVD | #gTokens | AR Model Download Link | Tok. Ckpt | Tok. Config | Router Config | Router Ckpt |
|---|---|---|---|---|---|---|---|---|
| GPT-L-plus | 633M | 4.0 | 862 | GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
| GPT-L | 327M | 4.6 | 862 | GPT_L_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |

Inference and Evaluation

We provide easy-to-run qualitative evaluation scripts below. More quantitative evaluation scripts can be found in Detailed_instructions.

Tokenizer Reconstruction

To perform tokenizer reconstruction, first set the required environment variables, then run the reconstruction script.

For environment setup, create scripts/set_env_vars.sh from scripts/set_env_vars_template.sh, filling it in according to the comments inside. For this reconstruction task, you only need to set PROJECT_ROOT and PYPATH.
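As a reference, a minimal set_env_vars.sh for this task might look like the following (both values are placeholders; PYPATH is the Python interpreter the scripts invoke):

```shell
# Minimal sketch of scripts/set_env_vars.sh for the reconstruction task.
# Both values are placeholders -- adapt them to your machine.
export PROJECT_ROOT=/path/to/EVATok   # root of this repository checkout
export PYPATH=python                  # Python interpreter of the evatok conda env
```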

The script below reconstructs the videos in A_DIR_OF_VIDEOS using token assignments adaptively predicted by the router. The results, including the reconstructed videos and the original clips (automatically cropped to 16x128x128), are saved to OUT_DIR.

. scripts/set_env_vars.sh
# tokenizer: trained on WebVid-10M
export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2_3fps_webvid.yaml
export VQ_CKPT=ckpts/VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt

# tokenizer: trained on UCF and K600
# export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2.yaml
# export VQ_CKPT=ckpts/VQ_SB_final_with_router_ucf_k600_1000k.pt

# router: dependent on V-JEPA2
export ROUTER_CONFIG="configs/router/router_w_lpips_1.2.yaml"
export ROUTER_CKPT="ckpts/router_w_lpips_1.2_50k.pt"

# router: not dependent on V-JEPA2
# export ROUTER_CONFIG="configs/router/router_w_lpips_1.2_raw.yaml"
# export ROUTER_CKPT=ckpts/router_w_lpips_l1.2_raw_50k.pt

$PYPATH tokenizer/router/reconstruction_qual_with_router.py \
--vid_path A_DIR_OF_VIDEOS \
--save_dir OUT_DIR \
--router_config ${ROUTER_CONFIG} \
--router_ckpt ${ROUTER_CKPT} \
--tok_config ${TOK_CONFIG} \
--vq_ckpt ${VQ_CKPT}

For the quantitative reconstruction evaluation, see Detailed_instructions.

Qualitative UCF-101 Class-to-video Generation

. scripts/set_env_vars.sh
export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2.yaml
export VQ_CKPT=ckpts/VQ_SB_final_with_router_ucf_k600_1000k.pt
export GPT_MODEL=GPT-LP
export CFG=2.25
export GPT_CKPT="ckpts/GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt"

export OUT_DIR=results/c2v_adaptive_qual_eval/GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000_cfg${CFG}
export PRECISION="none"
export BSZ=8

# Try these classes by setting the --class-idx
# "ApplyLipstick": 1,
# "ApplyEyeMakeup": 0,
# "PushUps": 71,
# "WallPushups": 98,
# "JumpRope": 46,
# "BenchPress": 9,
# "PlayingGuitar": 62,
# "PlayingViolin": 66,
# "PlayingCello": 58,
# "PlayingSitar": 64,
# "SoccerJuggling": 83,
# "PullUps": 69,
# "Typing": 94,
# "Mixing": 53,
# "TableTennisShot": 89,
# "Rafting": 72,
# "WritingOnBoard": 99,
# "BodyWeightSquats": 14,
# "CuttingInKitchen": 24,

bash scripts/test/sample_c2v_visualization.sh \
--tok-config ${TOK_CONFIG} \
--vq-ckpt ${VQ_CKPT} \
--gpt-model ${GPT_MODEL} \
--gpt-ckpt ${GPT_CKPT} \
--sample-dir ${OUT_DIR} \
--cfg-scale ${CFG} \
--per-proc-batch-size ${BSZ} \
--qual-num 20 \
--class-idx 1,0,71,98,14 \
--check-special-token-mask
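The indices passed to --class-idx follow the name-to-index mapping commented above. A small hypothetical helper (not part of the codebase) to assemble the argument from class names:

```python
# Hypothetical helper: build the --class-idx argument from UCF-101 class
# names, using the name -> index pairs listed in the comments above.
UCF_CLASS_IDX = {
    "ApplyEyeMakeup": 0, "ApplyLipstick": 1, "BenchPress": 9,
    "BodyWeightSquats": 14, "CuttingInKitchen": 24, "JumpRope": 46,
    "Mixing": 53, "PlayingCello": 58, "PlayingGuitar": 62,
    "PlayingSitar": 64, "PlayingViolin": 66, "PullUps": 69,
    "PushUps": 71, "Rafting": 72, "SoccerJuggling": 83,
    "TableTennisShot": 89, "Typing": 94, "WallPushups": 98,
    "WritingOnBoard": 99,
}

def class_idx_arg(names):
    """Return a comma-separated index string suitable for --class-idx."""
    return ",".join(str(UCF_CLASS_IDX[n]) for n in names)

print(class_idx_arg(["ApplyLipstick", "ApplyEyeMakeup", "PushUps",
                     "WallPushups", "BodyWeightSquats"]))
# → 1,0,71,98,14  (the value used in the example above)
```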

Qualitative Frame Prediction

K600 Frame Prediction

Make sure the K600 dataset is prepared and that K600_VAL_FILE and K600_ROOT are set in the scripts/set_env_vars.sh script.

. scripts/set_env_vars.sh
export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2.yaml
export VQ_CKPT=ckpts/VQ_SB_final_with_router_ucf_k600_1000k.pt

export GPT_MODEL=GPT-LP
export GPT_CKPT="ckpts/GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt"
export BSZ=8
export OUT_DIR=results/fp_adaptive_qual_eval/GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75

bash scripts/test/frame_pred_visualization.sh \
--tok-config ${TOK_CONFIG} \
--vq-ckpt ${VQ_CKPT} \
--gpt-model ${GPT_MODEL} \
--gpt-ckpt ${GPT_CKPT} \
--dataset k600_val \
--sample-dir ${OUT_DIR} \
--per-proc-batch-size ${BSZ} \
--orig-aspect-ratio \
--check-special-token-mask \
--sample-num 20 \
--fixed-prefix "512,128" \
--condition-mode padding

Detailed Evaluation and Training Scripts

See Detailed_instructions

Acknowledgements

License

This project is licensed under the Apache 2.0 license - see the LICENSE file for details.

Citation

@article{xiong2025evatok,
  title={EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation},
  author={Xiong, Tianwei and Liew, Jun Hao and Huang, Zilong and Lin, Zhijie and Feng, Jiashi and Liu, Xihui},
  journal={arXiv preprint arXiv:2603.12267},
  year={2026}
}
