¹The University of Hong Kong ²ByteDance Seed
*Work partly done as an intern at ByteDance Seed. ✉ Corresponding author
CVPR 2026
- [2026/03/17] Research paper, code, and models are released for EVATok!
We introduce EVATok, a framework that adaptively tokenizes videos into quality-cost-optimal sequences. We show that content-adaptive video tokenization can surpass fixed-length baselines, achieving superior overall performance in reconstruction and downstream AR generation with fewer tokens.
🚀 In this codebase, we release:
- Routers that predict the optimal token assignment for each video clip.
- Adaptive-length video tokenizers and AR generative models.
- Implementations of improved video tokenizer training using video semantic encoders.
- Comprehensive implementations for conveniently exploring adaptive tokenizer training and evaluation.
```bash
conda create --name evatok python=3.10
conda activate evatok
# works for CUDA 12.2
bash scripts/env_install.sh
```

All the video tokenizers and routers are for 16x128x128 videos.
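After installation, a quick sanity check can confirm that the environment sees your GPU. This is a minimal sketch; it assumes PyTorch is installed by env_install.sh:

```bash
# Verify the environment: print the PyTorch version and whether CUDA is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```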
| Tokenizer | Train Set | Config | Param. (Tokenizer) | Router Config | Router ckpt (link) | #rTokens | rFVD | LPIPS | Tokenizer ckpt (link) |
|---|---|---|---|---|---|---|---|---|---|
| S-B | WebVid-10M | VQ_SB_final_with_router_w_lpips_1.2_3fps_webvid.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 721 | 7.3 | 0.1063 | VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt |
| S-B | UCF-101 & K600 | VQ_SB_final_with_router_w_lpips_1.2.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 774 | 9.7 | 0.1140 | VQ_SB_final_with_router_ucf_k600_1000k.pt |
| S-B (Proxy) | WebVid-10M | VQ_SB_proxy_3fps.yaml | 145M | - | - | - | - | - | VQ_SB_proxy_3fps_webvid_400k.pt |
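The scripts below reference checkpoints under `ckpts/` and configs under `configs/` relative to the repository root. As a minimal sketch of the expected layout, using the WebVid-10M tokenizer and router filenames from the table above (`~/Downloads` is a placeholder for wherever you saved the downloads):

```bash
# Place downloaded checkpoints where the scripts below expect them.
# (~/Downloads is a placeholder; the configs already ship with the repo.)
mkdir -p ckpts
mv ~/Downloads/VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt ckpts/
mv ~/Downloads/router_w_lpips_1.2_50k.pt ckpts/
```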
Note that AR model inference does not use routers. Class-to-video (c2v) generation models:
| AR Model | Param. (AR) | gFVD | #gTokens | AR Model Download Link | Tok. ckpt | Tok. Config | Router Config | Router ckpt |
|---|---|---|---|---|---|---|---|---|
| GPT-L-plus | 633M | 48 | 756 | GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
| GPT-L | 327M | 62 | 756 | GPT_L_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
If you do not have access to V-JEPA2, you can use an alternative router that does not depend on it: config router_w_lpips_1.2_raw.yaml, ckpt router_w_lpips_l1.2_raw_50k.pt. In reconstruction tests, there is no obvious gap between this router and the V-JEPA2-dependent one.
For frame prediction, the 5 condition frames are encoded into 512+128=640 conditioning tokens, matching the `--fixed-prefix "512,128"` setting used in the frame-prediction script below. Frame-prediction (fp) models:
| AR Model | Param. (AR) | gFVD | #gTokens | AR Model Download Link | Tok. ckpt | Tok. Config | Router Config | Router ckpt |
|---|---|---|---|---|---|---|---|---|
| GPT-L-plus | 633M | 4.0 | 862 | GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
| GPT-L | 327M | 4.6 | 862 | GPT_L_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
We provide easy-to-run qualitative evaluation scripts below. More quantitative evaluation scripts can be found in Detailed_instructions.
To perform tokenizer reconstruction, set up the required environment variables and then run the reconstruction script.
For the environment setup, copy set_env_vars_template.sh to set_env_vars.sh and fill it in according to the comments inside. For this reconstruction task, you only need to set two variables: PROJECT_ROOT and PYPATH.
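A minimal sketch of set_env_vars.sh with just these two entries (both values are placeholders; PYPATH is assumed to be the Python interpreter of the evatok environment, based on how it is invoked below):

```bash
# Minimal scripts/set_env_vars.sh for the reconstruction task.
# Both values are placeholders; adapt them to your machine.
export PROJECT_ROOT=/path/to/EVATok   # repository root
export PYPATH=$(which python)         # python interpreter of the evatok conda env
```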
The script below reconstructs the videos in A_DIR_OF_VIDEOS using token assignments adaptively predicted by the router. The results, including the reconstructed videos and the original clips (automatically cropped to 16x128x128), are saved to OUT_DIR.
```bash
. scripts/set_env_vars.sh
# tokenizer: trained on WebVid-10M
export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2_3fps_webvid.yaml
export VQ_CKPT=ckpts/VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt
# tokenizer: trained on UCF and K600
# export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2.yaml
# export VQ_CKPT=ckpts/VQ_SB_final_with_router_ucf_k600_1000k.pt
# router: dependent on V-JEPA2
export ROUTER_CONFIG="configs/router/router_w_lpips_1.2.yaml"
export ROUTER_CKPT="ckpts/router_w_lpips_1.2_50k.pt"
# router: not dependent on V-JEPA2
# export ROUTER_CONFIG="configs/router/router_w_lpips_1.2_raw.yaml"
# export ROUTER_CKPT=ckpts/router_w_lpips_l1.2_raw_50k.pt
$PYPATH tokenizer/router/reconstruction_qual_with_router.py \
--vid_path A_DIR_OF_VIDEOS \
--save_dir OUT_DIR \
--router_config ${ROUTER_CONFIG} \
--router_ckpt ${ROUTER_CKPT} \
--tok_config ${TOK_CONFIG} \
--vq_ckpt ${VQ_CKPT}
```

For the quantitative reconstruction evaluation, see Detailed_instructions.
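Since the two routers reportedly show no obvious reconstruction gap, you can check this yourself by running the same videos through both. A minimal sketch, assuming set_env_vars.sh, TOK_CONFIG, and VQ_CKPT are set as above (the loop structure and output-directory names are illustrative, not part of the released scripts):

```bash
# Reconstruct the same videos with both routers for a side-by-side check.
# Output directory names are illustrative.
while read -r CFG_FILE CKPT_FILE TAG; do
  $PYPATH tokenizer/router/reconstruction_qual_with_router.py \
      --vid_path A_DIR_OF_VIDEOS \
      --save_dir results/recon_compare_${TAG} \
      --router_config ${CFG_FILE} \
      --router_ckpt ${CKPT_FILE} \
      --tok_config ${TOK_CONFIG} \
      --vq_ckpt ${VQ_CKPT}
done <<'EOF'
configs/router/router_w_lpips_1.2.yaml ckpts/router_w_lpips_1.2_50k.pt vjepa2
configs/router/router_w_lpips_1.2_raw.yaml ckpts/router_w_lpips_l1.2_raw_50k.pt raw
EOF
```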
To sample class-conditional videos with the AR models (the class indices below follow UCF-101):

```bash
. scripts/set_env_vars.sh
export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2.yaml
export VQ_CKPT=ckpts/VQ_SB_final_with_router_ucf_k600_1000k.pt
export GPT_MODEL=GPT-LP
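# CFG: classifier-free guidance scale, passed to --cfg-scale below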
export CFG=2.25
export GPT_CKPT="ckpts/GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt"
export OUT_DIR=results/c2v_adaptive_qual_eval/GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000_cfg${CFG}
export PRECISION="none"
export BSZ=8
# Try these classes by setting the --class-idx
# "ApplyLipstick": 1,
# "ApplyEyeMakeup": 0,
# "PushUps": 71,
# "WallPushups": 98,
# "JumpRope": 46,
# "BenchPress": 9,
# "PlayingGuitar": 62,
# "PlayingViolin": 66,
# "PlayingCello": 58,
# "PlayingSitar": 64,
# "SoccerJuggling": 83,
# "PullUps": 69,
# "Typing": 94,
# "Mixing": 53,
# "TableTennisShot": 89,
# "Rafting": 72,
# "WritingOnBoard": 99,
# "BodyWeightSquats": 14,
# "CuttingInKitchen": 24,
bash scripts/test/sample_c2v_visualization.sh \
--tok-config ${TOK_CONFIG} \
--vq-ckpt ${VQ_CKPT} \
--gpt-model ${GPT_MODEL} \
--gpt-ckpt ${GPT_CKPT} \
--sample-dir ${OUT_DIR} \
--cfg-scale ${CFG} \
--per-proc-batch-size ${BSZ} \
--qual-num 20 \
--class-idx 1,0,71,98,14 \
--check-special-token-mask
```

Make sure you have the K600 dataset prepared and the K600_VAL_FILE and K600_ROOT set in the set_env_vars.sh script.
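A minimal sketch of those two entries (both paths are placeholders to adapt):

```bash
# K600 entries in scripts/set_env_vars.sh (paths are placeholders).
export K600_ROOT=/path/to/kinetics600           # dataset root
export K600_VAL_FILE=/path/to/k600_val_list.txt # validation split file
```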
```bash
. scripts/set_env_vars.sh
export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2.yaml
export VQ_CKPT=ckpts/VQ_SB_final_with_router_ucf_k600_1000k.pt
export GPT_MODEL=GPT-LP
export GPT_CKPT="ckpts/GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt"
export BSZ=8
export OUT_DIR=results/fp_adaptive_qual_eval/GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75
bash scripts/test/frame_pred_visualization.sh \
--tok-config ${TOK_CONFIG} \
--vq-ckpt ${VQ_CKPT} \
--gpt-model ${GPT_MODEL} \
--gpt-ckpt ${GPT_CKPT} \
--dataset k600_val \
--sample-dir ${OUT_DIR} \
--per-proc-batch-size ${BSZ} \
--orig-aspect-ratio \
--check-special-token-mask \
--sample-num 20 \
--fixed-prefix "512,128" \
--condition-mode padding
```

- This codebase is built on GigaTok. Important reference codebases for this project include LlamaGen, REPA, DETR, vaex, LARP, VideoMAE.
- We use video semantic encoders to enhance tokenizer training. The VideoMAE-B model from InternVideo and V-JEPA2 are both used and proved very helpful.
This project is licensed under the Apache 2.0 license - see the LICENSE file for details.
```bibtex
@article{xiong2025evatok,
  title={EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation},
  author={Xiong, Tianwei and Liew, Jun Hao and Huang, Zilong and Lin, Zhijie and Feng, Jiashi and Liu, Xihui},
  journal={arXiv preprint arXiv:2603.12267},
  year={2026}
}
```

