¹The University of Hong Kong ²ByteDance Seed
*Work partly done as an intern at ByteDance Seed. ✉ Corresponding author
CVPR 2026
- [2026/03/17] Research paper, code, and models are released for EVATok!
We introduce EVATok, a framework that adaptively tokenizes videos into quality-cost-optimal sequences. We show that content-adaptive video tokenization can surpass fixed-length baselines, achieving superior overall performance in reconstruction and downstream AR generation with fewer tokens.
🚀 In this codebase, we release:
- Routers that predict the optimal token assignment for each video clip.
- Adaptive-length video tokenizers and AR generative models.
- Implementations of improved video tokenizer training using video semantic encoders.
- Comprehensive implementations for conveniently exploring adaptive tokenizer training and evaluation.
```bash
conda create --name evatok python=3.10
conda activate evatok
# works for CUDA 12.2
bash scripts/env_install.sh
```

All the video tokenizers and routers are for 16x128x128 videos.
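After installation, a quick sanity check can confirm that the environment sees your GPU. This is a minimal sketch; it assumes PyTorch is installed by env_install.sh:

```bash
# Verify the environment: print the PyTorch version and whether CUDA is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```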
| Tokenizer | Train Set | Config | Param. (Tokenizer) | Router Config | Router ckpt (link) | #rTokens | rFVD | LPIPS | Tokenizer ckpt (link) |
|---|---|---|---|---|---|---|---|---|---|
| S-B | WebVid-10M | VQ_SB_final_with_router_w_lpips_1.2_3fps_webvid.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 721 | 7.3 | 0.1063 | VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt |
| S-B | UCF-101 & K600 | VQ_SB_final_with_router_w_lpips_1.2.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 774 | 9.7 | 0.1140 | VQ_SB_final_with_router_ucf_k600_1000k.pt |
| S-B (Proxy) | WebVid-10M | VQ_SB_proxy_3fps.yaml | 145M | - | - | - | - | - | VQ_SB_proxy_3fps_webvid_400k.pt |
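The scripts below reference checkpoints under `ckpts/` and configs under `configs/` relative to the repository root. As a minimal sketch of the expected layout, using the WebVid-10M tokenizer and router filenames from the table above (`~/Downloads` is a placeholder for wherever you saved the downloads):

```bash
# Place downloaded checkpoints where the scripts below expect them.
# (~/Downloads is a placeholder; the configs already ship with the repo.)
mkdir -p ckpts
mv ~/Downloads/VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt ckpts/
mv ~/Downloads/router_w_lpips_1.2_50k.pt ckpts/
```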
Note that AR model inference does not use routers. Class-to-video (c2v) generation models:
| AR Model | Param. (AR) | gFVD | #gTokens | AR Model Download Link | Tok. ckpt | Tok. Config | Router Config | Router ckpt |
|---|---|---|---|---|---|---|---|---|
| GPT-L-plus | 633M | 48 | 756 | GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
| GPT-L | 327M | 62 | 756 | GPT_L_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
If you do not have access to V-JEPA2, you can use an alternative router that does not depend on it: config router_w_lpips_1.2_raw.yaml, ckpt router_w_lpips_l1.2_raw_50k.pt. In reconstruction tests, there is no obvious gap between this router and the V-JEPA2-dependent one.
For frame prediction, the 5 condition frames are encoded into 512+128=640 conditioning tokens, matching the `--fixed-prefix "512,128"` setting used in the frame-prediction script below. Frame-prediction (fp) models:
| AR Model | Param. (AR) | gFVD | #gTokens | AR Model Download Link | Tok. ckpt | Tok. Config | Router Config | Router ckpt |
|---|---|---|---|---|---|---|---|---|
| GPT-L-plus | 633M | 4.0 | 862 | GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
| GPT-L | 327M | 4.6 | 862 | GPT_L_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
We provide easy-to-run qualitative evaluation scripts below. More quantitative evaluation scripts can be found in Detailed_instructions.
To perform tokenizer reconstruction, set up the required environment variables and then run the reconstruction script.
For the environment setup, copy set_env_vars_template.sh to set_env_vars.sh and fill it in according to the comments inside. For this reconstruction task, you only need to set two variables: PROJECT_ROOT and PYPATH.
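A minimal sketch of set_env_vars.sh with just these two entries (both values are placeholders; PYPATH is assumed to be the Python interpreter of the evatok environment, based on how it is invoked below):

```bash
# Minimal scripts/set_env_vars.sh for the reconstruction task.
# Both values are placeholders; adapt them to your machine.
export PROJECT_ROOT=/path/to/EVATok   # repository root
export PYPATH=$(which python)         # python interpreter of the evatok conda env
```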
The script below reconstructs the videos in A_DIR_OF_VIDEOS using token assignments adaptively predicted by the router. The results, including the reconstructed videos and the original clips (automatically cropped to 16x128x128), are saved to OUT_DIR.
```bash
. scripts/set_env_vars.sh
# tokenizer: trained on WebVid-10M
export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2_3fps_webvid.yaml
export VQ_CKPT=ckpts/VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt
# tokenizer: trained on UCF and K600
# export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2.yaml
# export VQ_CKPT=ckpts/VQ_SB_final_with_router_ucf_k600_1000k.pt
# router: dependent on V-JEPA2
export ROUTER_CONFIG="configs/router/router_w_lpips_1.2.yaml"
export ROUTER_CKPT="ckpts/router_w_lpips_1.2_50k.pt"
# router: not dependent on V-JEPA2
# export ROUTER_CONFIG="configs/router/router_w_lpips_1.2_raw.yaml"
# export ROUTER_CKPT=ckpts/router_w_lpips_l1.2_raw_50k.pt
$PYPATH tokenizer/router/reconstruction_qual_with_router.py \
--vid_path A_DIR_OF_VIDEOS \
--save_dir OUT_DIR \
--router_config ${ROUTER_CONFIG} \
--router_ckpt ${ROUTER_CKPT} \
--tok_config ${TOK_CONFIG} \
--vq_ckpt ${VQ_CKPT}
```

For the quantitative reconstruction evaluation, see Detailed_instructions.
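Since the two routers reportedly show no obvious reconstruction gap, you can check this yourself by running the same videos through both. A minimal sketch, assuming set_env_vars.sh, TOK_CONFIG, and VQ_CKPT are set as above (the loop structure and output-directory names are illustrative, not part of the released scripts):

```bash
# Reconstruct the same videos with both routers for a side-by-side check.
# Output directory names are illustrative.
while read -r CFG_FILE CKPT_FILE TAG; do
  $PYPATH tokenizer/router/reconstruction_qual_with_router.py \
      --vid_path A_DIR_OF_VIDEOS \
      --save_dir results/recon_compare_${TAG} \
      --router_config ${CFG_FILE} \
      --router_ckpt ${CKPT_FILE} \
      --tok_config ${TOK_CONFIG} \
      --vq_ckpt ${VQ_CKPT}
done <<'EOF'
configs/router/router_w_lpips_1.2.yaml ckpts/router_w_lpips_1.2_50k.pt vjepa2
configs/router/router_w_lpips_1.2_raw.yaml ckpts/router_w_lpips_l1.2_raw_50k.pt raw
EOF
```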
To sample class-conditional videos with the AR models (the class indices below follow UCF-101):

```bash
. scripts/set_env_vars.sh
export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2.yaml
export VQ_CKPT=ckpts/VQ_SB_final_with_router_ucf_k600_1000k.pt
export GPT_MODEL=GPT-LP
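# CFG: classifier-free guidance scale, passed to --cfg-scale below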
export CFG=2.25
export GPT_CKPT="ckpts/GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt"
export OUT_DIR=results/c2v_adaptive_qual_eval/GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000_cfg${CFG}
export PRECISION="none"
export BSZ=8
# Try these classes by setting the --class-idx
# "ApplyLipstick": 1,
# "ApplyEyeMakeup": 0,
# "PushUps": 71,
# "WallPushups": 98,
# "JumpRope": 46,
# "BenchPress": 9,
# "PlayingGuitar": 62,
# "PlayingViolin": 66,
# "PlayingCello": 58,
# "PlayingSitar": 64,
# "SoccerJuggling": 83,
# "PullUps": 69,
# "Typing": 94,
# "Mixing": 53,
# "TableTennisShot": 89,
# "Rafting": 72,
# "WritingOnBoard": 99,
# "BodyWeightSquats": 14,
# "CuttingInKitchen": 24,
bash scripts/test/sample_c2v_visualization.sh \
--tok-config ${TOK_CONFIG} \
--vq-ckpt ${VQ_CKPT} \
--gpt-model ${GPT_MODEL} \
--gpt-ckpt ${GPT_CKPT} \
--sample-dir ${OUT_DIR} \
--cfg-scale ${CFG} \
--per-proc-batch-size ${BSZ} \
--qual-num 20 \
--class-idx 1,0,71,98,14 \
--check-special-token-mask
```

Make sure you have the K600 dataset prepared and the K600_VAL_FILE and K600_ROOT set in the set_env_vars.sh script.
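A minimal sketch of those two entries (both paths are placeholders to adapt):

```bash
# K600 entries in scripts/set_env_vars.sh (paths are placeholders).
export K600_ROOT=/path/to/kinetics600           # dataset root
export K600_VAL_FILE=/path/to/k600_val_list.txt # validation split file
```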
```bash
. scripts/set_env_vars.sh
export TOK_CONFIG=configs/vq/VQ_SB_final_with_router_w_lpips_1.2.yaml
export VQ_CKPT=ckpts/VQ_SB_final_with_router_ucf_k600_1000k.pt
export GPT_MODEL=GPT-LP
export GPT_CKPT="ckpts/GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt"
export BSZ=8
export OUT_DIR=results/fp_adaptive_qual_eval/GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75
bash scripts/test/frame_pred_visualization.sh \
--tok-config ${TOK_CONFIG} \
--vq-ckpt ${VQ_CKPT} \
--gpt-model ${GPT_MODEL} \
--gpt-ckpt ${GPT_CKPT} \
--dataset k600_val \
--sample-dir ${OUT_DIR} \
--per-proc-batch-size ${BSZ} \
--orig-aspect-ratio \
--check-special-token-mask \
--sample-num 20 \
--fixed-prefix "512,128" \
--condition-mode padding
```

- This codebase is built on GigaTok. Important reference codebases for this project include LlamaGen, REPA, DETR, vaex, LARP, VideoMAE.
- We use video semantic encoders to enhance tokenizer training. The VideoMAE-B model from InternVideo and V-JEPA2 are both used and proved very helpful.
This project is licensed under the Apache 2.0 license - see the LICENSE file for details.
```bibtex
@article{xiong2025evatok,
  title={EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation},
  author={Xiong, Tianwei and Liew, Jun Hao and Huang, Zilong and Lin, Zhijie and Feng, Jiashi and Liu, Xihui},
  journal={arXiv preprint arXiv:2603.12267},
  year={2026}
}
```

