LARY — A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
LARY is a unified evaluation framework for latent action representations. Given any model that produces latent action representations (LAMs or visual encoders), LARY provides three complementary evaluation pipelines:
| Pipeline | Task |
|---|---|
| `get_latent_action` | Extract latent action representations from videos or image pairs |
| `classification` | Probe how well latent actions capture action semantics (action-type recognition) |
| `regression` | Probe how well latent actions can decode physical robot actions (action regression) |
- [2026-05-01] LARYBench now supports SigLIP2, relative-action regression evaluation (`target = action_tgt - action_src`), and a fast dataset integrity checker. Happy Labor Day!
- [2026-04-27] We have open-sourced all datasets on HuggingFace.
- [2026-04-21] We release the general LAMs trained in ablation studies, LAPA-DINOv3 and LAPA-DINOv2. Even though these models are still rough experimental prototypes, with clear flaws in both training data and methods, we’re sharing them anyway to help push latent action research forward together. Have fun~
- [2026-04-15] We release partial training datasets due to license limitations.
- [2026-04-13] We release the code, text annotations, and partial validation datasets. Training datasets are coming soon.
- Code
- Text annotations link
- Partial Validation datasets
- Partial Training datasets
- Full datasets
- Overview
- Contributions
- Environment Setup
- Data Preparation
- Quick Start
- Relative-Action Regression
- Supported Models
- Adding a Custom Model
- Supported Datasets
- Evaluation Outputs
While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated.
We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied LAMs. (ii) Latent-based visual space is fundamentally better aligned with physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
- LARYBench: We introduce LARYBench, the first comprehensive benchmark to decouple the evaluation of latent action representations from downstream policy performance. LARYBench probes representations along two complementary dimensions, high-level semantic action encoding (what to do) and the low-level physical dynamics required for robotic control (how to do it), enabling direct, standardized measurement of representation quality itself.
- Large-Scale Data Engine: To support rigorous evaluation, we develop an automated data engine to re-segment and re-annotate a large-scale corpus, yielding 1.2M videos, 620K image pairs, and 595K trajectories across 151 action categories and 11 robotic embodiments, covering both human and robotic agents from egocentric and exocentric perspectives in simulated and real-world environments.
- Key Findings: Through systematic evaluation of 11 models, we reveal two consistent findings: (i) action-relevant features can emerge from large-scale visual pre-training without explicit action supervision, and (ii) latent-based feature spaces tend to align with robotic control better than pixel-based ones. These results suggest that future VLA systems may benefit more from leveraging general visual representations than from learning action spaces solely on scarce robotic data.
Use larybench as the base environment.
conda create -n larybench python=3.10 -y
conda activate larybench
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Some model families keep their original dependencies and should be configured from their upstream projects when you evaluate them:
| Model family | Environment guidance |
|---|---|
| `dinov2`, `dinov3`, `siglip2`, `dinov2-origin`, `dinov3-origin`, `siglip2-origin`, `lapa`, `magvit2`, `univla`, `flux2` | Use `larybench` |
| `vjepa2` | Follow facebookresearch/vjepa2 and activate your vjepa2 env |
| `wan2-2` | Follow Wan-Video/Wan2.2 and activate your wan env |
| `villa-x` | Follow microsoft/villa-x and set `VILLA_X_DIR` |
Configure paths in env.sh, then source it before running commands. Example:
LARY_ROOT=/your_name/code/LARYBench
LARY_LOG_DIR=/your_data_disk/LARYBench/logs
DATA_DIR=/your_data_disk/LARYBench/data
MODEL_DIR=/your_data_disk/LARYBench/models
LARY_LA_DIR=/your_data_disk/LARYBench/latent_actions
DINO_V2_PATH=/your_data_disk/LARYBench/models/DINOv2
DINO_V3_PATH=/your_data_disk/LARYBench/models/DINOv3
SIGLIP2_PATH=/your_data_disk/LARYBench/models/SigLIP2
source env.sh

The dataset root should be `DATA_DIR`:
/your_data_disk/LARYBench/data/
├── classification/
│ ├── AgiBotWorld-Beta/
│ ├── Ego4D/
│ ├── EgoDex/
│ ├── EPIC-KITCHENS/
│ ├── HoloAssist/
│ ├── SSv2/
│ └── TACO/
├── regression/
│ ├── agibot_45/
│ ├── calvin/{train_stride5,val_stride5}/
│ ├── robocoin_10/
│ └── vlabench/
└── regression_relative/ # optional; generated for relative-action regression
Metadata CSVs are committed in this repository under data/. They store relative paths and are resolved against DATA_DIR at runtime.
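As a concrete illustration, here is a minimal sketch of resolving such a relative path; the CSV file name and the `video_path` column are assumptions for the example, so check the committed CSVs under data/ for the actual schema.

```python
import csv
import os

data_dir = os.environ["DATA_DIR"]           # set in env.sh
metadata_csv = "data/example_metadata.csv"  # placeholder name; use one of the committed CSVs under data/

with open(metadata_csv) as f:
    for row in csv.DictReader(f):
        # "video_path" is an assumed column name for the stored relative path
        abs_path = os.path.join(data_dir, row["video_path"])
        print(abs_path)
        break
```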
Data setup flow:
- Download the LARYBench archives from HuggingFace or ModelScope.
- Extract them so the folder layout matches the example above.
- Download SSv2 and EgoDex separately, then generate their clipped videos. This is required for the `Human_1st` classification task.
python utils/prepare_ssv2_egodex.py \
--ssv2-root /path/to/20bn-something-something-v2 \
--egodex-root /path/to/EgoDex \
--output-dir $DATA_DIR/classification \
--workers 16

For AgiBotWorld-Beta (`agibotbeta`) and RoboCOIN regression, compute per-robot action normalization statistics from the train split before training absolute-action probes. The outputs are read automatically by `lary.cli regress`.
python utils/compute_robot_action_stats.py \
--dataset agibotbeta \
--data-root $DATA_DIR
python utils/compute_robot_action_stats.py \
--dataset robocoin \
--data-root $DATA_DIR

This writes `DATA_DIR/regression/agibot_45/agibotbeta_stats.json` and `DATA_DIR/regression/robocoin_10/robocoin_stats.json`.
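For intuition, the statistics are just per-dimension means and standard deviations over the training actions. The sketch below is illustrative, not the `utils/compute_robot_action_stats.py` implementation; the input file name is hypothetical and per-robot grouping is omitted for brevity.

```python
import json
import numpy as np

# Hypothetical input: a stacked (N, action_dim) array of absolute train-split actions for one robot
actions = np.load("train_actions.npy")

stats = {
    "mean": actions.mean(axis=0).tolist(),
    "std": (actions.std(axis=0) + 1e-6).tolist(),  # small epsilon guards against zero variance
}

with open("agibotbeta_stats.json", "w") as f:
    json.dump(stats, f, indent=2)
```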
To check whether images, videos, and regression .npy files exist and can be opened, run the integrity checker. Use --groups to scan only selected datasets.
python utils/check_dataset_integrity.py \
--data-root $DATA_DIR \
--output dataset_integrity_report.txt \
--workers 64 \
--timeout 3
python utils/check_dataset_integrity.py \
--data-root $DATA_DIR \
--groups CALVIN \
--output dataset_integrity_calvin.txt \
--workers 64 \
--timeout 3

The evaluation has two steps:
- Extract latent actions from image pairs or videos with the model you want to evaluate.
- Train a lightweight probe on the extracted latent actions: classification for action semantics, or regression for physical actions.
GPU defaults are explicit in the examples below:
- `extract` is single-GPU by default (`--gpus 0`); pass a comma-separated list (`--gpus 0,1,2,3,4,5,6,7`) to start one extraction partition per GPU and merge the partition CSVs automatically.
- `classify` defaults to `--gpus 0,1,2,3,4,5,6,7`; pass `--gpus 0` for single-card probing.
- `regress` follows `CUDA_VISIBLE_DEVICES`; if it is unset, the CLI assumes `0,1,2,3,4,5,6,7`, so set it explicitly for single-card runs.
conda activate larybench
source env.sh
# Step 1: extract latent actions from image pairs.
# --model: dinov3, siglip2, magvit2, lapa, univla, villa-x, dinov2-origin, dinov3-origin, siglip2-origin.
# --stride: calvin=5, vlabench=5, agibotbeta=45, robocoin=10.
# --split: calvin/vlabench use train,val; agibotbeta/robocoin use seen_train,seen_val.
# Single GPU by default; use --gpus 0,1,2,3 for multi-GPU partitioned extraction.
python -m lary.cli extract \
--model dinov2 \
--dataset calvin \
--split train \
--mode image \
--stride 5 \
--gpus 0,1,2,3
python -m lary.cli extract \
--model dinov2 \
--dataset calvin \
--split val \
--mode image \
--stride 5 \
--gpus 0,1,2,3
# Step 2: train the regression probe.
# Uses CUDA_VISIBLE_DEVICES; one visible GPU means single-card training, multiple visible GPUs use accelerate.
# Keep --dataset and --stride consistent with the extracted CSV names.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m lary.cli regress \
--model dinov2 \
--dataset calvin \
--stride 5 \
--model-type mlp

The extraction step writes latent-action `.npz` files under `$LARY_LA_DIR` and CSVs such as `data/train_la_calvin_5_dinov2.csv`. Regression logs and metrics are written under `$LARY_LOG_DIR/regression/`.
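To sanity-check an extraction run, you can open one of the generated files. The path and key names below are assumptions (a `tokens` array is expected, as described in the custom-model section later in this README), so verify them against your own files.

```python
import numpy as np

# Path is illustrative; pick any file written under $LARY_LA_DIR
la = np.load("/your_data_disk/LARYBench/latent_actions/example.npz")
print(la.files)      # key names depend on the model; a "tokens" array is expected
tokens = la["tokens"]
print(tokens.shape)  # the last dimension corresponds to the --dim value used by classify
```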
conda activate larybench
source env.sh
# Step 1: extract latent actions from videos.
# --model: dinov3, siglip2, magvit2, lapa, univla, villa-x, dinov2-origin, dinov3-origin, siglip2-origin.
# --dataset: robot_1st, human_1st
python -m lary.cli extract \
--model dinov2 \
--dataset robot_1st \
--split train \
--mode video \
--gpus 0,1,2,3
python -m lary.cli extract \
--model dinov2 \
--dataset robot_1st \
--split val \
--mode video \
--gpus 0,1,2,3
# Step 2: train the classification probe.
# Robot_1st has 54 classes; Human_1st has 123 classes.
# --dim: dinov2=1024, dinov3=1024, siglip2=768, magvit2=18,
#        lapa=1024, univla=128, villa-x=32,
#        dinov2-origin=1024, dinov3-origin=1024, vjepa2=1024,
#        siglip2-origin=768, wan2-2=48, flux2=128
python -m lary.cli classify \
--model dinov2 \
--dataset robot_1st \
--dim 1024 \
--classes 54 \
--gpus 0,1,2,3

Classification outputs are written under `$LARY_LOG_DIR/classification/`.
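For reference, the probe is conceptually just a small classifier on top of (pooled) latent tokens. The sketch below is illustrative only and not the exact head defined by `lary.cli classify`; the pooling and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class LatentActionProbe(nn.Module):
    """Lightweight probe: pool latent tokens, then classify."""

    def __init__(self, dim: int = 1024, num_classes: int = 54):
        super().__init__()
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        if tokens.dim() == 3:             # (B, N, dim) -> (B, dim) by mean pooling
            tokens = tokens.mean(dim=1)
        return self.head(tokens)

probe = LatentActionProbe(dim=1024, num_classes=54)
logits = probe(torch.randn(8, 16, 1024))  # dummy batch: 8 clips, 16 latent tokens each
```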
Absolute regression predicts the absolute action chunk. Relative regression predicts relative motion between two frames. Generate non-overwriting relative-action files first:
python utils/prepare_relative_actions.py \
--dataset calvin \
--input-root $DATA_DIR \
--output-root $DATA_DIR \
--csv data/train_la_calvin_5_dinov2.csv \
--csv data/val_la_calvin_5_dinov2.csv \
--workers 32

This creates `DATA_DIR/regression_relative/...` and writes relative-action mean/std statistics, for example `DATA_DIR/regression_relative/calvin/relative_action_stats_calvin.json`.
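Conceptually, each relative target is the difference between the target-frame and source-frame actions, following the `target = action_tgt - action_src` convention noted in the changelog. The sketch below is a minimal illustration with hypothetical file names, not the `utils/prepare_relative_actions.py` code.

```python
import numpy as np

# Hypothetical per-sample absolute actions at the source frame and at source + stride
action_src = np.load("action_src.npy")
action_tgt = np.load("action_tgt.npy")

relative_action = action_tgt - action_src    # regression target in relative mode
np.save("relative_action.npy", relative_action)
```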
Run relative regression with the same latent-action CSVs:
CUDA_VISIBLE_DEVICES=0 python -m lary.cli regress \
--model dinov2 \
--dataset calvin \
--stride 5 \
--model-type mlp \
--action-mode relative

| Model key | What it extracts | Environment |
|---|---|---|
| `dinov2` | LAPA-DINOv2 latent actions | `larybench` |
| `dinov3` | LAPA-DINOv3 latent actions | `larybench` |
| `siglip2` | LAPA-SigLIP2 latent actions | `larybench` |
| `magvit2` | Open-MAGVIT2-based latent actions | `larybench`; set `MAGVIT2_CONFIG_PATH` and `MAGVIT2_TOKENIZER_PATH` |
| `dinov2-origin` | Raw DINOv2 visual features | `larybench` |
| `dinov3-origin` | Raw DINOv3 visual features | `larybench` |
| `siglip2-origin` | Raw SigLIP2 visual features | `larybench` |
| `lapa` | LAPA / LAQ latent actions | `larybench` |
| `univla` | UniVLA latent actions | `larybench`; set `UNIVLA_CKPT_PATH` |
| `villa-x` | villa-X latent actions | upstream villa-X env |
| `flux2` | FLUX.2 VAE features | `larybench`; set `AE_MODEL_PATH` |
| `vjepa2` | V-JEPA2 video features | upstream vjepa2 env |
| `wan2-2` | Wan2.2 VAE features | upstream wan env |
LARYBench only needs your model to convert a video or image pair into a numeric tokens array saved in each latent-action .npz file.
- Add model-specific imports in `get_latent_action/dynamics.py`, guarded by `USE_MODEL` if the dependency is optional.
env_model = os.environ.get("USE_MODEL")
if env_model == "my-model":
    from my_project import MyModel

- Register the model loader in `get_dynamic_tokenizer(model)`.
elif model == "my-model":
    dynamics = MyModel.from_pretrained(os.environ["MY_MODEL_CKPT"]).cuda()

- Add the forward branch in `get_latent_action(x, tokenizer, model_name)` and return either `(tokens, indices)` or `tokens`. Classification and regression use `tokens`; `tokens.shape[-1]` is the `--dim` value for classification.
elif model_name == "my-model":
    tokens = tokenizer(x)  # expected shape: (B, ..., D)
    indices = np.array([])

- If the model needs a different input format, add a matching branch in `lary/extract.py` for dataset preprocessing and batch execution. Reuse existing branches such as `dinov2-origin`, `vjepa2`, or `wan2-2` as templates.
- Set any required environment variables in `env.sh`, then run:
python -m lary.cli extract \
--model my-model \
--dataset calvin \
--split train/val \
--mode image \
--stride 5 \
--gpus 0

After extraction creates `data/val_la_<dataset>_<stride>_my-model.csv` or `data/val_la_<dataset>_my-model.csv`, the existing `classify` and `regress` commands can evaluate it without model-specific changes.
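To make the contract concrete, here is a hedged, self-contained sketch of what a custom model wrapper matching the three snippets above might look like. All names (`MyModel`, the projection layer, the checkpoint handling) are illustrative, and the real input preprocessing depends on the branch you add in `lary/extract.py`.

```python
import numpy as np
import torch
import torch.nn as nn

class MyModel(nn.Module):
    """Illustrative wrapper mapping a preprocessed image-pair batch to latent tokens."""

    def __init__(self, in_dim: int = 1024, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)

    @classmethod
    def from_pretrained(cls, ckpt_path: str) -> "MyModel":
        model = cls()
        state = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(state, strict=False)
        return model

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> np.ndarray:
        # x: (B, N, in_dim) features produced by your preprocessing branch in lary/extract.py
        tokens = self.proj(x)             # (B, N, dim); dim is the --dim value for classify
        return tokens.cpu().numpy()       # saved into the latent-action .npz as "tokens"
```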
| Dataset key | Splits | Input mode | Notes |
|---|---|---|---|
| `human_1st` | `train`, `val` | video | 123 classes; includes EgoDex, SSv2, Ego4D, HoloAssist, EPIC-KITCHENS, TACO |
| `robot_1st` | `train`, `val` | video | 54 classes; built from AgiBotWorld-Beta |
| Dataset key | Splits | Stride |
|---|---|---|
| `calvin` | `train`, `val` | 5 |
| `vlabench` | `train`, `val` | 5 |
| `vlabench_15` | `train`, `val` | 15 |
| `vlabench_30` | `train`, `val` | 30 |
| `agibotbeta` | `seen_train`, `seen_val` | 45 |
| `robocoin` | `seen_train`, `seen_val` | 10 |
Extraction creates .npz latent actions under $LARY_LA_DIR and a metadata CSV under this repository's data/ directory. Classification writes checkpoints, logs, confusion matrices, and class metrics under $LARY_LOG_DIR/classification/. Regression writes checkpoints, best-result CSVs, and trajectory visualizations under $LARY_LOG_DIR/regression/.
If you find this work useful, please cite:
@misc{nie2026larylatentactionrepresentation,
title={LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment},
author={Dujun Nie and Fengjiao Chen and Qi Lv and Jun Kuang and Xiaoyu Li and Xuezhi Cao and Xunliang Cai},
year={2026},
eprint={2604.11689},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.11689},
}

LARYBench is built upon the following publicly available datasets. We gratefully acknowledge the efforts of their creators and ask users to comply with each dataset's respective license and terms of use.
| Dataset | Link |
|---|---|
| EgoDex | github.com/apple/ml-egodex |
| Something-Something V2 | something-something-v2 |
| Ego4D | github.com/facebookresearch/Ego4d |
| HoloAssist | holoassist.github.io |
| EPIC-KITCHENS | epic-kitchens.github.io |
| TACO | taco2024.github.io |
| AgiBotWorld-Beta | github.com/OpenDriveLab/AgiBot-World |
| LIBERO | github.com/Lifelong-Robot-Learning/LIBERO |
| RoboCOIN | github.com/FlagOpen/RoboCOIN |
| VLABench | github.com/OpenMOSS/VLABench |
| CALVIN | github.com/mees/calvin |
The code and tools in this repository are released under the MIT License.
However, this dataset is derived from multiple third-party datasets, each governed by its own license. The overall dataset is subject to the most restrictive terms among all included sources. Users must comply with the respective licenses for each subset.
| Dataset | License |
|---|---|
| EPIC-KITCHENS | CC BY-NC 4.0 |
| TACO | CC BY 4.0 |
| AgiBotWorld-Beta | CC BY-NC-SA 4.0 |
| Ego4D, HoloAssist, LIBERO, RoboCOIN, VLABench, CALVIN | MIT |
- Non-commercial use only: Subsets derived from EPIC-KITCHENS and AgiBotWorld-Beta are restricted to non-commercial research and educational purposes only, due to the NC (NonCommercial) clauses in their respective licenses.
- ShareAlike: The AgiBotWorld-Beta-derived subset is subject to the SA (ShareAlike) clause. Any redistribution of this subset must be made available under the same CC BY-NC-SA 4.0 license.
- Attribution required: All subsets derived from Creative Commons-licensed sources require proper attribution to the original dataset authors.
If you intend to use this dataset for commercial purposes, please use only the subsets released under the MIT or CC BY 4.0 licenses (i.e., TACO and the MIT-licensed subsets). The remaining subsets are strictly non-commercial.
For any questions regarding licensing, please refer to the original dataset sources or contact the respective dataset authors.
We thank the following open-source projects for their contributions:
Please contact us at longcat-team@meituan.com or join our WeChat Group if you have any questions.


