LARY — A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
LARY is a unified evaluation framework for latent action representations. Given any model that produces latent action representations (LAMs or visual encoders), LARY provides three complementary evaluation pipelines:
| Pipeline | Task |
|---|---|
| `get_latent_action` | Extract latent action representations from videos or image pairs |
| `classification` | Probe how well latent actions capture action semantics (action-type recognition) |
| `regression` | Probe how well latent actions can decode physical robot actions (action regression) |
- [2026-05-01] LARYBench now supports SigLIP2, relative-action regression evaluation (`target = action_tgt - action_src`), and a fast dataset integrity checker. Happy Labor Day!
- [2026-04-27] We have open-sourced all datasets on HuggingFace.
- [2026-04-21] We release the general LAMs trained in ablation studies, LAPA-DINOv3 and LAPA-DINOv2. Even though these models are still rough experimental prototypes, with clear flaws in both training data and methods, we’re sharing them anyway to help push latent action research forward together. Have fun~
- [2026-04-15] We release partial training datasets due to license limitations.
- [2026-04-13] We release the code, text annotations, and partial validation datasets. Training datasets are coming soon.
- Code
- Text annotations link
- Partial Validation datasets
- Partial Training datasets
- Full datasets
- Overview
- Contributions
- Environment Setup
- Data Preparation
- Quick Start
- Relative-Action Regression
- Supported Models
- Adding a Custom Model
- Supported Datasets
- Evaluation Outputs
While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated.
We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied LAMs. (ii) Latent-based visual space is fundamentally better aligned with physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
- LARYBench: We introduce LARYBench, the first comprehensive benchmark to decouple the evaluation of latent action representations from downstream policy performance. LARYBench probes representations along two complementary dimensions, high-level semantic action encoding (what to do) and the low-level physical dynamics required for robotic control (how to do it), enabling direct, standardized measurement of representation quality itself.
- Large-Scale Data Engine: To support rigorous evaluation, we develop an automated data engine to re-segment and re-annotate a large-scale corpus, yielding 1.2M videos, 620K image pairs, and 595K trajectories across 151 action categories and 11 robotic embodiments, covering both human and robotic agents from egocentric and exocentric perspectives in simulated and real-world environments.
- Key Findings: Through systematic evaluation of 11 models, we reveal two consistent findings: (i) action-relevant features can emerge from large-scale visual pre-training without explicit action supervision, and (ii) latent-based feature spaces tend to align with robotic control better than pixel-based ones. These results suggest that future VLA systems may benefit more from leveraging general visual representations than from learning action spaces solely on scarce robotic data.
Use larybench as the base environment.
conda create -n larybench python=3.10 -y
conda activate larybench
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Some model families keep their original dependencies and should be configured from their upstream projects when you evaluate them:
| Model family | Environment guidance |
|---|---|
| `dinov2`, `dinov3`, `siglip2`, `dinov2-origin`, `dinov3-origin`, `siglip2-origin`, `lapa`, `magvit2`, `univla`, `flux2` | Use `larybench` |
| `vjepa2` | Follow facebookresearch/vjepa2 and activate your vjepa2 env |
| `wan2-2` | Follow Wan-Video/Wan2.2 and activate your wan env |
| `villa-x` | Follow microsoft/villa-x and set `VILLA_X_DIR` |
Configure paths in env.sh, then source it before running commands. Example:
LARY_ROOT=/your_name/code/LARYBench
LARY_LOG_DIR=/your_data_disk/LARYBench/logs
DATA_DIR=/your_data_disk/LARYBench/data
MODEL_DIR=/your_data_disk/LARYBench/models
LARY_LA_DIR=/your_data_disk/LARYBench/latent_actions
DINO_V2_PATH=/your_data_disk/LARYBench/models/DINOv2
DINO_V3_PATH=/your_data_disk/LARYBench/models/DINOv3
SIGLIP2_PATH=/your_data_disk/LARYBench/models/SigLIP2
source env.sh

The dataset root should be `DATA_DIR`:
/your_data_disk/LARYBench/data/
├── classification/
│ ├── AgiBotWorld-Beta/
│ ├── Ego4D/
│ ├── EgoDex/
│ ├── EPIC-KITCHENS/
│ ├── HoloAssist/
│ ├── SSv2/
│ └── TACO/
├── regression/
│ ├── agibot_45/
│ ├── calvin/{train_stride5,val_stride5}/
│ ├── robocoin_10/
│ └── vlabench/
└── regression_relative/ # optional; generated for relative-action regression
Metadata CSVs are committed in this repository under data/. They store relative paths and are resolved against DATA_DIR at runtime.
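As a concrete illustration, here is a minimal sketch of resolving such a relative path; the CSV file name and the `video_path` column are assumptions for the example, so check the committed CSVs under data/ for the actual schema.

```python
import csv
import os

data_dir = os.environ["DATA_DIR"]           # set in env.sh
metadata_csv = "data/example_metadata.csv"  # placeholder name; use one of the committed CSVs under data/

with open(metadata_csv) as f:
    for row in csv.DictReader(f):
        # "video_path" is an assumed column name for the stored relative path
        abs_path = os.path.join(data_dir, row["video_path"])
        print(abs_path)
        break
```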
Data setup flow:
- Download the LARYBench archives from HuggingFace or ModelScope.
- Extract them so the folder layout matches the example above.
- Download SSv2 and EgoDex separately, then generate their clipped videos. This is required for the `Human_1st` classification task.
python utils/prepare_ssv2_egodex.py \
--ssv2-root /path/to/20bn-something-something-v2 \
--egodex-root /path/to/EgoDex \
--output-dir $DATA_DIR/classification \
--workers 16

For AgiBotWorld-Beta (`agibotbeta`) and RoboCOIN regression, compute per-robot action normalization statistics from the train split before training absolute-action probes. The outputs are read automatically by `lary.cli regress`.
python utils/compute_robot_action_stats.py \
--dataset agibotbeta \
--data-root $DATA_DIR
python utils/compute_robot_action_stats.py \
--dataset robocoin \
--data-root $DATA_DIR

This writes `DATA_DIR/regression/agibot_45/agibotbeta_stats.json` and `DATA_DIR/regression/robocoin_10/robocoin_stats.json`.
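For intuition, the statistics are just per-dimension means and standard deviations over the training actions. The sketch below is illustrative, not the `utils/compute_robot_action_stats.py` implementation; the input file name is hypothetical and per-robot grouping is omitted for brevity.

```python
import json
import numpy as np

# Hypothetical input: a stacked (N, action_dim) array of absolute train-split actions for one robot
actions = np.load("train_actions.npy")

stats = {
    "mean": actions.mean(axis=0).tolist(),
    "std": (actions.std(axis=0) + 1e-6).tolist(),  # small epsilon guards against zero variance
}

with open("agibotbeta_stats.json", "w") as f:
    json.dump(stats, f, indent=2)
```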
To check whether images, videos, and regression .npy files exist and can be opened, run the integrity checker. Use --groups to scan only selected datasets.
python utils/check_dataset_integrity.py \
--data-root $DATA_DIR \
--output dataset_integrity_report.txt \
--workers 64 \
--timeout 3
python utils/check_dataset_integrity.py \
--data-root $DATA_DIR \
--groups CALVIN \
--output dataset_integrity_calvin.txt \
--workers 64 \
--timeout 3

The evaluation has two steps:
- Extract latent actions from image pairs or videos with the model you want to evaluate.
- Train a lightweight probe on the extracted latent actions: classification for action semantics, or regression for physical actions.
GPU defaults are explicit in the examples below:
- `extract` is single-GPU by default (`--gpus 0`); pass a comma-separated list (`--gpus 0,1,2,3,4,5,6,7`) to start one extraction partition per GPU and merge the partition CSVs automatically.
- `classify` defaults to `--gpus 0,1,2,3,4,5,6,7`; pass `--gpus 0` for single-card probing.
- `regress` follows `CUDA_VISIBLE_DEVICES`; if it is unset, the CLI assumes `0,1,2,3,4,5,6,7`, so set it explicitly for single-card runs.
conda activate larybench
source env.sh
# Step 1: extract latent actions from image pairs.
# --model: dinov3, siglip2, magvit2, lapa, univla, villa-x, dinov2-origin, dinov3-origin, siglip2-origin.
# --stride: calvin=5, vlabench=5, agibotbeta=45, robocoin=10.
# --split: calvin/vlabench use train,val; agibotbeta/robocoin use seen_train,seen_val.
# Single GPU by default; use --gpus 0,1,2,3 for multi-GPU partitioned extraction.
python -m lary.cli extract \
--model dinov2 \
--dataset calvin \
--split train \
--mode image \
--stride 5 \
--gpus 0,1,2,3
python -m lary.cli extract \
--model dinov2 \
--dataset calvin \
--split val \
--mode image \
--stride 5 \
--gpus 0,1,2,3
# Step 2: train the regression probe.
# Uses CUDA_VISIBLE_DEVICES; one visible GPU means single-card training, multiple visible GPUs use accelerate.
# Keep --dataset and --stride consistent with the extracted CSV names.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m lary.cli regress \
--model dinov2 \
--dataset calvin \
--stride 5 \
--model-type mlp

The extraction step writes latent-action `.npz` files under `$LARY_LA_DIR` and CSVs such as `data/train_la_calvin_5_dinov2.csv`. Regression logs and metrics are written under `$LARY_LOG_DIR/regression/`.
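To sanity-check an extraction run, you can open one of the generated files. The path and key names below are assumptions (a `tokens` array is expected, as described in the custom-model section later in this README), so verify them against your own files.

```python
import numpy as np

# Path is illustrative; pick any file written under $LARY_LA_DIR
la = np.load("/your_data_disk/LARYBench/latent_actions/example.npz")
print(la.files)      # key names depend on the model; a "tokens" array is expected
tokens = la["tokens"]
print(tokens.shape)  # the last dimension corresponds to the --dim value used by classify
```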
conda activate larybench
source env.sh
# Step 1: extract latent actions from videos.
# --model: dinov3, siglip2, magvit2, lapa, univla, villa-x, dinov2-origin, dinov3-origin, siglip2-origin.
# --dataset: robot_1st, human_1st
python -m lary.cli extract \
--model dinov2 \
--dataset robot_1st \
--split train \
--mode video \
--gpus 0,1,2,3
python -m lary.cli extract \
--model dinov2 \
--dataset robot_1st \
--split val \
--mode video \
--gpus 0,1,2,3
# Step 2: train the classification probe.
# Robot_1st has 54 classes; Human_1st has 123 classes.
# --dim: dinov2=1024, dinov3=1024, siglip2=768, magvit2=18,
#        lapa=1024, univla=128, villa-x=32,
#        dinov2-origin=1024, dinov3-origin=1024, vjepa2=1024,
#        siglip2-origin=768, wan2-2=48, flux2=128
python -m lary.cli classify \
--model dinov2 \
--dataset robot_1st \
--dim 1024 \
--classes 54 \
--gpus 0,1,2,3

Classification outputs are written under `$LARY_LOG_DIR/classification/`.
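For reference, the probe is conceptually just a small classifier on top of (pooled) latent tokens. The sketch below is illustrative only and not the exact head defined by `lary.cli classify`; the pooling and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class LatentActionProbe(nn.Module):
    """Lightweight probe: pool latent tokens, then classify."""

    def __init__(self, dim: int = 1024, num_classes: int = 54):
        super().__init__()
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        if tokens.dim() == 3:             # (B, N, dim) -> (B, dim) by mean pooling
            tokens = tokens.mean(dim=1)
        return self.head(tokens)

probe = LatentActionProbe(dim=1024, num_classes=54)
logits = probe(torch.randn(8, 16, 1024))  # dummy batch: 8 clips, 16 latent tokens each
```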
Absolute regression predicts the absolute action chunk. Relative regression predicts relative motion between two frames. Generate non-overwriting relative-action files first:
python utils/prepare_relative_actions.py \
--dataset calvin \
--input-root $DATA_DIR \
--output-root $DATA_DIR \
--csv data/train_la_calvin_5_dinov2.csv \
--csv data/val_la_calvin_5_dinov2.csv \
--workers 32

This creates `DATA_DIR/regression_relative/...` and writes relative-action mean/std statistics, for example `DATA_DIR/regression_relative/calvin/relative_action_stats_calvin.json`.
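Conceptually, each relative target is the difference between the target-frame and source-frame actions, following the `target = action_tgt - action_src` convention noted in the changelog. The sketch below is a minimal illustration with hypothetical file names, not the `utils/prepare_relative_actions.py` code.

```python
import numpy as np

# Hypothetical per-sample absolute actions at the source frame and at source + stride
action_src = np.load("action_src.npy")
action_tgt = np.load("action_tgt.npy")

relative_action = action_tgt - action_src    # regression target in relative mode
np.save("relative_action.npy", relative_action)
```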
Run relative regression with the same latent-action CSVs:
CUDA_VISIBLE_DEVICES=0 python -m lary.cli regress \
--model dinov2 \
--dataset calvin \
--stride 5 \
--model-type mlp \
--action-mode relative

| Model key | What it extracts | Environment |
|---|---|---|
| `dinov2` | LAPA-DINOv2 latent actions | `larybench` |
| `dinov3` | LAPA-DINOv3 latent actions | `larybench` |
| `siglip2` | LAPA-SigLIP2 latent actions | `larybench` |
| `magvit2` | Open-MAGVIT2-based latent actions | `larybench`; set `MAGVIT2_CONFIG_PATH` and `MAGVIT2_TOKENIZER_PATH` |
| `dinov2-origin` | Raw DINOv2 visual features | `larybench` |
| `dinov3-origin` | Raw DINOv3 visual features | `larybench` |
| `siglip2-origin` | Raw SigLIP2 visual features | `larybench` |
| `lapa` | LAPA / LAQ latent actions | `larybench` |
| `univla` | UniVLA latent actions | `larybench`; set `UNIVLA_CKPT_PATH` |
| `villa-x` | villa-X latent actions | upstream villa-X env |
| `flux2` | FLUX.2 VAE features | `larybench`; set `AE_MODEL_PATH` |
| `vjepa2` | V-JEPA2 video features | upstream vjepa2 env |
| `wan2-2` | Wan2.2 VAE features | upstream wan env |
LARYBench only needs your model to convert a video or image pair into a numeric tokens array saved in each latent-action .npz file.
- Add model-specific imports in `get_latent_action/dynamics.py`, guarded by `USE_MODEL` if the dependency is optional.
env_model = os.environ.get("USE_MODEL")
if env_model == "my-model":
    from my_project import MyModel

- Register the model loader in `get_dynamic_tokenizer(model)`.
elif model == "my-model":
    dynamics = MyModel.from_pretrained(os.environ["MY_MODEL_CKPT"]).cuda()

- Add the forward branch in `get_latent_action(x, tokenizer, model_name)` and return either `(tokens, indices)` or `tokens`. Classification and regression use `tokens`; `tokens.shape[-1]` is the `--dim` value for classification.
elif model_name == "my-model":
    tokens = tokenizer(x)  # expected shape: (B, ..., D)
    indices = np.array([])

- If the model needs a different input format, add a matching branch in `lary/extract.py` for dataset preprocessing and batch execution. Reuse existing branches such as `dinov2-origin`, `vjepa2`, or `wan2-2` as templates.
- Set any required environment variables in `env.sh`, then run:
python -m lary.cli extract \
--model my-model \
--dataset calvin \
--split train/val \
--mode image \
--stride 5 \
--gpus 0

After extraction creates `data/val_la_<dataset>_<stride>_my-model.csv` or `data/val_la_<dataset>_my-model.csv`, the existing `classify` and `regress` commands can evaluate it without model-specific changes.
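To make the contract concrete, here is a hedged, self-contained sketch of what a custom model wrapper matching the three snippets above might look like. All names (`MyModel`, the projection layer, the checkpoint handling) are illustrative, and the real input preprocessing depends on the branch you add in `lary/extract.py`.

```python
import numpy as np
import torch
import torch.nn as nn

class MyModel(nn.Module):
    """Illustrative wrapper mapping a preprocessed image-pair batch to latent tokens."""

    def __init__(self, in_dim: int = 1024, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)

    @classmethod
    def from_pretrained(cls, ckpt_path: str) -> "MyModel":
        model = cls()
        state = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(state, strict=False)
        return model

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> np.ndarray:
        # x: (B, N, in_dim) features produced by your preprocessing branch in lary/extract.py
        tokens = self.proj(x)             # (B, N, dim); dim is the --dim value for classify
        return tokens.cpu().numpy()       # saved into the latent-action .npz as "tokens"
```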
| Dataset key | Splits | Input mode | Notes |
|---|---|---|---|
| `human_1st` | `train`, `val` | video | 123 classes; includes EgoDex, SSv2, Ego4D, HoloAssist, EPIC-KITCHENS, TACO |
| `robot_1st` | `train`, `val` | video | 54 classes; built from AgiBotWorld-Beta |
| Dataset key | Splits | Stride |
|---|---|---|
| `calvin` | `train`, `val` | 5 |
| `vlabench` | `train`, `val` | 5 |
| `vlabench_15` | `train`, `val` | 15 |
| `vlabench_30` | `train`, `val` | 30 |
| `agibotbeta` | `seen_train`, `seen_val` | 45 |
| `robocoin` | `seen_train`, `seen_val` | 10 |
Extraction creates .npz latent actions under $LARY_LA_DIR and a metadata CSV under this repository's data/ directory. Classification writes checkpoints, logs, confusion matrices, and class metrics under $LARY_LOG_DIR/classification/. Regression writes checkpoints, best-result CSVs, and trajectory visualizations under $LARY_LOG_DIR/regression/.
If you find this work useful, please cite:
@misc{nie2026larylatentactionrepresentation,
title={LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment},
author={Dujun Nie and Fengjiao Chen and Qi Lv and Jun Kuang and Xiaoyu Li and Xuezhi Cao and Xunliang Cai},
year={2026},
eprint={2604.11689},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.11689},
}

LARYBench is built upon the following publicly available datasets. We gratefully acknowledge the efforts of their creators and ask users to comply with each dataset's respective license and terms of use.
| Dataset | Link |
|---|---|
| EgoDex | github.com/apple/ml-egodex |
| Something-Something V2 | something-something-v2 |
| Ego4D | github.com/facebookresearch/Ego4d |
| HoloAssist | holoassist.github.io |
| EPIC-KITCHENS | epic-kitchens.github.io |
| TACO | taco2024.github.io |
| AgiBotWorld-Beta | github.com/OpenDriveLab/AgiBot-World |
| LIBERO | github.com/Lifelong-Robot-Learning/LIBERO |
| RoboCOIN | github.com/FlagOpen/RoboCOIN |
| VLABench | github.com/OpenMOSS/VLABench |
| CALVIN | github.com/mees/calvin |
The code and tools in this repository are released under the MIT License.
However, this dataset is derived from multiple third-party datasets, each governed by its own license. The overall dataset is subject to the most restrictive terms among all included sources. Users must comply with the respective licenses for each subset.
| Dataset | License |
|---|---|
| EPIC-KITCHENS | CC BY-NC 4.0 |
| TACO | CC BY 4.0 |
| AgiBotWorld-Beta | CC BY-NC-SA 4.0 |
| Ego4D, HoloAssist, LIBERO, RoboCOIN, VLABench, CALVIN | MIT |
- Non-commercial use only: Subsets derived from EPIC-KITCHENS and AgiBotWorld-Beta are restricted to non-commercial research and educational purposes only, due to the NC (NonCommercial) clauses in their respective licenses.
- ShareAlike: The AgiBotWorld-Beta-derived subset is subject to the SA (ShareAlike) clause. Any redistribution of this subset must be made available under the same CC BY-NC-SA 4.0 license.
- Attribution required: All subsets derived from Creative Commons-licensed sources require proper attribution to the original dataset authors.
If you intend to use this dataset for commercial purposes, please use only the subsets released under the MIT or CC BY 4.0 licenses (i.e., TACO and the MIT-licensed subsets). The remaining subsets are strictly non-commercial.
For any questions regarding licensing, please refer to the original dataset sources or contact the respective dataset authors.
We thank the following open-source projects for their contributions:
Please contact us at longcat-team@meituan.com or join our WeChat Group if you have any questions.


