VISTA adapts UMI-collected demonstrations for VLA policy training. It addresses two practical gaps: wrist-fisheye observations are out of distribution for pretrained vision-language models, and human-collected trajectories can be physically infeasible for a target robot. The released code includes:
- VISTA policy integration in LeRobot for post-training and downstream fine-tuning.
- LIBERO-UMI evaluation runners for VISTA checkpoints.
- Cross-embodiment physical validation tools for replaying and scoring UMI-style trajectories.
Datasets and model checkpoints are released through the Hugging Face collection:
https://huggingface.co/collections/TeleEmbodied/vista
umi-vista/
  assets/                                                  # README images and lightweight media
  docs/                                                    # Detailed installation and evaluation guides
  post_training/lerobot/                                   # Vendored LeRobot with VISTA policy support
  simulation_evaluation/libero_umi/                        # LIBERO-UMI runner and summary script
  physical_validation/cross_embodiment_replay_and_score/   # Physical replay and scoring tools
RoboTwin-UMI evaluation code is planned for a later release.
You can install only the part you need:
- Post-training / fine-tuning: install LeRobot with the VISTA policy and the local Transformers fork.
- LIBERO-UMI evaluation: install the post-training environment plus LIBERO simulation dependencies.
- Physical validation: install the replay/scoring Python requirements and unpack robot model assets.
The commands below assume Linux, Python 3.10, a CUDA-capable PyTorch installation, and a fresh clone of this repository.
git clone https://github.com/TeleHuman/umi-vista.git
cd umi-vista
export VISTA_ROOT=$PWD
Create and activate an environment:
conda create -n vista python=3.10 -y
conda activate vista
python -m pip install --upgrade pip
Install a PyTorch build that matches your CUDA driver before installing LeRobot. For example, follow the command generated at:
https://pytorch.org/get-started/locally/
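Before moving on to LeRobot, it can help to confirm that the active environment matches the stated prerequisites. The following is a minimal sketch, not part of the released code: `is_python_310` and `vista_preflight` are hypothetical helper names, and the check only covers the Python version and whether PyTorch reports a visible CUDA device.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): succeeds only for version
# strings of the form 3.10 or 3.10.x.
is_python_310() {
  case "$1" in
    3.10|3.10.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical preflight: checks the interpreter version, then prints the
# PyTorch version and whether torch.cuda.is_available() reports a device.
vista_preflight() {
  local py_version
  py_version=$(python -c 'import platform; print(platform.python_version())') || return 1
  if ! is_python_310 "$py_version"; then
    echo "expected Python 3.10, found ${py_version}" >&2
    return 1
  fi
  python -c 'import torch; print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())'
}
```

After sourcing these functions in the activated `vista` environment, running `vista_preflight` prints the detected PyTorch version. If CUDA is reported as unavailable, the installed PyTorch build probably does not match the driver.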
Prepare the vendored Transformers fork and install LeRobot with VISTA policy support:
export LEROBOT_ROOT=${VISTA_ROOT}/post_training/lerobot
cd ${LEROBOT_ROOT}
bash third_party/prepare_pi_transformers.sh
pip install --no-build-isolation -e ".[pi]"
prepare_pi_transformers.sh unpacks the repository archive:
post_training/lerobot/third_party/transformers-dcddb970176382c0fcf4521b0c0e6fc15894dfe0.zip
into:
post_training/lerobot/third_party/pi_transformers/
The editable install then uses this local Transformers fork through the LeRobot pi optional dependency.
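One way to check which Transformers copy Python actually imports is to print the module path and compare it with the unpacked fork directory. This is a sketch under assumptions: `path_is_under` and `check_transformers_fork` are hypothetical helpers, and the check assumes the fork is imported in place from `third_party/pi_transformers/`, which the instructions above do not state. A mismatch is a prompt to inspect the install, not proof that it is wrong.

```shell
#!/usr/bin/env bash
# Hypothetical helper: true if path $2 lies inside directory $1
# (plain string comparison, symlinks are not resolved).
path_is_under() {
  local dir="${1%/}"
  case "$2" in
    "$dir"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical check: compares the imported transformers module path with
# the unpacked fork directory. Assumes the fork is imported in place.
check_transformers_fork() {
  local fork_dir="${LEROBOT_ROOT:?set LEROBOT_ROOT first}/third_party/pi_transformers"
  local module_file
  module_file=$(python -c 'import transformers; print(transformers.__file__)') || return 1
  echo "transformers imported from: ${module_file}"
  if path_is_under "$fork_dir" "$module_file"; then
    echo "OK: import resolves inside ${fork_dir}"
  else
    echo "WARNING: import does not resolve inside ${fork_dir}" >&2
    return 1
  fi
}
```

Run `check_transformers_fork` from the activated environment after the editable install finishes.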
Download a VISTA base checkpoint and a LeRobot-format training dataset from the Hugging Face collection, then launch fine-tuning with the same hyperparameter settings used for VISTA UMI fine-tuning:
cd ${LEROBOT_ROOT}
export DATASET_REPO_ID=/path/to/lerobot_dataset
export DATASET_ROOT=${DATASET_REPO_ID}
export DATASET_REVISION=v2.0
export POLICY_PATH=/path/to/vista_base_checkpoint/pretrained_model
export OUTPUT_ROOT=/path/to/vista_outputs/fine_tuning
export GPU_IDS=0
export GPUS=1
export MAIN_PROCESS_PORT=29510
export BATCH_SIZE=32
export TOTAL_STEPS=40000
export SAVE_FREQ=10000
export LOG_FREQ=50
export LEARNING_RATE=5e-5
export ACTION_CHUNK_SIZE=50
export NUM_WORKERS=32
export SEED=42
export GRADIENT_CHECKPOINTING=true
DATE_TAG=$(date "+%y-%m-%d_%H-%M-%S")
DATASET_NAME="${DATASET_REPO_ID##*/}"
OUTPUT_DIR="${OUTPUT_ROOT}/${DATASET_NAME}/${DATE_TAG}_vista_gpu${GPUS}_ck${ACTION_CHUNK_SIZE}_lr5e-5_bs${BATCH_SIZE}_s40K_seed${SEED}"
CUDA_VISIBLE_DEVICES=${GPU_IDS} accelerate launch \
--num_processes=${GPUS} \
--main_process_port=${MAIN_PROCESS_PORT} \
src/lerobot/scripts/lerobot_train_umi.py \
--dataset.repo_id=${DATASET_REPO_ID} \
--dataset.root=${DATASET_ROOT} \
--dataset.revision=${DATASET_REVISION} \
--dataset.image_transforms.enable=false \
--dataset.wrist_transforms.enable=true \
--policy.dtype=float32 \
--policy.path=${POLICY_PATH} \
--policy.push_to_hub=false \
--policy.chunk_size=${ACTION_CHUNK_SIZE} \
--policy.n_action_steps=${ACTION_CHUNK_SIZE} \
--policy.optimizer_lr=${LEARNING_RATE} \
--policy.gradient_checkpointing=${GRADIENT_CHECKPOINTING} \
--policy.scheduler_decay_steps=36000 \
--policy.use_delta_action=true \
--output_dir=${OUTPUT_DIR} \
--batch_size=${BATCH_SIZE} \
--steps=${TOTAL_STEPS} \
--save_freq=${SAVE_FREQ} \
--log_freq=${LOG_FREQ} \
--num_workers=${NUM_WORKERS} \
--enforce_input_output_replace=true \
--seed=${SEED}
The saved checkpoint metadata should contain:
{"type": "vista"}
For a short environment validation run, use the smoke-test wrapper documented in docs/installation_finetuning.md.
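The checkpoint metadata check described above can be scripted. The sketch below assumes the metadata lives in a `config.json` file inside the checkpoint's `pretrained_model` directory. That file name is an assumption, so adjust it if your checkpoints store the policy type elsewhere. `has_vista_type` and `check_checkpoint_type` are hypothetical helpers, and the match is purely textual rather than a full JSON parse.

```shell
#!/usr/bin/env bash
# Hypothetical helper: textual check that a JSON file declares "type": "vista".
has_vista_type() {
  grep -Eq '"type"[[:space:]]*:[[:space:]]*"vista"' "$1"
}

# Hypothetical check. Assumption: the metadata file is config.json inside
# the pretrained_model directory passed as $1.
check_checkpoint_type() {
  local config="${1%/}/config.json"
  if [ ! -f "$config" ]; then
    echo "no config.json found under $1" >&2
    return 1
  fi
  if has_vista_type "$config"; then
    echo "policy type: vista"
  else
    echo "policy type in ${config} is not vista" >&2
    return 1
  fi
}
```

For example, `check_checkpoint_type /path/to/checkpoint/pretrained_model` exits non-zero if the file is missing or declares a different policy type.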
LIBERO-UMI evaluation builds on the post-training installation. Install LeRobot with the LIBERO extra:
cd ${LEROBOT_ROOT}
bash third_party/prepare_pi_transformers.sh
pip install --no-build-isolation -e ".[pi,libero]"
Set the checkpoint and runtime paths. By default, the runner evaluates 50 episodes for every task in each selected suite, not 50 total episodes per suite.
cd ${VISTA_ROOT}
POLICY_PATH=/path/to/vista_libero_umi_checkpoint/pretrained_model \
OUTPUT_ROOT=/path/to/vista_outputs/libero_umi_eval \
GPU_ID=0 \
bash simulation_evaluation/libero_umi/run_libero_umi_eval.sh
By default the runner evaluates:
libero_10 libero_goal libero_object libero_spatial
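Because the episode count applies per task, a full run is larger than it may first appear. The sketch below estimates the total number of rollouts. It assumes 10 tasks per suite, which matches the standard LIBERO suites listed above but is not stated here for the LIBERO-UMI runner. `TASKS_PER_SUITE` and `total_episodes` are hypothetical names.

```shell
#!/usr/bin/env bash
# Assumption: every selected suite contributes 10 tasks.
TASKS_PER_SUITE=10

# Hypothetical helper: $1 is a space-separated suite list,
# $2 is the number of episodes per task.
total_episodes() {
  local suites="$1"
  local episodes_per_task="$2"
  local n_suites=0
  local suite
  for suite in $suites; do
    n_suites=$((n_suites + 1))
  done
  echo $((n_suites * TASKS_PER_SUITE * episodes_per_task))
}
```

Under that assumption, `total_episodes "libero_10 libero_goal libero_object libero_spatial" 50` gives 4 × 10 × 50 = 2000 rollouts for the default configuration.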
Override SUITES to run a subset:
SUITES="libero_goal" \
POLICY_PATH=/path/to/vista_libero_umi_checkpoint/pretrained_model \
bash simulation_evaluation/libero_umi/run_libero_umi_eval.sh
Run a quick five-episodes-per-task check by explicitly overriding the default:
POLICY_PATH=/path/to/vista_libero_umi_checkpoint/pretrained_model \
N_EPISODES_PER_TASK=5 \
bash simulation_evaluation/libero_umi/run_libero_umi_eval.sh
The runner writes per-suite logs, eval_info.json files, and a summary.tsv under OUTPUT_ROOT.
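To locate those outputs after a run, a small helper can walk the output directory. This is a sketch only: `list_eval_outputs` is a hypothetical name, the exact directory layout is not specified above, and the helper assumes `summary.tsv` sits directly under the output root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lists every eval_info.json under the given root,
# then prints summary.tsv if it exists directly under that root
# (the location of summary.tsv is an assumption).
list_eval_outputs() {
  local root="${1%/}"
  find "$root" -type f -name 'eval_info.json' | sort
  if [ -f "$root/summary.tsv" ]; then
    echo "--- $root/summary.tsv ---"
    cat "$root/summary.tsv"
  fi
}
```

For example, `list_eval_outputs /path/to/vista_outputs/libero_umi_eval` prints one path per `eval_info.json` file, followed by the summary table when it is present.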
If your LIBERO installation does not already include assets, BDDL files, or initial states, see docs/installation_libero_umi.md for the required LIBERO_CONFIG_PATH, asset paths, and optional download commands.
The physical-validation tool replays UMI-style trajectories in robot-specific MuJoCo models and writes trajectory quality scores.
cd ${VISTA_ROOT}/physical_validation/cross_embodiment_replay_and_score
python3 -m pip install -r requirements.txt
bash prepare_model_assets.sh
prepare_model_assets.sh unpacks:
physical_validation/cross_embodiment_replay_and_score/model_assets.tar.gz
and creates the generated directory:
physical_validation/cross_embodiment_replay_and_score/model/
Run simulation scoring for one robot:
export SCORE_FOLDER_PATH=/path/to/task_trajectory_folder
export ROBOT_NAME=rm75
python3 replay.py
Supported robot names are:
acone
r1pro
rm75
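Since the robot is selected through an environment variable, a typo in `ROBOT_NAME` is easy to miss. The sketch below validates the name against the three supported values before calling `replay.py`. `SUPPORTED_ROBOTS`, `is_supported_robot`, and `replay_checked` are hypothetical names, and how `replay.py` itself handles an unknown name is not described here.

```shell
#!/usr/bin/env bash
# Robot names taken from the supported list above.
SUPPORTED_ROBOTS="acone r1pro rm75"

# Hypothetical helper: succeeds only if $1 is one of the supported names.
is_supported_robot() {
  local name
  for name in $SUPPORTED_ROBOTS; do
    [ "$name" = "$1" ] && return 0
  done
  return 1
}

# Hypothetical wrapper: validates ROBOT_NAME, then forwards any arguments
# (for example a trajectory index) to replay.py.
replay_checked() {
  if ! is_supported_robot "${ROBOT_NAME:-}"; then
    echo "unsupported ROBOT_NAME: '${ROBOT_NAME:-}' (expected one of: ${SUPPORTED_ROBOTS})" >&2
    return 1
  fi
  python3 replay.py "$@"
}
```

With `SCORE_FOLDER_PATH` and `ROBOT_NAME` exported, `replay_checked` behaves like `python3 replay.py` but fails early on an unrecognized robot name.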
Run one indexed trajectory:
export SCORE_FOLDER_PATH=/path/to/task_trajectory_folder
export ROBOT_NAME=r1pro
python3 replay.py 0
Score one task folder across all supported robots:
export SCORE_FOLDER_PATH=/path/to/task_trajectory_folder
bash run_replay_all.sh
Score all task folders under one or more parent directories:
BASE_DIRS="/path/to/task_parent_a /path/to/task_parent_b" bash run_replay_all_batch.sh
Results are written under:
physical_validation/cross_embodiment_replay_and_score/log/
If you find VISTA useful, please cite the paper:
@misc{yang2026vistavisiongroundedphysicsvalidatedadaptation,
title={VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training},
author={Siyuan Yang and Linzheng Guo and Ouyang Lu and Zhaxizhuoma and Daoran Zhang and Xinmiao Wang and Ting Xiao and Fangzheng Yan and Zhijun Chen and Yan Ding and Chao Yu and Chenjia Bai and Xuelong Li},
year={2026},
eprint={2606.04708},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2606.04708},
}
This project is released under the Apache License 2.0. See LICENSE.
