Bring My Cup! ☕
Personalizing Vision-Language-Action Models with Visual Attentive Prompting

ICML 2026 · arXiv · PDF · Project Page · HuggingFace · License: MIT

🎉 Accepted at ICML 2026 🎉


VAP Banner

"To be truly useful in daily life, robots must discern the subtle details that distinguish 'a cup' from 'my cup.'"

Demo clips, one per instruction:

"Put my stuffed toy into the plastic bowl"
"Put my brother's dog figurine and my ornament into the plastic bowl"
"Put my pouch into the plastic bowl"
"Put my cat figurine and my brother's owl figurine into the plastic bowl"
"Put my camera on towel"
"Pick my pen holder"
"Select my leather bag"
"Put my straw cup into the basket"

VAP enables frozen VLA models to manipulate user-specific objects among visually similar distractors.


💡 What is VAP?

Existing VLA models are great at understanding generic commands ("pick up the cup"), but they fail when asked to "pick up my cup" among other similar cups.

Visual Attentive Prompting (VAP) solves this by acting as a pair of "personalized glasses" for the robot.

  1. See & Remember: It takes a few reference photos of your object.
  2. Highlight: It visually detects and highlights the target object in the robot's view.
  3. Act: It guides the frozen VLA model to manipulate the correct object without any expensive training or fine-tuning.
VAP Pipeline

🚀 Run

Once the environments are set up (see 🛠️ Installation below), you can run the benchmarks immediately.

1. Personalized-SIMPLER

Evaluate VAP on SimplerEnv with Bridge/Fractal baselines.

conda activate simpler
bash ./scripts/run_personalized_simpler_vap.sh
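To keep a record of an evaluation run, redirecting the output works; the logs/ directory here is our own example, not something the repo creates:

conda activate simpler
mkdir -p logs
bash ./scripts/run_personalized_simpler_vap.sh 2>&1 | tee logs/personalized_simpler_vap.log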

2. Personalized-VLABench

Evaluate VAP on VLABench tasks.

conda activate vlabench
bash ./scripts/run_personalized_vlabench_vap.sh
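If you want to see which settings the script selects before committing to a long run, printing its header first is a quick, read-only check:

sed -n '1,40p' ./scripts/run_personalized_vlabench_vap.sh  # inspect defaults without running anything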

3. Real-world Scenarios

Deploy VAP as a server for real-world robot experiments.

conda activate vap_server
bash ./scripts/run_realworld_vap_server.sh
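The host and port the server binds to are set inside the script; once it is running, a generic way to confirm it is listening (no repo-specific tooling assumed) is:

ss -tlnp | grep -i python   # lists listening sockets owned by Python processes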

🛠️ Installation

1. Clone & Assets

First, clone the repository and download the necessary data assets.

git clone https://github.com/Leesangoh/VAP.git
cd VAP
git submodule update --init --recursive

# Set Environment Variables
export VAP_HOME=$(pwd)
export HF_HOME=$HOME/.cache/huggingface  # Modify if needed
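These exports only apply to the current shell. To make them persistent across sessions (assuming a bash login shell), you can append them to your profile:

echo "export VAP_HOME=$PWD" >> ~/.bashrc          # $PWD resolves to the VAP checkout at write time
echo 'export HF_HOME=$HOME/.cache/huggingface' >> ~/.bashrc
source ~/.bashrc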

📥 Download Assets: please download the files below and place them in the correct directories.

2. Environment Setup

We use four separate conda environments to manage dependencies for the different baselines. The installation commands for each environment are listed below.

🐍 A. Setup `openpi_torch` Env (For pi0 server)
conda create -n openpi_torch python=3.11 -y
conda activate openpi_torch

cd $VAP_HOME/src/simulation/models/openpi_torch
pip install -r requirements.txt
cd packages/openpi-client
pip install -e .
cd ../../
pip install -e .

# Copy the patched transformers modules over the installed ones
cd $VAP_HOME/src/simulation/models/openpi_torch
cp -r ./src/openpi/models_pytorch/transformers_replace/* $(python -c "import transformers; print(transformers.__path__[0])")
conda deactivate
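A quick way to confirm the patched files landed in the environment's transformers package (a sanity check we suggest, not a repo step):

conda activate openpi_torch
python -c "import transformers; print(transformers.__path__[0])"  # should point inside the openpi_torch env
conda deactivate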
🤖 B. Setup `simpler` Env (For SIMPLER Simulation)
conda create -n simpler python=3.10 -y
conda activate simpler

cd $VAP_HOME/src/simulation/SIMPLER/ManiSkill2_real2sim
pip install -e .
cd ../
pip install -e .

# Dependencies
pip install torch tensorflow==2.15.0 pandas matplotlib omegaconf mediapy websockets
pip install flax==0.5 jax==0.4.1 msgpack hydra-core einops transformers==4.56.0 torchvision bitsandbytes

# Pin jaxlib to match jax 0.4.1
pip uninstall jaxlib -y
pip install "jaxlib==0.4.1" -i https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/
pip install numpy==1.24.4

conda deactivate
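Since jax and jaxlib must match exactly, it is worth verifying the pins after the reinstall:

conda activate simpler
python -c "import jax, jaxlib, numpy; print(jax.__version__, jaxlib.__version__, numpy.__version__)"
# Expected: 0.4.1, 0.4.1, 1.24.4 given the pins above
conda deactivate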
🧪 C. Setup `vlabench` Env (For VLABench Simulation)
conda create -n vlabench python=3.10 -y
conda activate vlabench

cd $VAP_HOME/src/simulation/VLABench
pip install -r requirements.txt
pip install -e .
pip install websockets msgpack torchvision
conda deactivate
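A minimal import check to confirm the editable install succeeded (we assume the package imports as VLABench; adjust if the module name differs):

conda activate vlabench
python -c "import VLABench; print('VLABench import OK')"
conda deactivate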
🖥️ D. Setup `vap_server` Env (For Real-world Server)
conda create -n vap_server python=3.10 -y
conda activate vap_server

pip install torch torchvision numpy msgpack websockets pillow requests
pip install --upgrade transformers accelerate
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.2/flash_attn-2.8.2+cu12torch2.7cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
conda deactivate
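The flash-attn wheel above is built for CUDA 12, torch 2.7, and Python 3.10, so an import test catches a torch/ABI mismatch early:

conda activate vap_server
python -c "import torch, flash_attn; print(torch.__version__, flash_attn.__version__)"
conda deactivate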

3. Checkpoints

Download the required model weights.


A. Paligemma

mkdir -p $HF_HOME
cd $HF_HOME
git clone https://huggingface.co/google/paligemma-3b-pt-224
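Hugging Face model repos store their weights via Git LFS; if git lfs was not installed before cloning, the checkout contains small pointer files instead of the real weights. In that case:

git lfs install                            # one-time setup per machine
cd paligemma-3b-pt-224 && git lfs pull     # replace LFS pointers with the actual weight files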

B. Bridge/Fractal Checkpoints

mkdir -p $HF_HOME/open-pi-zero
cd $HF_HOME/open-pi-zero

# Bridge
wget -c -O bridge_beta_step19296_2024-12-26_22-30_42.pt \
"https://huggingface.co/allenzren/open-pi-zero/resolve/main/bridge_beta_step19296_2024-12-26_22-30_42.pt?download=true"

# Fractal
wget -c -O fractal_beta_step29576_2024-12-29_13-10_42.pt \
"https://huggingface.co/allenzren/open-pi-zero/resolve/main/fractal_beta_step29576_2024-12-29_13-10_42.pt?download=true"

C. VLABench Checkpoint

Download model.safetensors from the link below and place it in $HF_HOME/pi05-vlabench.
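Assuming the file was saved to ~/Downloads (adjust the source path to wherever your browser put it):

mkdir -p $HF_HOME/pi05-vlabench
mv ~/Downloads/model.safetensors $HF_HOME/pi05-vlabench/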


📝 Citation

If you find this work useful in your research, please cite:

@inproceedings{lee2026bring,
  title={Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting},
  author={Lee, Sangoh and Mo, Sangwoo and Han, Wook-Shin},
  booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
  year={2026}
}
