Bring My Cup! ☕
Personalizing Vision-Language-Action Models with Visual Attentive Prompting

ICML 2026 · arXiv · PDF · Project Page · HuggingFace · License: MIT

🎉 Accepted at ICML 2026 🎉


VAP Banner

"To be truly useful in daily life, robots must discern the subtle details that distinguish 'a cup' from 'my cup.'"

Demo clips, one per instruction:

"Put my stuffed toy into the plastic bowl"
"Put my brother's dog figurine and my ornament into the plastic bowl"
"Put my pouch into the plastic bowl"
"Put my cat figurine and my brother's owl figurine into the plastic bowl"
"Put my camera on towel"
"Pick my pen holder"
"Select my leather bag"
"Put my straw cup into the basket"

VAP enables frozen VLA models to manipulate user-specific objects among visually similar distractors.


💡 What is VAP?

Existing VLA models are great at understanding generic commands ("pick up the cup"), but they fail when asked to "pick up my cup" among other similar cups.

Visual Attentive Prompting (VAP) solves this by acting as a pair of "personalized glasses" for the robot.

  1. See & Remember: It takes a few reference photos of your object.
  2. Highlight: It visually detects and highlights the target object in the robot's view.
  3. Act: It guides the frozen VLA model to manipulate the correct object without any expensive training or fine-tuning.
VAP Pipeline

🚀 Run

Once the environments are set up (see 🛠️ Installation below), you can run the benchmarks immediately.

1. Personalized-SIMPLER

Evaluate VAP on SimplerEnv with Bridge/Fractal baselines.

conda activate simpler
bash ./scripts/run_personalized_simpler_vap.sh
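To keep a record of an evaluation run, redirecting the output works; the logs/ directory here is our own example, not something the repo creates:

conda activate simpler
mkdir -p logs
bash ./scripts/run_personalized_simpler_vap.sh 2>&1 | tee logs/personalized_simpler_vap.log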

2. Personalized-VLABench

Evaluate VAP on VLABench tasks.

conda activate vlabench
bash ./scripts/run_personalized_vlabench_vap.sh
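If you want to see which settings the script selects before committing to a long run, printing its header first is a quick, read-only check:

sed -n '1,40p' ./scripts/run_personalized_vlabench_vap.sh  # inspect defaults without running anything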

3. Real-world Scenarios

Deploy VAP as a server for real-world robot experiments.

conda activate vap_server
bash ./scripts/run_realworld_vap_server.sh
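The host and port the server binds to are set inside the script; once it is running, a generic way to confirm it is listening (no repo-specific tooling assumed) is:

ss -tlnp | grep -i python   # lists listening sockets owned by Python processes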

🛠️ Installation

1. Clone & Assets

First, clone the repository and download the necessary data assets.

git clone https://github.com/Leesangoh/VAP.git
cd VAP
git submodule update --init --recursive

# Set Environment Variables
export VAP_HOME=$(pwd)
export HF_HOME=$HOME/.cache/huggingface  # Modify if needed
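These exports only apply to the current shell. To make them persistent across sessions (assuming a bash login shell), you can append them to your profile:

echo "export VAP_HOME=$PWD" >> ~/.bashrc          # $PWD resolves to the VAP checkout at write time
echo 'export HF_HOME=$HOME/.cache/huggingface' >> ~/.bashrc
source ~/.bashrc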

📥 Download Assets: please download the files below and place them in the correct directories.

2. Environment Setup

We use four separate conda environments to manage dependencies for the different baselines. The installation commands for each environment are listed below.

🐍 A. Setup `openpi_torch` Env (For pi0 server)
conda create -n openpi_torch python=3.11 -y
conda activate openpi_torch

cd $VAP_HOME/src/simulation/models/openpi_torch
pip install -r requirements.txt
cd packages/openpi-client
pip install -e .
cd ../../
pip install -e .

# Copy the patched transformers modules over the installed ones
cd $VAP_HOME/src/simulation/models/openpi_torch
cp -r ./src/openpi/models_pytorch/transformers_replace/* $(python -c "import transformers; print(transformers.__path__[0])")
conda deactivate
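A quick way to confirm the patched files landed in the environment's transformers package (a sanity check we suggest, not a repo step):

conda activate openpi_torch
python -c "import transformers; print(transformers.__path__[0])"  # should point inside the openpi_torch env
conda deactivate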
🤖 B. Setup `simpler` Env (For SIMPLER Simulation)
conda create -n simpler python=3.10 -y
conda activate simpler

cd $VAP_HOME/src/simulation/SIMPLER/ManiSkill2_real2sim
pip install -e .
cd ../
pip install -e .

# Dependencies
pip install torch tensorflow==2.15.0 pandas matplotlib omegaconf mediapy websockets
pip install flax==0.5 jax==0.4.1 msgpack hydra-core einops transformers==4.56.0 torchvision bitsandbytes

# Pin jaxlib to match jax 0.4.1
pip uninstall jaxlib -y
pip install "jaxlib==0.4.1" -i https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/
pip install numpy==1.24.4

conda deactivate
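Since jax and jaxlib must match exactly, it is worth verifying the pins after the reinstall:

conda activate simpler
python -c "import jax, jaxlib, numpy; print(jax.__version__, jaxlib.__version__, numpy.__version__)"
# Expected: 0.4.1, 0.4.1, 1.24.4 given the pins above
conda deactivate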
🧪 C. Setup `vlabench` Env (For VLABench Simulation)
conda create -n vlabench python=3.10 -y
conda activate vlabench

cd $VAP_HOME/src/simulation/VLABench
pip install -r requirements.txt
pip install -e .
pip install websockets msgpack torchvision
conda deactivate
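A minimal import check to confirm the editable install succeeded (we assume the package imports as VLABench; adjust if the module name differs):

conda activate vlabench
python -c "import VLABench; print('VLABench import OK')"
conda deactivate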
🖥️ D. Setup `vap_server` Env (For Real-world Server)
conda create -n vap_server python=3.10 -y
conda activate vap_server

pip install torch torchvision numpy msgpack websockets pillow requests
pip install --upgrade transformers accelerate
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.2/flash_attn-2.8.2+cu12torch2.7cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
conda deactivate
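The flash-attn wheel above is built for CUDA 12, torch 2.7, and Python 3.10, so an import test catches a torch/ABI mismatch early:

conda activate vap_server
python -c "import torch, flash_attn; print(torch.__version__, flash_attn.__version__)"
conda deactivate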

3. Checkpoints

Download the required model weights.


A. Paligemma

mkdir -p $HF_HOME
cd $HF_HOME
git clone https://huggingface.co/google/paligemma-3b-pt-224
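Hugging Face model repos store their weights via Git LFS; if git lfs was not installed before cloning, the checkout contains small pointer files instead of the real weights. In that case:

git lfs install                            # one-time setup per machine
cd paligemma-3b-pt-224 && git lfs pull     # replace LFS pointers with the actual weight files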

B. Bridge/Fractal Checkpoints

mkdir -p $HF_HOME/open-pi-zero
cd $HF_HOME/open-pi-zero

# Bridge
wget -c -O bridge_beta_step19296_2024-12-26_22-30_42.pt \
"https://huggingface.co/allenzren/open-pi-zero/resolve/main/bridge_beta_step19296_2024-12-26_22-30_42.pt?download=true"

# Fractal
wget -c -O fractal_beta_step29576_2024-12-29_13-10_42.pt \
"https://huggingface.co/allenzren/open-pi-zero/resolve/main/fractal_beta_step29576_2024-12-29_13-10_42.pt?download=true"

C. VLABench Checkpoint

Download model.safetensors from the link below and place it in $HF_HOME/pi05-vlabench.
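Assuming the file was saved to ~/Downloads (adjust the source path to wherever your browser put it):

mkdir -p $HF_HOME/pi05-vlabench
mv ~/Downloads/model.safetensors $HF_HOME/pi05-vlabench/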


📝 Citation

If you find this work useful in your research, please cite:

@inproceedings{lee2026bring,
  title={Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting},
  author={Lee, Sangoh and Mo, Sangwoo and Han, Wook-Shin},
  booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
  year={2026}
}
