When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
This is the official implementation of Adaptive Visual Imagination Control (AVIC).
Authors: Shoubin Yu*, Yue Zhang*, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal
A single Conda environment holds the VLM framework, the SVC world model, and the RL training stack:
cd visual_spatial_reasoning
conda create -n avic python=3.11 -y
conda activate avic
# CUDA 12.6 builds of PyTorch (adjust to your CUDA)
pip install torch==2.6.0+cu126 torchvision==0.21.0+cu126 torchaudio==2.6.0+cu126 \
--extra-index-url https://download.pytorch.org/whl/cu126
# Stable Virtual Camera world model (editable install; deps in pyproject.toml)
pip install -e stable_virtual_camera/
# Extra deps for RL (GRPO) policy training
pip install -r requirements_train.txt
See visual_spatial_reasoning/README.md
for the full environment, data-preparation, training-free, and RL-training
instructions, including how to download the SAT dataset.
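The install commands above pin the CUDA 12.6 wheels and note that the tag should be adjusted to your CUDA version. As a minimal sketch of that adjustment, the helpers below derive the `cuXYZ` wheel tag and the matching index URL from a CUDA version string. The function names `cuda_tag` and `torch_index_url` are hypothetical and not part of this repository; the sketch also assumes PyTorch publishes wheels for the resulting tag at the pinned versions, which is not true for every CUDA release, so check the PyTorch install page before relying on it.

```shell
# Hypothetical helpers (not part of the repository).

# Turn a CUDA version such as "12.6" into the wheel tag "cu126".
cuda_tag() {
  printf 'cu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Build the PyTorch wheel index URL for that tag.
torch_index_url() {
  printf 'https://download.pytorch.org/whl/%s\n' "$(cuda_tag "$1")"
}

# Example: print the index URL for CUDA 12.6.
torch_index_url 12.6
```

The printed URL is the value passed to `--extra-index-url`; the `+cu126` suffix in the pinned package versions would need the same tag.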
Please follow the MapGPT instructions to set up the Room-to-Room (R2R) evaluation environment.
You need to:
(1) Install the Matterport3D simulator by following the instructions here. We use the latest version rather than v0.1.
(2) Install the MapGPT dependencies and data.
(3) Install Stable Virtual Camera as in the visual spatial reasoning setup.
We set up this environment with Docker and recompile the Matterport3D simulator with Python 3.10; in this case, you will need to install Anaconda inside the Docker container.
Please set up your API keys in api.py for both tasks before running experiments.
Training-free AVIC (closed-source VLM + SVC world model):
cd visual_spatial_reasoning
sh scripts/pipeline_avic.sh
RL-trained policy (GRPO). Prepare the train split, then train and evaluate:
cd visual_spatial_reasoning
python utils/data_process.py --split train # train images for RL
sh scripts/train_qwen_grpo.sh # 8-GPU online GRPO training
sh scripts/batch_eval_ckpts.sh nips_results/<run_dir> # evaluate checkpoints
Our best policy is the adapter_step140 LoRA adapter (Qwen2.5-VL-7B base),
released on Hugging Face:
Shoubin/AVIC-Qwen2.5-VL-7B-policy.
Its exact training and evaluation settings are documented in
visual_spatial_reasoning/README.md,
which also covers data preparation, hyperparameters, and the full RL pipeline.
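The GRPO training script above is described as an 8-GPU run. As a hedged sketch, a pre-flight check like the one below can stop early when fewer GPUs are visible. The function names `gpu_count` and `require_gpus` are hypothetical and not part of this repository, and the sketch assumes `nvidia-smi -L` prints one line starting with `GPU ` per device; a machine without `nvidia-smi` is treated as having zero GPUs.

```shell
# Hypothetical pre-flight check (not part of the repository's scripts).

# Count visible NVIDIA GPUs; report 0 when nvidia-smi is missing or fails.
gpu_count() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
  else
    echo 0
  fi
}

# Return non-zero unless at least $1 GPUs are visible.
require_gpus() {
  need="$1"
  have="$(gpu_count)"
  if [ "$have" -lt "$need" ]; then
    echo "need $need GPUs, found $have" >&2
    return 1
  fi
}
```

One way to use it is `require_gpus 8 && sh scripts/train_qwen_grpo.sh`, so that training only starts when the check passes.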
cd navigation
sh scripts/gpt4o.sh
We thank the developers of MindJourney and MapGPT for their public code releases.
Please cite our paper if you use our models in your work:
@article{yu2026when,
author = {Shoubin Yu and Yue Zhang and Zun Wang and Jaehong Yoon and Huaxiu Yao and Mingyu Ding and Mohit Bansal},
title = {When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning},
journal = {arXiv preprint arXiv:2602.08236},
year = {2026},
}
