Official codebase for the paper “Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension.”
This repository provides tools to generate responses, prepare training data, train the value model for Value-guided Inference with Margin-based Reward (ViMaR), and perform supervised fine-tuning (SFT) of a Vision-Language Model (VLM).
The following poster was presented at the conference and provides a concise overview of our motivation, methodology, and key results for ViMaR.
Create and activate the conda environment:
conda env create -f environment.yml
conda activate vimar

After activating the environment, copy the modified utility files into the appropriate installed packages:
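The destination paths used in this step assume a conda environment under `~/.conda/envs/vimar` running Python 3.12. If the environment lives elsewhere, a minimal Python sketch such as the following could resolve the active interpreter's site-packages directory and perform the same four copies. The names `COPIES` and `install_patches` are hypothetical and not part of this repository; the file mapping mirrors the shell commands in this step.

```python
import shutil
import sysconfig
from pathlib import Path
from typing import List, Optional

# (source file in this repository, destination directory relative to site-packages)
# Hypothetical table mirroring the four cp commands in this README.
COPIES = [
    ("utils/modeling_llava_next.py", "transformers/models/llava_next"),
    ("utils/trainer/td_trainer.py", "trl/trainer"),
    ("utils/__init__.py", "trl"),
    ("utils/trainer/__init__.py", "trl/trainer"),
]


def install_patches(
    repo_root: Path,
    site_packages: Optional[Path] = None,
    dry_run: bool = True,
) -> List[Path]:
    """Copy the patched files into site-packages and return the target paths.

    When site_packages is None, the directory of the running interpreter is
    used, so the function should be called from inside the activated env.
    With dry_run=True nothing is written; the targets are only computed.
    """
    if site_packages is None:
        site_packages = Path(sysconfig.get_paths()["purelib"])

    targets = []
    for src_rel, dst_rel in COPIES:
        src = repo_root / src_rel
        dst_dir = site_packages / dst_rel
        if not src.is_file():
            raise FileNotFoundError(f"missing source file: {src}")
        if not dst_dir.is_dir():
            raise FileNotFoundError(f"{dst_dir} not found; is the package installed?")
        dst = dst_dir / src.name
        if not dry_run:
            shutil.copy2(src, dst)  # overwrites the installed file
        targets.append(dst)
    return targets
```

Calling `install_patches(Path("."))` from the repository root lists the files that would be overwritten, and passing `dry_run=False` performs the copies. With the default conda location and Python 3.12, the equivalent shell commands are: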
cp ./utils/modeling_llava_next.py \
~/.conda/envs/vimar/lib/python3.12/site-packages/transformers/models/llava_next/
cp ./utils/trainer/td_trainer.py \
~/.conda/envs/vimar/lib/python3.12/site-packages/trl/trainer/
cp ./utils/__init__.py \
~/.conda/envs/vimar/lib/python3.12/site-packages/trl/
cp ./utils/trainer/__init__.py \
~/.conda/envs/vimar/lib/python3.12/site-packages/trl/trainer/

Run batch inference to generate responses:
bash ./script/batch_generate.sh
or
python ./script/batch_generate_command.py

Compute CLIP-based scores to prepare temporal-difference (TD) training data:
bash ./script/clip_score.sh
or
python ./script/clip_score.py

Train the value model using the prepared TD data:
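As background for this step, temporal-difference training regresses a value estimate toward a bootstrapped target. The sketch below is generic textbook TD(0) in plain Python, not the code in `td_trainer.py`; the function name `td0_targets`, the discount `gamma`, and the reading of per-sentence CLIP-based scores as rewards are assumptions made for illustration.

```python
from typing import List, Sequence


def td0_targets(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.9,
) -> List[float]:
    """Return one-step TD targets r_t + gamma * V(s_{t+1}) for one response.

    rewards[t] is the reward at step t (assumed here to be a CLIP-based score).
    values[t] is the current value estimate V(s_t) for step t.
    The step after the last one is treated as terminal, with value 0.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    targets = []
    for t, reward in enumerate(rewards):
        # Bootstrap from the next step's estimate, or 0.0 at the end.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        targets.append(reward + gamma * next_value)
    return targets
```

For example, `td0_targets([1.0, 2.0, 3.0], [0.5, 1.0, 2.0], gamma=0.5)` returns `[1.5, 3.0, 3.0]`. The repository's own value-model training is launched with: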
bash ./script/train_value.sh
or
python ./script/train_value.py

Perform supervised fine-tuning of the vision-language model:
bash ./script/train_sft.sh

If you find this work useful, please cite our paper:
@article{deria2025dual,
title={Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning},
author={Deria, Ankan and Dukre, Adinath Madhavrao and Tang, Feilong and Atito, Sara and Roy, Sudipta and Awais, Muhammad and Khan, Muhammad Haris and Razzak, Imran},
journal={arXiv preprint arXiv:2506.15649},
year={2025}
}

We thank the authors of the original VisVM repository for releasing their code: https://github.com/si0wang/VisVM
