ViMaR

Official codebase for the paper “Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning.”

This repository provides tools to generate responses, prepare training data, train the Value-guided Inference with Margin-based Reward (ViMaR) value model, and perform supervised fine-tuning (SFT) of a Vision-Language Model (VLM).

🖼️ Conference Presentation Poster

The poster (Poster.png in the repository root) was presented at the conference and gives a concise overview of the motivation, methodology, and key results for ViMaR.

📄 Paper: Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

📦 Environment Setup

Create and activate the conda environment:

conda env create -f environment.yml
conda activate vimar
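
Before moving on, it can help to confirm that the activated environment actually exposes the packages patched in the next step. The snippet below is a minimal sketch, not part of the repository: `missing_packages` is a hypothetical helper, and the package names checked (`transformers`, `trl`) are taken from the patch paths in this README rather than from `environment.yml`.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found in this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The patch commands in this README assume a python3.12 site-packages directory.
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    missing = missing_packages(["transformers", "trl"])
    print("missing packages:", missing if missing else "none")
```

If anything is reported missing, the environment was probably not created or activated correctly.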

Patch Required Libraries

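After activating the environment, copy the modified utility files into the installed transformers and trl packages. The paths below assume the environment lives under ~/.conda/envs/vimar and uses Python 3.12; adjust them if your setup differs: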

cp ./utils/modeling_llava_next.py \
~/.conda/envs/vimar/lib/python3.12/site-packages/transformers/models/llava_next/

cp ./utils/trainer/td_trainer.py \
~/.conda/envs/vimar/lib/python3.12/site-packages/trl/trainer/

cp ./utils/__init__.py \
~/.conda/envs/vimar/lib/python3.12/site-packages/trl/

cp ./utils/trainer/__init__.py \
~/.conda/envs/vimar/lib/python3.12/site-packages/trl/trainer/

⚠️ Note: These steps overwrite files in your local Python environment. It is strongly recommended to use a dedicated conda environment.
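
If your conda installation is not under `~/.conda` or uses a different Python version, the hard-coded destination paths will not match. The following is a minimal sketch of the same four copies with the site-packages directory resolved at run time via `sysconfig`. The helper names (`build_copy_plan`, `apply_copy_plan`) are hypothetical and not part of the repository, and the script defaults to a dry run that only prints what it would copy.

```python
import shutil
import sysconfig
from pathlib import Path
from typing import List, Tuple

# (source relative to the repo root, destination directory relative to site-packages),
# mirroring the four cp commands in this README.
PATCHES = [
    ("utils/modeling_llava_next.py", "transformers/models/llava_next"),
    ("utils/trainer/td_trainer.py", "trl/trainer"),
    ("utils/__init__.py", "trl"),
    ("utils/trainer/__init__.py", "trl/trainer"),
]


def build_copy_plan(repo_root: Path, site_packages: Path) -> List[Tuple[Path, Path]]:
    """Return (source file, destination file) pairs without touching the disk."""
    return [
        (repo_root / src, site_packages / dst / Path(src).name)
        for src, dst in PATCHES
    ]


def apply_copy_plan(plan: List[Tuple[Path, Path]], dry_run: bool = True) -> None:
    """Print each copy; perform it only when dry_run is False."""
    for src, dst in plan:
        print(f"{src} -> {dst}")
        if not dry_run:
            shutil.copy2(src, dst)


if __name__ == "__main__":
    # "purelib" is the site-packages directory of the running interpreter.
    site_packages = Path(sysconfig.get_paths()["purelib"])
    apply_copy_plan(build_copy_plan(Path("."), site_packages), dry_run=True)
```

Run it from the repository root with the `vimar` environment active, check the printed destinations, then switch `dry_run` to `False` to overwrite the files.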

🧠 Generate Model Responses

Run batch inference to generate responses:

bash ./script/batch_generate.sh

or

python ./script/batch_generate_command.py

🧪 Prepare TD Training Data

Compute CLIP-based scores to prepare temporal-difference (TD) training data:

bash ./script/clip_score.sh

or

python ./script/clip_score.py
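
As a rough illustration of what this step involves, the sketch below scores each sentence of a response against an image by cosine similarity and arranges the result as (sentence, reward, next sentence) transitions. It is an assumption-laden toy, not the repository's implementation: the embeddings are plain Python lists standing in for CLIP image and text features, the function names are hypothetical, and the real scripts may store the data in a different layout or adjust the rewards further.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_td_transitions(
    sentences: Sequence[str], rewards: Sequence[float]
) -> List[Tuple[str, float, Optional[str]]]:
    """Pair each sentence with its reward and its successor (None for the last one)."""
    if len(sentences) != len(rewards):
        raise ValueError("sentences and rewards must have the same length")
    transitions = []
    for i, (sentence, reward) in enumerate(zip(sentences, rewards)):
        next_sentence = sentences[i + 1] if i + 1 < len(sentences) else None
        transitions.append((sentence, reward, next_sentence))
    return transitions


if __name__ == "__main__":
    # Toy stand-ins for CLIP features (assumed values, for illustration only).
    image_embedding = [1.0, 0.0]
    sentence_embeddings = {"A dog on grass.": [1.0, 0.0], "It is raining.": [0.0, 1.0]}
    sentences = list(sentence_embeddings)
    rewards = [cosine_similarity(image_embedding, sentence_embeddings[s]) for s in sentences]
    for transition in build_td_transitions(sentences, rewards):
        print(transition)
```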

📈 Train Vision Value Model (ViMaR)

Train the value model using the prepared TD data:

bash ./script/train_value.sh

or

python ./script/train_value.py
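
For readers unfamiliar with temporal-difference learning, the sketch below shows the textbook one-step TD target and squared-error loss that a value model of this kind is typically trained against. It is a generic illustration only: the discount factor `gamma`, the loss form, and the function names are assumptions, and the margin-based reward adjustment described in the paper is not shown. The trainer actually used by this repository is `utils/trainer/td_trainer.py`.

```python
def td_target(reward: float, next_value: float, gamma: float = 0.9, terminal: bool = False) -> float:
    """One-step TD target: r + gamma * V(s'), or just r at the end of a response."""
    if terminal:
        return reward
    return reward + gamma * next_value


def td_loss(value: float, reward: float, next_value: float,
            gamma: float = 0.9, terminal: bool = False) -> float:
    """Squared error between the predicted value V(s) and the TD target."""
    error = value - td_target(reward, next_value, gamma=gamma, terminal=terminal)
    return error * error


if __name__ == "__main__":
    # A prediction that already equals the target incurs zero loss.
    print(td_loss(value=2.0, reward=1.0, next_value=2.0, gamma=0.5))
```

In a real training loop the values would come from the value model's forward pass on the current and next sentence, and the target would usually be treated as a constant when backpropagating.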

🎯 Supervised Fine-Tuning (SFT) of VLM

Perform supervised fine-tuning of the vision-language model:

bash ./script/train_sft.sh

📄 Citation

If you find this work useful, please cite our paper:

@article{deria2025dual,
  title={Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning},
  author={Deria, Ankan and Dukre, Adinath Madhavrao and Tang, Feilong and Atito, Sara and Roy, Sudipta and Awais, Muhammad and Khan, Muhammad Haris and Razzak, Imran},
  journal={arXiv preprint arXiv:2506.15649},
  year={2025}
}

🙏 Acknowledgements

We thank the authors of the original VisVM repository for releasing their code: https://github.com/si0wang/VisVM
