Official implementation of the paper "VisualPrompter: Semantic-Aware Prompt Optimization with Visual Feedback for Text-to-Image Synthesis", accepted at ICLR 2026.
VisualPrompter is a training‑free prompt engineering framework that automatically refines user prompts to better align with text‑to‑image models. It operates at the atomic semantic level: a self‑reflection module (SERE) analyses the generated image to identify concepts from the prompt that are missing, and a target‑specific optimisation module (TSPO) expands only those concepts while preserving the original intent. The result is a semantically faithful prompt, and a generated image with higher fidelity to the user's description.
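The overall procedure reduces to a simple refine loop: derive a set of atomic yes/no questions from the prompt, check each one against the generated image, and rewrite only the failed concepts. Below is a minimal, hypothetical sketch of that loop; the function names (`build_dsg`, `generate_image`, `answer_question`, `rewrite_prompt`) are illustrative placeholders, not the repository's actual API.

```python
# Hypothetical sketch of the VisualPrompter loop; function names are
# illustrative placeholders, not the repo's actual API.
from typing import Callable, List

def visual_prompter(
    prompt: str,
    build_dsg: Callable[[str], List[str]],          # LLM: prompt -> atomic yes/no questions
    generate_image: Callable[[str], object],        # T2I model
    answer_question: Callable[[object, str], bool], # VLM judge
    rewrite_prompt: Callable[[str, List[str]], str],# LLM: expand missing concepts
    max_rounds: int = 3,
) -> str:
    """Iteratively refine `prompt` until every atomic concept is depicted."""
    questions = build_dsg(prompt)  # one question per atomic concept
    for _ in range(max_rounds):
        image = generate_image(prompt)
        # Self-reflection (SERE): find concepts the image fails to depict.
        missing = [q for q in questions if not answer_question(image, q)]
        if not missing:
            break  # all concepts present; stop early
        # Target-specific optimisation (TSPO): expand only the missing concepts.
        prompt = rewrite_prompt(prompt, missing)
    return prompt
```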
Clone the repository and install the required dependencies:
```bash
pip install -r requirements.txt
```

VisualPrompter uses the following open‑source models internally:
- LLM: Qwen2.5 14B (or other size) for DSG generation and prompt rewriting.
- VLM: Qwen2‑VL 7B for visual question answering.
- Generative Models: Stable Diffusion v1.5 / v2.1, FLUX.1-dev, and Janus-Pro.
Please download them from Hugging Face, or let the scripts load them automatically.
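If you prefer to fetch the checkpoints yourself, a minimal loading sketch using the standard `transformers` and `diffusers` APIs is shown below; the Hugging Face checkpoint IDs are the public ones and may differ from those the scripts are configured to use.

```python
# Minimal sketch of loading the backbone models from Hugging Face.
# Checkpoint IDs are the public ones and may differ from the scripts' defaults.
import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,             # LLM
    Qwen2VLForConditionalGeneration, AutoProcessor,  # VLM
)
from diffusers import StableDiffusionPipeline        # T2I

llm = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
llm_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

vlm = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
vlm_proc = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

t2i = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
```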
We provide evaluation scripts for two benchmarks: DSG‑1k and TIFA v1.0.
Please review the scripts first, then run the full evaluation pipeline with:

```bash
bash scripts/eval_dsg.sh
bash scripts/eval_tifa.sh
```

The scripts generate images with both the original prompts and the prompts optimised by VisualPrompter, then compute semantic accuracy via the VLM judge.
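For reference, the judge-based metric amounts to the fraction of atomic questions answered "yes" per image, averaged over the benchmark. The sketch below illustrates this; `ask_vlm` is a hypothetical wrapper around the VLM, not a function exported by this repository.

```python
# Illustrative semantic-accuracy metric: fraction of atomic yes/no questions
# the VLM judge answers "yes" per image, averaged over the benchmark.
# `ask_vlm` is a hypothetical wrapper around the VLM.
from typing import Callable, Dict, List

def semantic_accuracy(
    samples: List[Dict],                    # each: {"image": ..., "questions": [...]}
    ask_vlm: Callable[[object, str], str],  # returns "yes" / "no"
) -> float:
    per_image = []
    for s in samples:
        answers = [ask_vlm(s["image"], q) for q in s["questions"]]
        per_image.append(
            sum(a.strip().lower() == "yes" for a in answers) / len(answers))
    return sum(per_image) / len(per_image)
```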
To test VisualPrompter on a single prompt, use the provided demo:
```bash
streamlit run scripts/demo/demo.py
```

This will generate an image with your chosen models both before and after optimisation and display the results side by side.
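For orientation, the demo page roughly corresponds to the following skeleton; `optimize_prompt` and `generate_image` are placeholders for the repository's real entry points, whose actual names may differ.

```python
# Hypothetical minimal version of the demo page; `optimize_prompt` and
# `generate_image` stand in for the repo's real entry points.
import streamlit as st
from PIL import Image

def optimize_prompt(p: str) -> str:
    return p + ", highly detailed"       # placeholder for VisualPrompter

def generate_image(p: str) -> Image.Image:
    return Image.new("RGB", (512, 512))  # placeholder for the T2I model

prompt = st.text_input("Prompt", "a red fox reading a book under a maple tree")
if st.button("Generate"):
    refined = optimize_prompt(prompt)
    left, right = st.columns(2)
    left.image(generate_image(prompt), caption="original prompt")
    right.image(generate_image(refined), caption=f"optimised: {refined}")
```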
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{wu2026vp,
  title     = {VisualPrompter: Semantic-Aware Prompt Optimization with Visual Feedback for Text-to-Image Synthesis},
  author    = {Shiyu Wu and Mingzhen Sun and Weining Wang and Yequan Wang and Jing Liu},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
```
