Bowen Qu1*, Shangkun Sun1,2*, Xiaoyu Liang1, Wei Gao1,2
1 Peking University, 2 Peng Cheng Laboratory
(* equal contribution)
If you have any questions, feel free to contact us 📧.
The "R1 Moment" of Image Editing Quality Assessment.
IE-Critic-R1 is a Multimodal Large Language Model (MLLM) specialized in assessing the quality of text-driven image editing results. It is a pointwise, generative reward model that leverages Chain-of-Thought (CoT) SFT and RLVR (Reinforcement Learning with Verifiable Rewards) to provide accurate, human-aligned evaluations of image editing.
- [2025.11] IE-Bench-4k, IE-Critic-R1 Model and all the SFT Data are released on HuggingFace.🤗
IE-Critic-R1, a pointwise generative reward model, treats image editing quality assessment as a reasoning task. Unlike traditional score prediction models, IE-Critic-R1 generates a reasoning trace (within <think>...</think>) before outputting the final score (within <answer>...</answer>). This approach improves the explainability and accuracy of the assessment.
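For illustration, a hypothetical response (the wording of the reasoning trace and the exact score format will vary) might look like:

```
<think>
The instruction asks to make the sky blue. The edited image applies this change (good text alignment),
preserves the foreground subjects and layout of the source image (good fidelity), and shows no visible
artifacts (good perceptual quality).
</think>
<answer>4.5</answer>
```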
Key features:
- Comprehensive Evaluation: We propose IE-Bench-4k, a comprehensive benchmark for text-driven image editing quality assessment, covering text alignment, fidelity, perceptual quality, and an overall score.
- Chain-of-Thought Reasoning: Explicitly reasons about text alignment, fidelity, and perceptual quality before final scoring.
- Reinforcement Learning: Optimized with GRPO and a verifiable reward to align with human preferences (MOS).
- Superior Performance: Achieves state-of-the-art performance on our proposed IE-Bench-4k dataset, as well as AGIQA-3k (a benchmark for AI-generated image quality assessment).
- Clone the repository:

  ```bash
  git clone https://github.com/Coobiw/IE-Critic-R1.git
  cd IE-Critic-R1
  ```

- Create a conda environment:

  ```bash
  conda create -n ie_critic python=3.10
  conda activate ie_critic
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Note: You may need to install `flash-attn` separately depending on your CUDA version.
We release the IE-Bench-4k dataset, which contains source images, edited images, editing instructions, and human-annotated quality scores (MOS).
Download it from HuggingFace: Coobiw/IE-Bench-4k
P.S.: We also release the mixed SFT data (including CoT data and direct-scoring data) on HuggingFace: Coobiw/IE-Bench-CoT-mixed. You can use it with the LLaMA-Factory repository to train a Qwen2.5-VL model and obtain the IE-Critic-CoT model.
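If you prefer to fetch the data programmatically, here is a minimal sketch using `huggingface_hub` (repository IDs taken from above; the internal file layout is not assumed here):

```python
from huggingface_hub import snapshot_download

# Download the full dataset repos locally; inspect the returned paths for the actual file layout.
bench_dir = snapshot_download(repo_id="Coobiw/IE-Bench-4k", repo_type="dataset")
sft_dir = snapshot_download(repo_id="Coobiw/IE-Bench-CoT-mixed", repo_type="dataset")
print(bench_dir, sft_dir)
```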
IE-Critic-R1 is trained with the EasyR1 library using the GRPO algorithm.
To train the model:
- Configure the training parameters in `examples/config_ie_critic_r1.yaml`.
- Set `MODEL_PATH` in `examples/ie_critic_r1.sh` to your base model path (e.g., an SFT model such as IE-Critic-CoT, or `Qwen/Qwen2.5-VL-7B-Instruct`).
- Run the training script:

  ```bash
  bash examples/ie_critic_r1.sh
  ```
The training uses IE-Critic-CoT as the base model and optimizes it on the IE-Bench-4k dataset with a reward based on the L1/Gaussian/Laplace distance between the predicted score and the ground-truth score (L1 is the final choice used for IE-Critic-R1).
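For intuition, here is a minimal sketch of such a distance-to-MOS reward (the function name, score range, and bandwidths are assumptions, not the exact reward in the released config):

```python
import math

def score_reward(predicted: float, gt_mos: float, kind: str = "l1",
                 score_range: float = 4.0) -> float:
    """Map the distance between the predicted score and the ground-truth MOS to a reward in [0, 1].

    Assumption: scores lie on a 1-5 scale (range 4.0); the real config may use a different scale.
    """
    dist = abs(predicted - gt_mos)
    if kind == "l1":            # linear penalty, the final choice for IE-Critic-R1
        return max(0.0, 1.0 - dist / score_range)
    if kind == "gaussian":      # sharp penalty concentrated around the ground truth
        return math.exp(-(dist ** 2) / (2 * 0.5 ** 2))
    if kind == "laplace":       # exponential, L1-shaped penalty
        return math.exp(-dist / 0.5)
    raise ValueError(f"unknown reward kind: {kind}")
```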
To evaluate the trained model on the test set:
```bash
python scripts/iebench_eval.py --output_fname results/ie_critic_r1_test.json --save
```

This script will:

- Load the model (default: `Coobiw/IE-Critic-R1-7B`).
- Run inference on the IE-Bench-4k test set.
- Extract the score from the generated response.
- Calculate PLCC (Pearson Linear Correlation Coefficient) and SRCC (Spearman Rank Correlation Coefficient) against human scores.
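As a rough sketch of the last two steps (the regex and metric code below are illustrative, not the exact implementation in `scripts/iebench_eval.py`):

```python
import re
from scipy.stats import pearsonr, spearmanr

def extract_score(response: str) -> float | None:
    """Pull the numeric score out of the <answer>...</answer> block."""
    m = re.search(r"<answer>\s*([0-9]+(?:\.[0-9]+)?)\s*</answer>", response)
    return float(m.group(1)) if m else None

def correlation(preds: list[float], mos: list[float]) -> tuple[float, float]:
    """PLCC and SRCC between predicted scores and human MOS."""
    plcc, _ = pearsonr(preds, mos)
    srcc, _ = spearmanr(preds, mos)
    return plcc, srcc
```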
To evaluate the trained model on the test set using vLLM:
```bash
bash scripts/vllm_serve.sh Coobiw/IE-Critic-R1-7B IE-Critic-R1-7B
python scripts/iebench_eval_vllm.py --output_fname results/ie_critic_r1_test.json --save --api_url http://localhost:8000
```

This script will:

- Load the model (default: `Coobiw/IE-Critic-R1-7B`).
- Run inference on the IE-Bench-4k test set using vLLM.
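If you want to query the served model directly instead of going through the evaluation script, a minimal sketch against a vLLM OpenAI-compatible endpoint (assuming `vllm_serve.sh` starts a standard `vllm serve` server on port 8000; the served model name, image URLs, and prompt are illustrative) could look like:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server ignores the API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="IE-Critic-R1-7B",  # must match the served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/source.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/edited.jpg"}},
            {"type": "text", "text": "Edit Instruction: Make the sky blue.\n..."},
        ],
    }],
)
print(response.choices[0].message.content)
```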
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Coobiw/IE-Critic-R1-7B", torch_dtype="bfloat16", device_map="cuda"
).eval()
processor = AutoProcessor.from_pretrained("Coobiw/IE-Critic-R1-7B")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/source.jpg"},
            {"type": "image", "image": "path/to/edited.jpg"},
            {"type": "text", "text": "Edit Instruction: Make the sky blue.\n..."}
        ]
    }
]

# Standard Qwen2.5-VL inference: build the prompt, pack the images, generate, decode.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

If you find this project useful, please cite our paper:
```bibtex
@article{IECriticR1,
  title={IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment},
  author={Bowen Qu and Shangkun Sun and Xiaoyu Liang and Wei Gao},
  journal={arXiv preprint arXiv:2511.18055},
  year={2025}
}
```

- This project is built upon Qwen2.5-VL.
- LLaMA-Factory for the SFT framework.
- EasyR1 and verl for the RLVR training framework.
- vLLM for the inference serving framework.
- Sincere thanks to EasyR1 and Awesome-MLLM-Reasoning-Collection for including our work.

