Official implementation of PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization.
- π News
- π Overview
- π Persona-MME Benchmark
- π Performance
- π Quick Start
- π» Interactive Gradio Demo
- βοΈ Citation
- [2026.04] π’ PersonaVLM officially Open-Sourced! Model weights, training data, and the Persona-MME benchmark are now released!
PersonaVLM transforms general-purpose MLLMs (e.g., Qwen2.5-VL) into personalized assistants capable of long-term interaction. It achieves this through a collaborative two-stage process featuring three core capabilities:
- π§ Remembering: Proactively extracts and summarizes multimodal conversational histories into a structured, multi-type database (Core, Semantic, Episodic, and Procedural memories).
- π‘ Reasoning: Conducts multi-turn reasoning by dynamically retrieving and integrating relevant long-term memories based on the conversation context.
- π€ Response Alignment: Infers the user's latest latent traits using a Momentum-based Personality Evolving Mechanism (PEM), ensuring generated responses are deeply aligned with the user's evolving characteristics.
To rigorously evaluate long-term personalization in multimodal settings, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases designed to assess MLLMs across 14 fine-grained personalization tasks.
Extensive experiments demonstrate that PersonaVLM significantly enhances a model's personalization capabilities and consistently outperforms strong counterparts, including proprietary models like GPT-4o and leading open-source alternatives. Under a 128k context setting, PersonaVLM achieves substantial improvements of 22.4% on Persona-MME and 9.8% on PERSONAMEM, surpassing GPT-4o by 5.2% and 2.0%, respectively.
Before starting, ensure your project directory is organized as follows. The core reasoning and memory management logic is encapsulated within the PersonaVLM/ module.
PersonaVLM/
βββ checkpoints/
β βββ rl/ # Download PersonaVLM weights here
β βββ openai/clip-vit-base-patch32 # Local CLIP model
β βββ sentence-transformers/all-MiniLM-L6-v2
βββ data/
β βββ Persona-MME/ # Benchmark dataset for evaluation
β βββ training/ # SFT and RL synthesized datasets
βββ PersonaVLM/ # Core PersonaVLM Agent Logic
β βββ model.py
β βββ PersonaVLMAgent.py
β βββ prompts.py
β βββ retriever.py
β βββ tools.py
β βββ utils.py
βββ train/ # SFT and RL training scripts
βββ eval.py # Evaluation script for Persona-MME
βββ inference.py # CLI inference script
βββ gradio_demo.py # Interactive Web UI
Create a conda environment and install the required dependencies.
conda create -n PersonaVLM python=3.10 -y
conda activate PersonaVLM
pip install -r requirements.txtWarning
Since PersonaVLM is built upon Qwen2.5-VL, it is strictly required to install transformers==4.51.3 to prevent compatibility issues.
- Model Weights: Download the official PersonaVLM Model and place it under
./checkpoints/rl. - Retrieval Models: Download CLIP and all-MiniLM-L6-v2 models to the
./checkpoints/directory. - Datasets: Download Persona-MME (for evaluation) and PersonaVLM-Dataset (for training) into
./data/.
We provide a CLI script to chat with the PersonaVLM agent. It supports both standard reasoning and forced retrieval configurations:
python inference.py
# Optional arguments:
# --force-retrieve : Force the agent to execute memory retrieval for every message.
# --reasoning-mode : Flag to toggle reasoning (default is True; passing this sets it to False).To evaluate the model's long-term personalization capabilities on the Persona-MME benchmark:
python eval.py \
--model_path ./checkpoints/rl \
--bench_context 32k \
--save_dir ./output/32k_Persona_MME_resultsPersonaVLM is trained using a two-stage pipeline: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
Note on Qwen2.5-VL: The Qwen2.5-VL repository used for training does not support mixing pure-text and multimodal samples in a single batch. To resolve this, we append a dummy conversation to each text-only sample and mask the final interaction step during loss computation (Implementation details in ./train/sft/Qwen2.5-VL/qwen-vl-finetune/qwenvl/data/data_qwen.py).
# 1. Regenerate data with dummy convs
python ./data/training/sft/regenerate.py
# 2. Launch SFT training
cd ./train/sft/Qwen2.5-VL/qwen-vl-finetune
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 sh scripts/sft_PersonaVLM.shOur RL stage is implemented based on the ms-swift framework. Key modifications for our custom reward design are located in scripts/qwen_server.py and examples/train/grpo/plugin/plugin.py.
Before starting RL, you must deploy the reward model (we use Qwen3-30B-A3B-Instruct-2507) via vLLM:
# 1. Deploy the Reward Model Server
MASTER_PORT=29501 CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server \
--model /path/to/Qwen3-30B-A3B-Instruct-2507 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8080 \
--trust-remote-code \
--gpu-memory-utilization 0.9
# (Optional) Test the reward server connection
python ./train/ms-swift-main/scripts/qwen_server.py
# 2. Launch RL Training
cd ./train/ms-swift-main
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 sh scripts/rl.shWe provide an interactive Gradio-based playground to explore the "R3" process (Proactive Remembering, Multi-step Reasoning, and Personality-based Response Alignment). The UI allows you to visualize the agent's internal cognitive steps.
To start the demo, ensure your model weights are in ./checkpoints/rl and run:
python gradio_demo.pyIf you find our work helpful, please cite:
@inproceedings{nie2026personavlm,
title={PersonaVLM: Long-Term Personalized Multimodal LLMs},
author={Nie, Chang and Fu, Chaoyou and Zhang, Yifan and Yang, Haihua and Shan, Caifeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026},
url={http://arxiv.org/abs/2604.13074}
}




