Ruichuan An, Sihan Yang*, Renrui Zhang†, Zijun Shen, Ming Lu, Gaole Dai
Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo,
Bocheng Zou, Chaoqun Yang, Wentao Zhang‡
* Equal Contribution † Project Leader ‡ Corresponding Author
🎯 Project Page | 📄 Paper | 📦 Dataset | 🚀 Quick Start | ⚙️ Training | 📊 Evaluation | 📜 License | 📝 Citation | 📬 Contact
UniCTokens is a framework that integrates personalized concept information into a unified vision-language model (VLM) for both understanding and generation tasks. Existing methods typically treat understanding and generation separately, which limits the model's ability to generate images of a personalized concept from complex, knowledge-dependent prompts.
UniCTokens addresses this limitation through a three-stage progressive training strategy:
- Understanding warm-up.
- Bootstrapping generation from understanding.
- Deepening understanding from generation (Generation as Perception).
Our research demonstrates that enhanced understanding improves generation, and that the generation process can in turn yield insights that deepen understanding.
- 🔄 Unified Concept Tokens: Unifying personalized understanding and generation tasks in a single model.
- 🧠 Personalized Knowledge-Driven Generation: Leveraging external personalized knowledge for complex image generation.
- 📈 Mutual Enhancement: Three-stage strategy promoting mutual enhancement of understanding and generation, achieving cross-task information transfer.
- 📊 UnifyBench: The first benchmark for assessing personalized understanding, generation, and personalized knowledge-driven generation all in one.
| Sub-task | Source files | Evaluation focus |
|---|---|---|
| Text-Only QA | `test/<concept>/text_only.json` | Checks whether the model remembers concept knowledge (no image) |
| VQA | `test/<concept>/vqa.json` + image | Visual question answering about the concept image |
| Rec | `test/*.png` | Pure visual recognition capability |
| Mode | Input | Metrics |
|---|---|---|
| Vanilla generation | Prompts from the DreamBooth Dataset → target-concept images | CLIP-I / CLIP-T · ArcFace similarity |
| Personalized knowledge-driven | `t2i_conditions.json` | Combined T2I score: must satisfy both visual and textual attributes |
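For intuition, CLIP-I (image–image) and CLIP-T (image–text) scores both reduce to cosine similarity between CLIP embeddings, averaged over generated/reference pairs. The following is a minimal plain-Python sketch of that averaging step, assuming the embeddings have already been extracted as vectors; the `clip_score` helper is illustrative and not part of the released `eval/clip_eval.py`.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(gen_embeds, ref_embeds):
    """CLIP-I-style score: mean pairwise cosine similarity over
    all (generated, reference) embedding pairs. For CLIP-T, the
    reference embeddings would come from the text encoder instead."""
    scores = [cosine_similarity(g, r) for g in gen_embeds for r in ref_embeds]
    return sum(scores) / len(scores)

# Toy example with 2-D embeddings
print(clip_score([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

In practice the embeddings are L2-normalized CLIP features, so the dot product alone would suffice; the explicit normalization above just makes the sketch robust to unnormalized inputs.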
| Item | Description |
|---|---|
| Total concepts | 20 (Human × 10 · Animal × 5 · Object × 5) |
| Images per concept | N ≈ 10 – 15 (already split into train / test) |
| Negative samples | random_images/ (100 random irrelevant images) + negative_example/ (hard negatives) |
```
UniCTokens/
├── black_512x512.png                    # Pure black placeholder
├── concepts_list.json                   # List of 20 concept names
├── template.json                        # Template for generating training data
├── random_images/                       # 100 simple negative samples for training
│   ├── 0.png
│   └── … 99.png
├── concept/                             # 🔑 Concept data (train / test)
│   ├── train/
│   │   └── <concept_name>/              # 20 folders
│   │       ├── 0.png … N.png            # Original training images
│   │       ├── cropped/                 # Cropped regions
│   │       ├── info.json                # Concept profile & extra info
│   │       ├── conversations.json       # Training dialogues
│   │       ├── positive_recognitions.json  # Positive QA pairs
│   │       ├── random_recognitions.json    # Negative QA pairs
│   │       └── negative_example/        # Hard negatives + score.json
│   └── test/
│       └── <concept_name>/
│           ├── 0.png … 4.png
│           ├── text_only.json           # Text-only QA
│           ├── vqa.json                 # VQA pairs
│           └── t2i_conditions.json      # Conditions for knowledge-driven T2I
├── gen_showo_training_data.py           # Script to create Stage-1/2/3 training files
├── gen_test_data.py                     # Script to create all evaluation files
└── README.md
```
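The layout above can be navigated programmatically. Below is a minimal sketch that builds the expected file locations for one concept, assuming the dataset root follows the tree exactly; the `concept_paths` helper is illustrative and not a released script.

```python
import os

def concept_paths(data_root, concept, split="train"):
    """Map expected JSON file names to their paths for one concept,
    following the dataset tree (train or test split)."""
    base = os.path.join(data_root, "concept", split, concept)
    if split == "train":
        names = ["info.json", "conversations.json",
                 "positive_recognitions.json", "random_recognitions.json"]
    else:
        names = ["text_only.json", "vqa.json", "t2i_conditions.json"]
    return {name: os.path.join(base, name) for name in names}

paths = concept_paths("/path/to/UniCTokens_Dataset", "bo", split="test")
print(paths["vqa.json"])
```

Training images (`0.png … N.png`), `cropped/`, and `negative_example/` live alongside these JSON files in each concept's folder.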
- Set the dataset root

  Open `gen_showo_training_data.py` and `gen_test_data.py` in the dataset root, and change

  ```python
  DATA_ROOT = "/path/to/UniCTokens_Dataset"
  ```

  to the actual dataset path.

- Generate data

  ```shell
  # Create Stage-1/2/3 training samples
  python gen_showo_training_data.py

  # Create MMU & T2I evaluation samples
  python gen_test_data.py
  ```
First, install dependencies:

```shell
pip install -r requirements.txt
```

Our training is conducted per concept, and we provide a three-stage training framework script that allows training for any given concept:
```shell
concept="bo"

# Stage 1: Understanding warm-up
python train_w_3_stages/train_p_stage_1.py --concept "${concept}" --data_root <path/to/uni_c_tokens_data> --task_name test_train_s1 --need_new_tokens --mmu_data --init_by_images --need_init

# Stage 2: Bootstrapping generation from understanding
python train_w_3_stages/train_p_stage_2.py --concept "${concept}" --data_root <path/to/uni_c_tokens_data> --task_name test_train_s2 --pre_trained_ckpt_name test_train_s1 --t2i_data --mmu_data

# Transition from Stage 2 to Stage 3
python train_w_3_stages/stage_2_to_3_v1.py --concept "${concept}" --ckpt_name test_train_s2

# Stage 3: Deepening understanding from generation
python train_w_3_stages/train_p_stage_3.py --concept "${concept}" --data_root <path/to/uni_c_tokens_data> --task_name example --pre_trained_ckpt_name test_train_s2 --t2i_data --mmu_data
```

Our evaluation procedure follows the same per-concept setup as training: each concept is evaluated individually. We provide scripts to evaluate any given concept across various metrics.
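To run the three training stages over many concepts, a thin driver can assemble the per-stage command lines from `concepts_list.json`. The sketch below only builds the argv lists (mirroring the flags shown above); it is illustrative, not a released script, and assumes `concepts_list.json` is a JSON array of concept names. The stage-2→3 transition (`stage_2_to_3_v1.py`) would be invoked between stages 2 and 3, as in the shell commands above.

```python
def stage_cmd(stage, concept, data_root, task_name, prev_ckpt=None):
    """Assemble the argv list for one training stage.

    Stage 1 gets the warm-up flags; stages 2 and 3 resume from the
    previous stage's checkpoint and add the T2I data flag."""
    cmd = ["python", f"train_w_3_stages/train_p_stage_{stage}.py",
           "--concept", concept, "--data_root", data_root,
           "--task_name", task_name]
    if stage == 1:
        cmd += ["--need_new_tokens", "--mmu_data",
                "--init_by_images", "--need_init"]
    else:
        cmd += ["--pre_trained_ckpt_name", prev_ckpt,
                "--t2i_data", "--mmu_data"]
    return cmd

# e.g. load concepts from concepts_list.json and pass each cmd to subprocess.run
cmd = stage_cmd(2, "bo", "/data/UniCTokens_Dataset", "test_train_s2",
                prev_ckpt="test_train_s1")
print(" ".join(cmd))
```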
⚠️ The following experiment examples are all run with the weights under `saves/bo/example`; therefore the `concept` parameter is `bo` and `ckpt_name` is `example`.
- Personalized Understanding Evaluation Script

  First, set your DeepSeek API key in `eval/eval_p_mmu.py`:

  ```python
  CLIENT = init_deepseek("your api key")
  ```

  Then run the evaluation:

  ```shell
  python eval/eval_p_mmu.py --data_root <path/to/uni_c_tokens_data> --concept bo --ckpt_name example --epoch_to_load 20
  ```

- Personalized Generation — Pure Generation Evaluation Script
  For Pure Generation, we use the test prompts from the DreamBooth dataset to compute CLIP-I and CLIP-T scores. First, generate the images to be evaluated:

  ```shell
  python eval/gen_p_images_for_gen_eval.py --data_root <path/to/uni_c_tokens_data> --concept bo --ckpt_name example --epoch_to_load 20 --inverse_prompt
  ```

  After generating the images, modify the parameters as needed in `eval/clip_eval.py` and run it to complete the evaluation.
- Personalized Generation — People Generation & Knowledge-driven Generation Evaluation Scripts

  For People Generation and Knowledge-driven Generation, first generate the images to be evaluated:

  ```shell
  python eval/gen_p_images_for_mmu_t2i.py --data_root <path/to/uni_c_tokens_data> --concept bo --ckpt_name example --epoch_to_load 20 --inverse_prompt
  ```

  - People Generation: Modify the parameters in `eval/face_eval_v2.py`, then run `eval/face_eval_v2.py` to evaluate face generation.
  - Knowledge-driven Generation: First set your GPT API key in `eval/api.py`, then modify the parameters in `eval/4o_judge_t2i.py` and run it to complete the evaluation.
The dataset and code are released under CC-BY-NC 4.0 and are intended for academic research only. Commercial use is not permitted.
If you use UniCTokens in your research, please cite our paper:
```bibtex
@article{an2025unictokens,
  title={UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens},
  author={An, Ruichuan and Yang, Sihan and Zhang, Renrui and Shen, Zijun and Lu, Ming and Dai, Gaole and Liang, Hao and Guo, Ziyu and Yan, Shilin and Luo, Yulin and others},
  journal={arXiv preprint arXiv:2505.14671},
  year={2025}
}
```