
[🔥ICLR 2026] LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing

Zhenghao Zhang∗, Ziying Zhang∗, Junchao Liao∗, Xiangyu Meng, Qiang Hu, Siyu Zhu, Xiaoyun Zhang, Long Qin, Weizhi Wang

* equal contribution

This is the official repository for paper "LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing".

💡 Abstract

Recent multimodal models for instruction-based face editing enable semantic manipulation but still struggle with precise attribute control and identity preservation. Structural facial representations such as landmarks are effective for intermediate supervision, yet most existing methods treat them as rigid geometric constraints, which can degrade identity when conditional landmarks deviate significantly from the source (e.g., large expression or pose changes, inaccurate landmark estimates). To address these limitations, we propose LaTo, a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing. Our key innovations include: (1) a landmark tokenizer that directly quantizes raw landmark coordinates into discrete facial tokens, obviating the need for dense pixel-wise correspondence; (2) a location-mapped positional encoding and a landmark-aware classifier-free guidance that jointly facilitate flexible yet decoupled interactions among instruction, geometry, and appearance, enabling strong identity preservation; and (3) a landmark predictor that leverages vision–language models to infer target landmarks from instructions and source images, whose structured chain-of-thought improves estimation accuracy and interactive control. To mitigate data scarcity, we curate HFL-150K, to our knowledge the largest benchmark for this task, containing over 150K real face pairs with fine-grained instructions. Extensive experiments show that LaTo outperforms state-of-the-art methods by 7.8% in identity preservation and 4.6% in semantic consistency.
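As a rough illustration of the tokenization idea (not the paper's actual tokenizer, which quantizes coordinates with a learned codebook), mapping normalized 2-D landmark coordinates onto discrete grid-cell token ids might look like the following sketch; the function name, grid size, and row-major id scheme are all hypothetical:

```python
import numpy as np

def tokenize_landmarks(landmarks, grid_size=64):
    """Toy sketch: map normalized (x, y) landmark coordinates in [0, 1]
    to discrete grid-cell token ids in [0, grid_size**2).
    Illustration only -- LaTo's tokenizer is learned, not a fixed grid."""
    coords = np.clip(np.asarray(landmarks, dtype=np.float64), 0.0, 1.0 - 1e-9)
    cells = np.floor(coords * grid_size).astype(np.int64)  # per-axis bin index
    return cells[:, 1] * grid_size + cells[:, 0]           # row-major token id

# Three toy landmarks (normalized coords): left eye, right eye, nose tip
tokens = tokenize_landmarks([[0.30, 0.40], [0.70, 0.40], [0.50, 0.60]])
```

The point of such a discrete representation is that landmark geometry becomes a short token sequence the transformer can attend over, rather than a dense pixel-wise constraint map.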

📑 Table of Contents

- 🐍 Installation
- 📦 Model Weights
- 🔄 Inference
- 🤝 Acknowledgements
- 📄 Our previous work
- 📚 Citation

🐍 Installation

Dependencies: Python (tested with Python 3.10) and CUDA 12.4 (matching the PyTorch and flash-attn wheels below)

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# download the compatible flash-attn wheel from https://github.com/Dao-AILab/flash-attention/releases
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# if the wheel fails to install, build flash-attn from source instead; see https://github.com/Dao-AILab/flash-attention

📦 Model Weights

bash scripts/download_model.sh

🔄 Inference

bash scripts/run_inference.sh

🤝 Acknowledgements

We are grateful to the following open-source projects, which were instrumental in this work: Step1X-Edit and Qwen3-VL.

Special thanks to the contributors of these libraries for their hard work and dedication!

📄 Our previous work

📚 Citation

@misc{zhang2026latolandmarktokenizeddiffusiontransformer,
      title={LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing}, 
      author={Zhenghao Zhang and Ziying Zhang and Junchao Liao and Xiangyu Meng and Qiang Hu and Siyu Zhu and Xiaoyun Zhang and Long Qin and Weizhi Wang},
      year={2026},
      eprint={2509.25731},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25731}, 
}