
[🔥ICLR 2026] LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing

Zhenghao Zhang∗, Ziying Zhang∗, Junchao Liao∗, Xiangyu Meng, Qiang Hu, Siyu Zhu, Xiaoyun Zhang, Long Qin, Weizhi Wang

* equal contribution

This is the official repository for paper "LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing".

💡 Abstract

Recent multimodal models for instruction-based face editing enable semantic manipulation but still struggle with precise attribute control and identity preservation. Structural facial representations such as landmarks are effective for intermediate supervision, yet most existing methods treat them as rigid geometric constraints, which can degrade identity when conditional landmarks deviate significantly from the source (e.g., large expression or pose changes, inaccurate landmark estimates). To address these limitations, we propose LaTo, a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing. Our key innovations include: (1) a landmark tokenizer that directly quantizes raw landmark coordinates into discrete facial tokens, obviating the need for dense pixel-wise correspondence; (2) a location-mapped positional encoding and a landmark-aware classifier-free guidance that jointly facilitate flexible yet decoupled interactions among instruction, geometry, and appearance, enabling strong identity preservation; and (3) a landmark predictor that leverages vision–language models to infer target landmarks from instructions and source images, whose structured chain-of-thought improves estimation accuracy and interactive control. To mitigate data scarcity, we curate HFL-150K, to our knowledge the largest benchmark for this task, containing over 150K real face pairs with fine-grained instructions. Extensive experiments show that LaTo outperforms state-of-the-art methods by 7.8% in identity preservation and 4.6% in semantic consistency.
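As a rough illustration of the tokenization idea (not the paper's actual tokenizer, which quantizes coordinates with a learned codebook), mapping normalized 2-D landmark coordinates onto discrete grid-cell token ids might look like the following sketch; the function name, grid size, and row-major id scheme are all hypothetical:

```python
import numpy as np

def tokenize_landmarks(landmarks, grid_size=64):
    """Toy sketch: map normalized (x, y) landmark coordinates in [0, 1]
    to discrete grid-cell token ids in [0, grid_size**2).
    Illustration only -- LaTo's tokenizer is learned, not a fixed grid."""
    coords = np.clip(np.asarray(landmarks, dtype=np.float64), 0.0, 1.0 - 1e-9)
    cells = np.floor(coords * grid_size).astype(np.int64)  # per-axis bin index
    return cells[:, 1] * grid_size + cells[:, 0]           # row-major token id

# Three toy landmarks (normalized coords): left eye, right eye, nose tip
tokens = tokenize_landmarks([[0.30, 0.40], [0.70, 0.40], [0.50, 0.60]])
```

The point of such a discrete representation is that landmark geometry becomes a short token sequence the transformer can attend over, rather than a dense pixel-wise constraint map.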

📑 Table of Contents

- 🐍 Installation
- 📦 Model Weights
- 🔄 Inference
- 🤝 Acknowledgements
- 📄 Our previous work
- 📚 Citation

🐍 Installation

Dependencies: Python (tested with Python 3.10) and CUDA 12.4 (matching the PyTorch and flash-attn wheels below)

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# download the compatible flash-attn wheel from https://github.com/Dao-AILab/flash-attention/releases
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# if the wheel fails to install, build flash-attn from source instead; see https://github.com/Dao-AILab/flash-attention

📦 Model Weights

bash scripts/download_model.sh

🔄 Inference

bash scripts/run_inference.sh

🤝 Acknowledgements

We are grateful to the following open-source projects, which were instrumental in this work: Step1X-Edit and Qwen3-VL.

Special thanks to the contributors of these libraries for their hard work and dedication!

📄 Our previous work

📚 Citation

@misc{zhang2026latolandmarktokenizeddiffusiontransformer,
      title={LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing}, 
      author={Zhenghao Zhang and Ziying Zhang and Junchao Liao and Xiangyu Meng and Qiang Hu and Siyu Zhu and Xiaoyun Zhang and Long Qin and Weizhi Wang},
      year={2026},
      eprint={2509.25731},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25731}, 
}