Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation

This is the official repository for the M4Doc framework, introduced in the paper Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation (ACL 2025 Main).

📜 Abstract

Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.

The diagram of the proposed M4Doc.

🛠️ M4Doc

1. Requirements

Follow the setup instructions in Ucas-HaoranWei/Vary to prepare the environment, then install the following packages:

sacrebleu==2.3.1
jieba==0.42.1
zss==1.2.0
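To confirm that the environment matches these pins, a small check along the following lines may help. This is a minimal sketch and not part of the repository: the names `PINNED`, `installed_version`, and `find_mismatches` are hypothetical, and it assumes the packages were installed with pip (for example `pip install sacrebleu==2.3.1 jieba==0.42.1 zss==1.2.0`).

```python
from importlib import metadata

# Pins copied from the package list above.
PINNED = {"sacrebleu": "2.3.1", "jieba": "0.42.1", "zss": "1.2.0"}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a tuple (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Report only; nothing is installed or modified.
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result from `find_mismatches(PINNED)` means all three packages are present at the pinned versions.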

2. Download pre-trained models and the dataset

Download the Vary-base model from Ucas-HaoranWei/Vary.

Download the pre-trained Nougat model from facebook/nougat-small.

The DoTA dataset can be downloaded from Hugging Face. After submitting the download request on Hugging Face, please email liangyupu2021@ia.ac.cn with your name and affiliated institution.

The directory structure should be as follows:

DIMTDA
├── codes
├── DoTA_dataset
├── pretrained_models
└── utils
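Before launching training, it can be useful to verify that the layout above is in place. The following is a minimal sketch, not a script shipped with the repository; `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the check only looks at the four top-level directories shown in the tree.

```python
import sys
from pathlib import Path

# Sub-directories shown in the tree above.
EXPECTED_DIRS = ("codes", "DoTA_dataset", "pretrained_models", "utils")


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Root defaults to the current directory; pass the DIMTDA path to override.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_dirs(root)
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Layout matches the expected structure.")
```

The check does not inspect the contents of `DoTA_dataset` or `pretrained_models`, so it only catches a misplaced or missing top-level folder.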

3. Pre-train a text translation model

bash pretrain_trans.sh

4. Finetune M4Doc

bash finetune_M4Doc.sh

5. Inference

Before running the script, replace the modeling_bert.py file in your environment's transformers installation (~/anaconda3/envs/your_env_name/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py) with ./utils/modeling_bert.py.
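Because this step overwrites a file inside an installed package, keeping a copy of the original makes it easy to undo. The helper below is a minimal sketch of one way to do the replacement with a backup; `replace_with_backup` and the `.orig` suffix are hypothetical choices, not something the repository provides.

```python
import shutil
from pathlib import Path


def replace_with_backup(target, replacement, backup_suffix=".orig"):
    """Copy `replacement` over `target`, first saving the original once
    as `<target name><backup_suffix>` in the same directory."""
    target, replacement = Path(target), Path(replacement)
    if not target.is_file():
        raise FileNotFoundError(f"target not found: {target}")
    if not replacement.is_file():
        raise FileNotFoundError(f"replacement not found: {replacement}")
    backup = target.with_name(target.name + backup_suffix)
    # Keep only the first backup so that re-running never overwrites
    # the pristine file with an already patched one.
    if not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(replacement, target)
    return backup
```

Assuming transformers is importable in the active environment, the target can be located without hard-coding the conda path as `Path(transformers.__file__).parent / "models" / "bert" / "modeling_bert.py"`, and the call would then be `replace_with_backup(target, "./utils/modeling_bert.py")`. Copying the `.orig` file back restores the stock implementation.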

bash inference.sh

6. Evaluate

bash evaluate.sh

🙏🏻 Acknowledgement

We thank @lukas-blecher and the facebookresearch/nougat project for providing the pre-trained Nougat model. We also thank the Ucas-HaoranWei/Vary project for providing the pre-trained Vary model.

✍🏻 Citation

If you want to cite our paper, please use the following BibTeX entry:

@inproceedings{liang-etal-2025-single,
    title = "Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation",
    author = "Liang, Yupu  and
      Zhang, Yaping  and
      Zhang, Zhiyang  and
      Zhao, Yang  and
      Xiang, Lu  and
      Zong, Chengqing  and
      Zhou, Yu",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.606/",
    pages = "12391--12408",
    ISBN = "979-8-89176-251-0",
}

If you have any questions, feel free to contact liangyupu2021@ia.ac.cn.
