Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
We introduce Uni-MuMER, which fully fine-tunes the Qwen2.5-VL-3B model for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions.
We further extend Uni-MuMER to newer backbone architectures (Qwen3.5 and Qwen3-VL) and provide a family of fine-tuned models for the community.
Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting.
- 2026-04-13: Release 4 new model variants: Qwen3.5-2B, Qwen3.5-4B, Qwen3-VL-2B, Qwen3-VL-4B. See Model Zoo
- 2026-03-29: Release preprocessing scripts for UniMuMER-Tree, symbol-counting data construction, and the MathNet-based HMER prompt-data pipeline. See Preprocessing
- 2025-09-18: This work was accepted to NeurIPS 2025 as a Spotlight (688/21575).
- 2025-09-09: Release dataset (Uni-MuMER-Data and valid/test data) and training code. See Training
- 2025-06-02: Release of model weights and inference scripts.
- Download `data.zip` from GitHub, Huggingface, or the Google Drive link.
- Unzip it at the project root. After extraction, you should have:
```
data
├── CROHME/
├── CROHME2023/
├── HME100K/
├── Im2LaTeXv2/
├── MathWriting/
└── MNE/
```
We provide preprocessing scripts for the released UniMuMER-Tree and symbol-counting data construction pipeline in preprocess/.
The current release includes:
- `tree`: generation of the UniMuMER-Tree supervision target from tokenized LaTeX
- `can`: generation of symbol-counting supervision paired with the corresponding LaTeX target
- `make_mathnet_hmer_data.py`: port of the MathNet-based caption cleaning and HMER prompt-data pipeline from `mathnet-ly/0410_proc_file.ipynb`
- `preprocess/MathNet4clean`: a pinned submodule for external MathNet preprocessing utilities and normalization reference code
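The exact supervision format is defined by the release scripts in `preprocess/`. Purely as an illustration of the symbol-counting idea described above, here is a minimal sketch; the function names and output format are hypothetical, not the repository's implementation:

```python
from collections import Counter

def count_symbols(tokens):
    """Count occurrences of each LaTeX token in a tokenized expression.

    Grouping braces are skipped: they are layout markers,
    not visible symbols in the handwritten image.
    """
    skip = {"{", "}"}
    return Counter(t for t in tokens if t not in skip)

def counting_target(tokens):
    """Render the counts as a deterministic supervision string."""
    counts = count_symbols(tokens)
    return ", ".join(f"{tok}: {n}" for tok, n in sorted(counts.items()))

# Example: tokenized form of x^{2}+x
print(counting_target(["x", "^", "{", "2", "}", "+", "x"]))
```

Pairing such a counting string with the LaTeX target gives the model an auxiliary signal for consistency in long expressions.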
The main entry points are:
```bash
python preprocess/make_unimumer_tree_data.py --task tree --input <input-json> --output <output-json>
```

```bash
python preprocess/make_mathnet_hmer_data.py process \
    --input <caption-txt-or-json> \
    --output-dir <output-dir> \
    --image-base-path <image-base-path> \
    --mathnet-root preprocess/MathNet4clean
```

For example:

```bash
python preprocess/make_unimumer_tree_data.py \
    --task tree \
    --input preprocess/examples/unimumer_tree/sample_input.json \
    --output preprocess/examples/unimumer_tree/sample_tree.json
```

Additional usage details, implementation notes, and ready-to-run examples are provided in `preprocess/README.md`, `preprocess/examples/unimumer_tree/`, and `preprocess/examples/mathnet_hmer/`.
For the MathNet-based preprocessing dependency, initialize submodules after cloning:
```bash
git submodule update --init --recursive
```

The Uni-MuMER-specific MathNet normalization path is implemented in `preprocess/MathNet4clean/preprocessing/improve_tokens_unimumer.py`.
The MathNet HMER pipeline keeps the notebook-style intermediate outputs
(step1 to step9, invalid_record.json, and <output_dir>_0425_final.json)
while replacing the original hard-coded external MathNet path with the in-repo
submodule.
After the dataset is in place, you can run batch inference over all three test sets with one of the two commands below.
```bash
bash eval/eval_crohme.sh -i <input-dir> -o <output-dir> -m <model> -b <batch_size>
```

Example:

```bash
bash eval/eval_all.sh -m models/Uni-MuMER-3B -s test1 -b 32768
```

```bash
python scripts/vllm_infer.py --input-dir <input-dir> --output-dir <output-dir> --model <model> --batch_size <batch_size>
```

Tips:
- To select GPUs on multi-GPU machines, export `CUDA_VISIBLE_DEVICES` before running the script, e.g., `export CUDA_VISIBLE_DEVICES=1,2`.
- Use the `--batch_size` argument to control the number of samples per `vLLM.generate()` call. The default is 32768; lower it if you run into OOM errors.
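The batching behavior described above amounts to slicing the prompt list into bounded chunks before each `generate()` call. As a minimal sketch of that pattern (the inference script's internals may differ):

```python
def iter_batches(samples, batch_size=32768):
    """Yield successive slices of at most `batch_size` samples.

    Bounding the number of prompts handed to a single generation
    call keeps peak memory use under control.
    """
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# Example: 5 samples with batch_size=2 -> batches of 2, 2, 1
batches = list(iter_batches(list(range(5)), batch_size=2))
print(batches)
```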
This repository currently provides text-based evaluation through
scripts/eval_metrics_calculator.py,
including Edit Score, BLEU-4, CER, and exact-match rate.
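These metrics follow standard definitions. As a reference sketch of CER and exact match (textbook formulas only; the repository's calculator may apply its own tokenization and normalization):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cer(gt, pred):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(gt, pred) / max(len(gt), 1)

def exact_match(gt, pred):
    """1.0 if the prediction matches the reference string exactly."""
    return float(gt == pred)
```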
We do not currently vendor the official CDM evaluator in this repository.
The ExpRate@CDM results reported in our paper follow the visually grounded
Character Detection Matching (CDM) protocol. As described in
our paper, ExpRate@CDM is used as a
visual-equivalence-aware accuracy metric beyond exact string match.
For reproducing ExpRate@CDM, please use the official
UniMERNet CDM toolkit.
The official implementation provides the evaluation entry point
cdm/evaluation.py, and our inference outputs already contain the core fields
(gt, pred, and img_id) expected by that evaluator.
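For instance, a prediction file carrying those field names could be written as follows; the record contents and filename here are hypothetical, only the `gt`/`pred`/`img_id` keys come from the description above:

```python
import json

# Hypothetical records illustrating the expected core fields.
records = [
    {"img_id": "sample_0001", "gt": r"x^{2}+1", "pred": r"x^{2}+1"},
    {"img_id": "sample_0002", "gt": r"\frac{a}{b}", "pred": r"\frac{a}{b}"},
]

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```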
| Model | Base | Params | Avg ExpRate | Avg CDM ExpRate | HuggingFace |
|---|---|---|---|---|---|
| Uni-MuMER-Qwen2.5-VL-3B | Qwen2.5-VL-3B-Instruct | 3.4B | 72.19% | 80.75% | Link |
| Uni-MuMER-Qwen2.5-VL-7B | Qwen2.5-VL-7B-Instruct | 8.3B | - | - | Link |
| Uni-MuMER-Qwen3.5-2B | Qwen3.5-2B | 2.2B | 73.09% | 81.15% | Link |
| Uni-MuMER-Qwen3.5-4B | Qwen3.5-4B | 4.5B | 71.60% | 81.19% | Link |
| Uni-MuMER-Qwen3-VL-2B | Qwen3-VL-2B-Instruct | 2.1B | 72.49% | 80.35% | Link |
| Uni-MuMER-Qwen3-VL-4B | Qwen3-VL-4B-Instruct | 4.4B | 72.11% | 80.67% | Link |
Note: Qwen3.5 models require `transformers >= 5.0.0`. Qwen3-VL models use the `qwen3_vl_nothink` template; Qwen3.5 models use `qwen3_5_nothink`.
| Dataset | Samples | Uni-MuMER-3B | Qwen3.5-2B | Qwen3.5-4B | Qwen3-VL-2B | Qwen3-VL-4B |
|---|---|---|---|---|---|---|
| CROHME 2014 | 986 | 82.25% | 83.98% | 82.56% | 83.27% | 82.35% |
| CROHME 2016 | 1,147 | 78.29% | 81.17% | 78.20% | 78.55% | 79.34% |
| CROHME 2019 | 1,199 | 79.82% | 80.15% | 75.98% | 79.40% | 78.98% |
| CROHME 2023 Test | 2,300 | 69.52% | 69.43% | 66.74% | 70.96% | 69.17% |
| HME100K Test | 24,607 | 69.50% | 70.43% | 70.02% | 69.31% | 69.79% |
| Im2LaTeXv2 Test | 10,118 | 76.99% | 77.82% | 77.20% | 77.11% | 77.46% |
| MathWriting Test | 7,643 | 53.03% | 51.84% | 54.32% | 50.66% | 53.15% |
| MNE-N1 | 1,875 | 75.89% | 74.72% | 69.76% | 79.63% | 70.51% |
| MNE-N2 | 304 | 57.89% | 61.18% | 54.93% | 65.13% | 52.96% |
| MNE-N3 | 1,464 | 46.72% | 56.83% | 56.69% | 52.39% | 54.44% |
| Average | | 72.19% | 73.09% | 71.60% | 72.49% | 72.11% |
| Dataset | Uni-MuMER-3B | Qwen3.5-2B | Qwen3.5-4B | Qwen3-VL-2B | Qwen3-VL-4B |
|---|---|---|---|---|---|
| CROHME 2014 | 85.40% | 87.10% | 87.90% | 86.50% | 86.00% |
| CROHME 2016 | 80.90% | 82.70% | 82.00% | 81.50% | 82.00% |
| CROHME 2019 | 82.70% | 82.70% | 82.80% | 82.60% | 82.20% |
| CROHME 2023 Test | 79.00% | 79.20% | 78.00% | 78.70% | 79.10% |
| HME100K Test | 72.50% | 73.60% | 73.00% | 72.40% | 72.90% |
| Im2LaTeXv2 Test | 93.40% | 93.70% | 91.90% | 92.60% | 93.10% |
| MathWriting Test | 72.50% | 70.70% | 74.30% | 69.00% | 71.70% |
| MNE-N1 | 81.20% | 82.30% | 82.30% | 82.60% | 81.10% |
| MNE-N2 | 71.10% | 72.70% | 74.30% | 72.40% | 71.70% |
| MNE-N3 | 67.30% | 80.80% | 78.50% | 72.80% | 76.00% |
| Average | 80.75% | 81.15% | 81.19% | 80.35% | 80.67% |
Evaluation: vLLM 0.19.0, temperature=0.2, max_tokens=2048. CDM computed with UniMERNet/cdm.
Our training code depends on LLaMA-Factory.
For training dependencies, please refer to LLaMA-Factory or requirements_training.txt.
```bash
llamafactory-cli train train/Uni-MuMER-train.yaml
```

Training configs for newer backbones are provided in `train/`:
```bash
# Qwen3.5 models (requires transformers >= 5.0.0)
llamafactory-cli train train/Uni-MuMER-Qwen3.5-2B.yaml
llamafactory-cli train train/Uni-MuMER-Qwen3.5-4B.yaml

# Qwen3-VL models
llamafactory-cli train train/Uni-MuMER-Qwen3-VL-2B.yaml
llamafactory-cli train train/Uni-MuMER-Qwen3-VL-4B.yaml
```

Key differences from the original 3B config:
- Qwen3.5 models use template `qwen3_5_nothink`; Qwen3-VL models use `qwen3_vl_nothink`
- All new models use DeepSpeed ZeRO-3 (required for 8x A100 80GB)
- `transformers >= 5.0.0` is required for Qwen3.5 architecture support
- Qwen3.5 and Qwen3-VL use different tokenizers; their tokenized caches are not interchangeable
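To make the template difference concrete, a config for one of the new backbones might contain fields like the following. This is an illustrative fragment in LLaMA-Factory style, not the released config; paths and values are placeholders:

```yaml
# Illustrative LLaMA-Factory fragment; see train/ for the actual configs.
model_name_or_path: Qwen/Qwen3-VL-2B-Instruct
template: qwen3_vl_nothink          # qwen3_5_nothink for Qwen3.5 models
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # ZeRO-3, per the notes above
```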
- Inference code and pretrained models.
- Evaluation code.
- Training code.
- Training data.
- Preprocess code.
Thanks to the following projects:
If you find Uni-MuMER useful for your study or research, please cite our paper with:
@inproceedings{li2025unimumer,
author = {Li, Yu and Jiang, Jin and Zhu, Jianhua and Peng, Shuai and Baole and Zhou, Yuxuan and Gao, Liangcai},
booktitle = {Advances in Neural Information Processing Systems},
editor = {D. Belgrave and C. Zhang and H. Lin and R. Pascanu and P. Koniusz and M. Ghassemi and N. Chen},
pages = {129040--129074},
publisher = {Curran Associates, Inc.},
title = {Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition},
url = {https://proceedings.neurips.cc/paper_files/paper/2025/file/bb992de895e886c2be79985835cb0ea4-Paper-Conference.pdf},
volume = {38},
year = {2025}
}

