Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
We introduce Uni-MuMER, which fully fine-tunes the Qwen2.5-VL-3B model for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions.
We further extend Uni-MuMER to newer backbone architectures (Qwen3.5 and Qwen3-VL) and provide a family of fine-tuned models for the community.
Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting.
- 2026-04-13: Release 4 new model variants: Qwen3.5-2B, Qwen3.5-4B, Qwen3-VL-2B, Qwen3-VL-4B. See Model Zoo
- 2026-03-29: Release preprocessing scripts for UniMuMER-Tree, symbol-counting data construction, and the MathNet-based HMER prompt-data pipeline. See Preprocessing
- 2025-09-18: This work was accepted to NeurIPS 2025 as a Spotlight (688/21575).
- 2025-09-09: Release dataset (Uni-MuMER-Data and valid/test data) and training code. See Training
- 2025-06-02: Release of model weights and inference scripts.
- Download `data.zip` from GitHub, Huggingface, or the Google Drive link.
- Unzip it at the project root. After extraction, you should have:
```
data
├── CROHME/
├── CROHME2023/
├── HME100K/
├── Im2LaTeXv2/
├── MathWriting/
└── MNE/
```
We provide preprocessing scripts for the released UniMuMER-Tree and symbol-counting data construction pipeline in preprocess/.
The current release includes:
- `tree`: generation of the UniMuMER-Tree supervision target from tokenized LaTeX
- `can`: generation of symbol-counting supervision paired with the corresponding LaTeX target
- `make_mathnet_hmer_data.py`: port of the MathNet-based caption cleaning and HMER prompt-data pipeline from `mathnet-ly/0410_proc_file.ipynb`
- `preprocess/MathNet4clean`: a pinned submodule for external MathNet preprocessing utilities and normalization reference code
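The exact supervision format is defined by the release scripts in `preprocess/`. Purely as an illustration of the symbol-counting idea described above, here is a minimal sketch; the function names and output format are hypothetical, not the repository's implementation:

```python
from collections import Counter

def count_symbols(tokens):
    """Count occurrences of each LaTeX token in a tokenized expression.

    Grouping braces are skipped: they are layout markers,
    not visible symbols in the handwritten image.
    """
    skip = {"{", "}"}
    return Counter(t for t in tokens if t not in skip)

def counting_target(tokens):
    """Render the counts as a deterministic supervision string."""
    counts = count_symbols(tokens)
    return ", ".join(f"{tok}: {n}" for tok, n in sorted(counts.items()))

# Example: tokenized form of x^{2}+x
print(counting_target(["x", "^", "{", "2", "}", "+", "x"]))
```

Pairing such a counting string with the LaTeX target gives the model an auxiliary signal for consistency in long expressions.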
The main entry points are:
```bash
python preprocess/make_unimumer_tree_data.py --task tree --input <input-json> --output <output-json>
```

```bash
python preprocess/make_mathnet_hmer_data.py process \
    --input <caption-txt-or-json> \
    --output-dir <output-dir> \
    --image-base-path <image-base-path> \
    --mathnet-root preprocess/MathNet4clean
```

For example:

```bash
python preprocess/make_unimumer_tree_data.py \
    --task tree \
    --input preprocess/examples/unimumer_tree/sample_input.json \
    --output preprocess/examples/unimumer_tree/sample_tree.json
```

Additional usage details, implementation notes, and ready-to-run examples are provided in `preprocess/README.md`, `preprocess/examples/unimumer_tree/`, and `preprocess/examples/mathnet_hmer/`.
For the MathNet-based preprocessing dependency, initialize submodules after cloning:
```bash
git submodule update --init --recursive
```

The Uni-MuMER-specific MathNet normalization path is implemented in `preprocess/MathNet4clean/preprocessing/improve_tokens_unimumer.py`.
The MathNet HMER pipeline keeps the notebook-style intermediate outputs
(step1 to step9, invalid_record.json, and <output_dir>_0425_final.json)
while replacing the original hard-coded external MathNet path with the in-repo
submodule.
After the dataset is in place, you can run batch inference over all three test sets with one of the two commands below.
```bash
bash eval/eval_crohme.sh -i <input-dir> -o <output-dir> -m <model> -b <batch_size>
```

Example:

```bash
bash eval/eval_all.sh -m models/Uni-MuMER-3B -s test1 -b 32768
```

```bash
python scripts/vllm_infer.py --input-dir <input-dir> --output-dir <output-dir> --model <model> --batch_size <batch_size>
```

Tips:
- To select GPUs on multi-GPU machines, export `CUDA_VISIBLE_DEVICES` before running the script, e.g., `export CUDA_VISIBLE_DEVICES=1,2`.
- Use the `--batch_size` argument to control the number of samples per `vLLM.generate()` call. The default is 32768; lower it if you run into OOM errors.
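The batching behavior described above amounts to slicing the prompt list into bounded chunks before each `generate()` call. As a minimal sketch of that pattern (the inference script's internals may differ):

```python
def iter_batches(samples, batch_size=32768):
    """Yield successive slices of at most `batch_size` samples.

    Bounding the number of prompts handed to a single generation
    call keeps peak memory use under control.
    """
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# Example: 5 samples with batch_size=2 -> batches of 2, 2, 1
batches = list(iter_batches(list(range(5)), batch_size=2))
print(batches)
```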
This repository currently provides text-based evaluation through
scripts/eval_metrics_calculator.py,
including Edit Score, BLEU-4, CER, and exact-match rate.
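These metrics follow standard definitions. As a reference sketch of CER and exact match (textbook formulas only; the repository's calculator may apply its own tokenization and normalization):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cer(gt, pred):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(gt, pred) / max(len(gt), 1)

def exact_match(gt, pred):
    """1.0 if the prediction matches the reference string exactly."""
    return float(gt == pred)
```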
We do not currently vendor the official CDM evaluator in this repository.
The ExpRate@CDM results reported in our paper follow the visually grounded
Character Detection Matching (CDM) protocol. As described in
our paper, ExpRate@CDM is used as a
visual-equivalence-aware accuracy metric beyond exact string match.
For reproducing ExpRate@CDM, please use the official
UniMERNet CDM toolkit.
The official implementation provides the evaluation entry point
cdm/evaluation.py, and our inference outputs already contain the core fields
(gt, pred, and img_id) expected by that evaluator.
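For instance, a prediction file carrying those field names could be written as follows; the record contents and filename here are hypothetical, only the `gt`/`pred`/`img_id` keys come from the description above:

```python
import json

# Hypothetical records illustrating the expected core fields.
records = [
    {"img_id": "sample_0001", "gt": r"x^{2}+1", "pred": r"x^{2}+1"},
    {"img_id": "sample_0002", "gt": r"\frac{a}{b}", "pred": r"\frac{a}{b}"},
]

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```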
| Model | Base | Params | Avg ExpRate | Avg CDM ExpRate | HuggingFace |
|---|---|---|---|---|---|
| Uni-MuMER-Qwen2.5-VL-3B | Qwen2.5-VL-3B-Instruct | 3.4B | 72.19% | 80.75% | Link |
| Uni-MuMER-Qwen2.5-VL-7B | Qwen2.5-VL-7B-Instruct | 8.3B | - | - | Link |
| Uni-MuMER-Qwen3.5-2B | Qwen3.5-2B | 2.2B | 73.09% | 81.15% | Link |
| Uni-MuMER-Qwen3.5-4B | Qwen3.5-4B | 4.5B | 71.60% | 81.19% | Link |
| Uni-MuMER-Qwen3-VL-2B | Qwen3-VL-2B-Instruct | 2.1B | 72.49% | 80.35% | Link |
| Uni-MuMER-Qwen3-VL-4B | Qwen3-VL-4B-Instruct | 4.4B | 72.11% | 80.67% | Link |
Note: Qwen3.5 models require `transformers >= 5.0.0`. Qwen3-VL models use the `qwen3_vl_nothink` template; Qwen3.5 models use `qwen3_5_nothink`.
| Dataset | Samples | Uni-MuMER-3B | Qwen3.5-2B | Qwen3.5-4B | Qwen3-VL-2B | Qwen3-VL-4B |
|---|---|---|---|---|---|---|
| CROHME 2014 | 986 | 82.25% | 83.98% | 82.56% | 83.27% | 82.35% |
| CROHME 2016 | 1,147 | 78.29% | 81.17% | 78.20% | 78.55% | 79.34% |
| CROHME 2019 | 1,199 | 79.82% | 80.15% | 75.98% | 79.40% | 78.98% |
| CROHME 2023 Test | 2,300 | 69.52% | 69.43% | 66.74% | 70.96% | 69.17% |
| HME100K Test | 24,607 | 69.50% | 70.43% | 70.02% | 69.31% | 69.79% |
| Im2LaTeXv2 Test | 10,118 | 76.99% | 77.82% | 77.20% | 77.11% | 77.46% |
| MathWriting Test | 7,643 | 53.03% | 51.84% | 54.32% | 50.66% | 53.15% |
| MNE-N1 | 1,875 | 75.89% | 74.72% | 69.76% | 79.63% | 70.51% |
| MNE-N2 | 304 | 57.89% | 61.18% | 54.93% | 65.13% | 52.96% |
| MNE-N3 | 1,464 | 46.72% | 56.83% | 56.69% | 52.39% | 54.44% |
| Average | | 72.19% | 73.09% | 71.60% | 72.49% | 72.11% |
| Dataset | Uni-MuMER-3B | Qwen3.5-2B | Qwen3.5-4B | Qwen3-VL-2B | Qwen3-VL-4B |
|---|---|---|---|---|---|
| CROHME 2014 | 85.40% | 87.10% | 87.90% | 86.50% | 86.00% |
| CROHME 2016 | 80.90% | 82.70% | 82.00% | 81.50% | 82.00% |
| CROHME 2019 | 82.70% | 82.70% | 82.80% | 82.60% | 82.20% |
| CROHME 2023 Test | 79.00% | 79.20% | 78.00% | 78.70% | 79.10% |
| HME100K Test | 72.50% | 73.60% | 73.00% | 72.40% | 72.90% |
| Im2LaTeXv2 Test | 93.40% | 93.70% | 91.90% | 92.60% | 93.10% |
| MathWriting Test | 72.50% | 70.70% | 74.30% | 69.00% | 71.70% |
| MNE-N1 | 81.20% | 82.30% | 82.30% | 82.60% | 81.10% |
| MNE-N2 | 71.10% | 72.70% | 74.30% | 72.40% | 71.70% |
| MNE-N3 | 67.30% | 80.80% | 78.50% | 72.80% | 76.00% |
| Average | 80.75% | 81.15% | 81.19% | 80.35% | 80.67% |
Evaluation: vLLM 0.19.0, temperature=0.2, max_tokens=2048. CDM computed with UniMERNet/cdm.
Our training code depends on LLaMA-Factory.
For training dependencies, please refer to LLaMA-Factory or requirements_training.txt.
```bash
llamafactory-cli train train/Uni-MuMER-train.yaml
```

Training configs for newer backbones are provided in `train/`:
```bash
# Qwen3.5 models (requires transformers >= 5.0.0)
llamafactory-cli train train/Uni-MuMER-Qwen3.5-2B.yaml
llamafactory-cli train train/Uni-MuMER-Qwen3.5-4B.yaml

# Qwen3-VL models
llamafactory-cli train train/Uni-MuMER-Qwen3-VL-2B.yaml
llamafactory-cli train train/Uni-MuMER-Qwen3-VL-4B.yaml
```

Key differences from the original 3B config:
- Qwen3.5 models use template `qwen3_5_nothink`; Qwen3-VL models use `qwen3_vl_nothink`
- All new models use DeepSpeed ZeRO-3 (required for 8x A100 80GB)
- `transformers >= 5.0.0` is required for Qwen3.5 architecture support
- Qwen3.5 and Qwen3-VL use different tokenizers; their tokenized caches are not interchangeable
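To make the template difference concrete, a config for one of the new backbones might contain fields like the following. This is an illustrative fragment in LLaMA-Factory style, not the released config; paths and values are placeholders:

```yaml
# Illustrative LLaMA-Factory fragment; see train/ for the actual configs.
model_name_or_path: Qwen/Qwen3-VL-2B-Instruct
template: qwen3_vl_nothink          # qwen3_5_nothink for Qwen3.5 models
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # ZeRO-3, per the notes above
```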
- Inference code and pretrained models.
- Evaluation code.
- Training code.
- Training data.
- Preprocess code.
Thanks to the following projects:
If you find Uni-MuMER useful for your study or research, please cite our paper with:
@inproceedings{li2025unimumer,
author = {Li, Yu and Jiang, Jin and Zhu, Jianhua and Peng, Shuai and Baole and Zhou, Yuxuan and Gao, Liangcai},
booktitle = {Advances in Neural Information Processing Systems},
editor = {D. Belgrave and C. Zhang and H. Lin and R. Pascanu and P. Koniusz and M. Ghassemi and N. Chen},
pages = {129040--129074},
publisher = {Curran Associates, Inc.},
title = {Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition},
url = {https://proceedings.neurips.cc/paper_files/paper/2025/file/bb992de895e886c2be79985835cb0ea4-Paper-Conference.pdf},
volume = {38},
year = {2025}
}

