Wenyang Gao*¹², Yinghao Yang*², Xi Jin¹, Jing Li³, Yue Zhang²
¹ Zhejiang University | ² School of Engineering, Westlake University | ³ Sichuan Lan-bridge Information Technology Co., Ltd.
* Equal contribution
A reference-free, criterion-driven pairwise evaluation framework for machine translation.
Illustrative case of regression-based, error-based, ranking-based evaluation, and our proposed fine-grained ranking evaluation (FiRE).
Reliable evaluation is central to the development of high-quality machine translation (MT) systems. As modern MT systems and large language models increasingly produce fluent and competitive translations, traditional automatic metrics often become less effective at distinguishing subtle quality differences. Overlap-based metrics such as BLEU provide limited semantic sensitivity, while regression-based metrics produce scalar scores that may obscure fine-grained trade-offs between translations. Error-based methods offer richer diagnostic information, but their aggregated scores are not directly optimized for pairwise preference evaluation.
FiRE addresses this gap by formulating machine translation evaluation as a fine-grained, reference-free pairwise ranking problem. Given a source sentence and two translation candidates, FiRE asks an evaluator to compare the candidates under explicit criteria, including:
- Faithfulness: whether the translation accurately preserves the source meaning;
- Fluency: whether the translation is natural, readable, and grammatically well-formed;
- Consistency of Style: whether the translation preserves the tone, register, and stylistic characteristics of the source;
- Overall Quality: a synthesized or directly judged holistic preference.
Instead of relying on a single overall judgment, FiRE decomposes translation quality into complementary dimensions and then aggregates criterion-level judgments into an overall decision. This design improves interpretability, provides more actionable diagnostic signals, and better reflects the multi-dimensional nature of human translation preferences.
The repository provides code for running FiRE-style evaluation with either API-based LLM evaluators or local vLLM-backed models, together with human-annotated benchmark data and scripts for evaluation and scoring.
The paper reports several key findings:
- Fine-grained pairwise evaluation improves alignment with human preferences. FiRE consistently outperforms error-based evaluation methods across faithfulness, fluency, and consistency of style on ranked pairwise data.
| EN→ZH | RU→ZH | |||||
|---|---|---|---|---|---|---|
| Faithfulness | Fluency | Cons. of Style | Faithfulness | Fluency | Cons. of Style | |
| Error-Based | ||||||
| M-MAD | 45.9 | 25.2 | 19.3 | 55.4 | 24.9 | 17.5 |
| GEMBA-MQM | 37.9 | 32.9 | 3.0 | 39.8 | 29.9 | 5.4 |
| Ranking-Based | ||||||
| DeepSeek-R1-FiRE | 64.8 | 68.7 | 61.4 | 72.5 | 77.9 | 66.3 |
- Criterion-aware aggregation improves overall ranking. Aggregating fine-grained judgments from faithfulness, fluency, and consistency of style yields stronger overall pairwise decisions than direct holistic ranking in several settings.
| EN→ZH | RU→ZH | |
|---|---|---|
| Regression-Based | ||
| KIWI-XXL | 61.4 | 61.2 |
| XCOMET-XXL | 55.7 | 58.0 |
| MetricX-24-XXL | 61.6 | 67.1 |
| Error-Based | ||
| M-MAD | 43.6 | 51.9 |
| GEMBA-MQM | 41.5 | 37.6 |
| Ranking-Based | ||
| Ranker-XXL | 60.7 | 61.6 |
| DeepSeek-R1-Direct-Rank | 64.3 | 66.7 |
| DeepSeek-R1-FiRE | 65.3 | 70.1 |
- FiRE provides interpretable system-level diagnosis. Beyond producing an overall ranking, FiRE reveals where MT systems gain or lose performance across different quality dimensions.
Fine-grained ranking of six MT systems based on all pairwise data in EN→ZH (left) and RU→ZH (right).
We recommend using a clean Python environment.
python -m venv .venv
source .venv/bin/activateOr with conda:
conda create -n fire python=3.12
conda activate fireInstall the required dependencies:
pip install -r requirements.txtIf you are using a mirror source, for example the Tsinghua PyPI mirror, you can run:
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simpleUse --mode api when calling an OpenAI-compatible API endpoint.
python run.py \
--src-language English \
--tgt-language Chinese \
--dataset tied \
--mode api \
--model-name Qwen/Qwen3.6-35B-A3B \
--api-key sk-**** \
--api-url https://***/v1Use --mode vllm when running a local model through vLLM.
Before running vLLM mode, install vLLM separately:
pip install vllmThen run:
python run.py \
--src-language Russian \
--tgt-language Chinese \
--dataset tied \
--mode vllm \
--model-name QwQ-32B \
--model-path path-to-modelThe --dataset argument controls which subset of the benchmark is evaluated:
| Option | Meaning |
|---|---|
tied |
Tie cases where human annotators judge two translations as equivalent |
ranked |
Distinguishable cases where one translation is preferred |
all |
All evaluation cases |
| Argument | Description |
|---|---|
--src-language |
Source language name, e.g.,English, Russian, Japanese |
--tgt-language |
Target language name, e.g.,Chinese |
--dataset |
Evaluation split:all, ranked, or tied |
--mode |
Evaluation backend:api or vllm |
--model-name |
Model name used for API calls and output organization |
--api-key |
API key for API mode |
--api-url |
Base URL for an OpenAI-compatible API endpoint |
--model-path |
Local model path for vLLM mode |
--preferences |
Evaluation criteria; defaults to all supported preferences |
--temperature |
Sampling temperature for LLM generation |
Example with selected preferences:
python run.py \
--src-language English \
--tgt-language Chinese \
--dataset ranked \
--mode api \
--model-name your-model-name \
--api-key sk-**** \
--api-url https://***/v1 \
--preferences Faithfulness Fluency Overall \
--temperature 0.6.
├── annotation/
│ ├── en-zh/
│ │ ├── en-zh-annotation.json
│ │ ├── en-zh-annotation-all.json
│ │ ├── en-zh-annotation-ranked.json
│ │ └── en-zh-annotation-tied.json
│ ├── ja-zh/
│ ├── ru-zh/
│ └── ...
│
├── outputs/
│
├── scripts/
│ ├── extract.py
│ ├── fire_eval.py
│ ├── get_vllm.py
│ ├── prompt.py
│ └── scripts.py
│
├── run.py
├── requirements.txt
└── README.md
annotation/Human annotations and benchmark data organized by language pair. Each language-pair directory contains the full annotation file and split files forall,ranked, andtiedcases.outputs/Generated evaluation outputs. Model predictions and scoring results are saved here during evaluation.scripts/extract.pyUtility script for extracting and splitting annotation data into different subsets.scripts/fire_eval.pyCore FiRE evaluation logic, including dataset loading, API/vLLM querying, result caching, aggregation, and scoring.scripts/get_vllm.pyvLLM wrapper for loading local models and generating batched responses.scripts/prompt.pyPrompt templates and parsing utilities for criterion-based pairwise evaluation.scripts/scripts.pyAdditional evaluation utilities.run.pyMain command-line entry point for running FiRE evaluation.requirements.txtPython dependencies required for running the repository.
Running run.py will evaluate the selected dataset split and save model outputs under outputs/.
A typical output entry contains:
- source sentence;
- translation A and translation B;
- predicted preference label;
- human gold label;
- model output content;
- optional reasoning content, if returned by the evaluator.
The preference labels are:
| Label | Meaning |
|---|---|
A |
Translation A is better |
B |
Translation B is better |
E |
Translation A and Translation B are equivalent |
For questions, please contact:
gaowenyang@westlake.edu.cn