FiRE: Fine-grained Ranking Evaluation for Machine Translation

Wenyang Gao*¹², Yinghao Yang*², Xi Jin¹, Jing Li³, Yue Zhang²

¹ Zhejiang University | ² School of Engineering, Westlake University | ³ Sichuan Lan-bridge Information Technology Co., Ltd.

* Equal contribution

A reference-free, criterion-driven pairwise evaluation framework for machine translation.

Introduction

Illustrative case of regression-based, error-based, ranking-based evaluation, and our proposed fine-grained ranking evaluation (FiRE).

Reliable evaluation is central to the development of high-quality machine translation (MT) systems. As modern MT systems and large language models increasingly produce fluent and competitive translations, traditional automatic metrics often become less effective at distinguishing subtle quality differences. Overlap-based metrics such as BLEU provide limited semantic sensitivity, while regression-based metrics produce scalar scores that may obscure fine-grained trade-offs between translations. Error-based methods offer richer diagnostic information, but their aggregated scores are not directly optimized for pairwise preference evaluation.

FiRE addresses this gap by formulating machine translation evaluation as a fine-grained, reference-free pairwise ranking problem. Given a source sentence and two translation candidates, FiRE asks an evaluator to compare the candidates under explicit criteria, including:

Faithfulness: whether the translation accurately preserves the source meaning;
Fluency: whether the translation is natural, readable, and grammatically well-formed;
Consistency of Style: whether the translation preserves the tone, register, and stylistic characteristics of the source;
Overall Quality: a synthesized or directly judged holistic preference.

Instead of relying on a single overall judgment, FiRE decomposes translation quality into complementary dimensions and then aggregates criterion-level judgments into an overall decision. This design improves interpretability, provides more actionable diagnostic signals, and better reflects the multi-dimensional nature of human translation preferences.

The repository provides code for running FiRE-style evaluation with either API-based LLM evaluators or local vLLM-backed models, together with human-annotated benchmark data and scripts for evaluation and scoring.

Some Main Findings

The paper reports several key findings:

Fine-grained pairwise evaluation improves alignment with human preferences. FiRE consistently outperforms error-based evaluation methods across faithfulness, fluency, and consistency of style on ranked pairwise data.

Percentage agreement (%) between model evaluators and human annotations on ranked pairwise data. Bold indicates the best performance per criterion and language direction.

	EN→ZH			RU→ZH
	Faithfulness	Fluency	Cons. of Style	Faithfulness	Fluency	Cons. of Style
Error-Based
M-MAD	45.9	25.2	19.3	55.4	24.9	17.5
GEMBA-MQM	37.9	32.9	3.0	39.8	29.9	5.4
Ranking-Based
DeepSeek-R1-FiRE	64.8	68.7	61.4	72.5	77.9	66.3

Criterion-aware aggregation improves overall ranking. Aggregating fine-grained judgments from faithfulness, fluency, and consistency of style yields stronger overall pairwise decisions than direct holistic ranking in several settings.

Percentage agreement (%) between model evaluators and human annotations on ranked overall pairwise data. Bold indicates the best performance per language direction.

	EN→ZH	RU→ZH
Regression-Based
KIWI-XXL	61.4	61.2
XCOMET-XXL	55.7	58.0
MetricX-24-XXL	61.6	67.1
Error-Based
M-MAD	43.6	51.9
GEMBA-MQM	41.5	37.6
Ranking-Based
Ranker-XXL	60.7	61.6
DeepSeek-R1-Direct-Rank	64.3	66.7
DeepSeek-R1-FiRE	65.3	70.1

FiRE provides interpretable system-level diagnosis. Beyond producing an overall ranking, FiRE reveals where MT systems gain or lose performance across different quality dimensions.

Fine-grained ranking of six MT systems based on all pairwise data in EN→ZH (left) and RU→ZH (right).

Quick Start

Step 1. Install Dependencies

We recommend using a clean Python environment.

python -m venv .venv
source .venv/bin/activate

Or with conda:

conda create -n fire python=3.12
conda activate fire

Install the required dependencies:

pip install -r requirements.txt

If you are using a mirror source, for example the Tsinghua PyPI mirror, you can run:

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Step 2. Run FiRE in API Mode

Use --mode api when calling an OpenAI-compatible API endpoint.

python run.py \
    --src-language English \
    --tgt-language Chinese \
    --dataset tied \
    --mode api \
    --model-name Qwen/Qwen3.6-35B-A3B \
    --api-key sk-**** \
    --api-url https://***/v1

Step 3. Run FiRE in vLLM Mode

Use --mode vllm when running a local model through vLLM.

Before running vLLM mode, install vLLM separately:

pip install vllm

Then run:

python run.py \
    --src-language Russian \
    --tgt-language Chinese \
    --dataset tied \
    --mode vllm \
    --model-name QwQ-32B \
    --model-path path-to-model

Dataset Options

The --dataset argument controls which subset of the benchmark is evaluated:

Option	Meaning
`tied`	Tie cases where human annotators judge two translations as equivalent
`ranked`	Distinguishable cases where one translation is preferred
`all`	All evaluation cases

Main Arguments

Argument	Description
`--src-language`	Source language name, e.g.,`English`, `Russian`, `Japanese`
`--tgt-language`	Target language name, e.g.,`Chinese`
`--dataset`	Evaluation split:`all`, `ranked`, or `tied`
`--mode`	Evaluation backend:`api` or `vllm`
`--model-name`	Model name used for API calls and output organization
`--api-key`	API key for API mode
`--api-url`	Base URL for an OpenAI-compatible API endpoint
`--model-path`	Local model path for vLLM mode
`--preferences`	Evaluation criteria; defaults to all supported preferences
`--temperature`	Sampling temperature for LLM generation

Example with selected preferences:

python run.py \
    --src-language English \
    --tgt-language Chinese \
    --dataset ranked \
    --mode api \
    --model-name your-model-name \
    --api-key sk-**** \
    --api-url https://***/v1 \
    --preferences Faithfulness Fluency Overall \
    --temperature 0.6

Repository Structure

.
├── annotation/
│   ├── en-zh/
│   │   ├── en-zh-annotation.json
│   │   ├── en-zh-annotation-all.json
│   │   ├── en-zh-annotation-ranked.json
│   │   └── en-zh-annotation-tied.json
│   ├── ja-zh/
│   ├── ru-zh/
│   └── ...
│
├── outputs/
│
├── scripts/
│   ├── extract.py
│   ├── fire_eval.py
│   ├── get_vllm.py
│   ├── prompt.py
│   └── scripts.py
│
├── run.py
├── requirements.txt
└── README.md

Directory and File Descriptions

annotation/Human annotations and benchmark data organized by language pair. Each language-pair directory contains the full annotation file and split files for all, ranked, and tied cases.
outputs/Generated evaluation outputs. Model predictions and scoring results are saved here during evaluation.
scripts/extract.pyUtility script for extracting and splitting annotation data into different subsets.
scripts/fire_eval.pyCore FiRE evaluation logic, including dataset loading, API/vLLM querying, result caching, aggregation, and scoring.
scripts/get_vllm.pyvLLM wrapper for loading local models and generating batched responses.
scripts/prompt.pyPrompt templates and parsing utilities for criterion-based pairwise evaluation.
scripts/scripts.pyAdditional evaluation utilities.
run.pyMain command-line entry point for running FiRE evaluation.
requirements.txt Python dependencies required for running the repository.

Output Format

Running run.py will evaluate the selected dataset split and save model outputs under outputs/.

A typical output entry contains:

source sentence;
translation A and translation B;
predicted preference label;
human gold label;
model output content;
optional reasoning content, if returned by the evaluator.

The preference labels are:

Label	Meaning
`A`	Translation A is better
`B`	Translation B is better
`E`	Translation A and Translation B are equivalent

Contact

For questions, please contact:

gaowenyang@westlake.edu.cn

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
annotation		annotation
figures		figures
scripts		scripts
.gitignore		.gitignore
readme.md		readme.md
requirements.txt		requirements.txt
run.py		run.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FiRE: Fine-grained Ranking Evaluation for Machine Translation

Introduction

Some Main Findings

Quick Start

Step 1. Install Dependencies

Step 2. Run FiRE in API Mode

Step 3. Run FiRE in vLLM Mode

Dataset Options

Main Arguments

Repository Structure

Directory and File Descriptions

Output Format

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

FiRE: Fine-grained Ranking Evaluation for Machine Translation

Introduction

Some Main Findings

Quick Start

Step 1. Install Dependencies

Step 2. Run FiRE in API Mode

Step 3. Run FiRE in vLLM Mode

Dataset Options

Main Arguments

Repository Structure

Directory and File Descriptions

Output Format

Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages