Skip to content

wygao8/FiRE-MT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FiRE: Fine-grained Ranking Evaluation for Machine Translation

ICML 2026 Accepted Python 3.12 License

Wenyang Gao*¹², Yinghao Yang*², Xi Jin¹, Jing Li³, Yue Zhang²

¹ Zhejiang University  |  ² School of Engineering, Westlake University  |  ³ Sichuan Lan-bridge Information Technology Co., Ltd.

* Equal contribution

A reference-free, criterion-driven pairwise evaluation framework for machine translation.

Introduction


Illustrative case of regression-based, error-based, ranking-based evaluation, and our proposed fine-grained ranking evaluation (FiRE).

Reliable evaluation is central to the development of high-quality machine translation (MT) systems. As modern MT systems and large language models increasingly produce fluent and competitive translations, traditional automatic metrics often become less effective at distinguishing subtle quality differences. Overlap-based metrics such as BLEU provide limited semantic sensitivity, while regression-based metrics produce scalar scores that may obscure fine-grained trade-offs between translations. Error-based methods offer richer diagnostic information, but their aggregated scores are not directly optimized for pairwise preference evaluation.

FiRE addresses this gap by formulating machine translation evaluation as a fine-grained, reference-free pairwise ranking problem. Given a source sentence and two translation candidates, FiRE asks an evaluator to compare the candidates under explicit criteria, including:

  • Faithfulness: whether the translation accurately preserves the source meaning;
  • Fluency: whether the translation is natural, readable, and grammatically well-formed;
  • Consistency of Style: whether the translation preserves the tone, register, and stylistic characteristics of the source;
  • Overall Quality: a synthesized or directly judged holistic preference.

Instead of relying on a single overall judgment, FiRE decomposes translation quality into complementary dimensions and then aggregates criterion-level judgments into an overall decision. This design improves interpretability, provides more actionable diagnostic signals, and better reflects the multi-dimensional nature of human translation preferences.

The repository provides code for running FiRE-style evaluation with either API-based LLM evaluators or local vLLM-backed models, together with human-annotated benchmark data and scripts for evaluation and scoring.

Some Main Findings

The paper reports several key findings:

  • Fine-grained pairwise evaluation improves alignment with human preferences. FiRE consistently outperforms error-based evaluation methods across faithfulness, fluency, and consistency of style on ranked pairwise data.
Percentage agreement (%) between model evaluators and human annotations on ranked pairwise data. Bold indicates the best performance per criterion and language direction.
EN→ZH RU→ZH
FaithfulnessFluencyCons. of Style FaithfulnessFluencyCons. of Style
Error-Based
M-MAD45.925.219.355.424.917.5
GEMBA-MQM37.932.93.039.829.95.4
Ranking-Based
DeepSeek-R1-FiRE64.868.761.472.577.966.3
  • Criterion-aware aggregation improves overall ranking. Aggregating fine-grained judgments from faithfulness, fluency, and consistency of style yields stronger overall pairwise decisions than direct holistic ranking in several settings.
Percentage agreement (%) between model evaluators and human annotations on ranked overall pairwise data. Bold indicates the best performance per language direction.
EN→ZH RU→ZH
Regression-Based
KIWI-XXL61.461.2
XCOMET-XXL55.758.0
MetricX-24-XXL61.667.1
Error-Based
M-MAD43.651.9
GEMBA-MQM41.537.6
Ranking-Based
Ranker-XXL60.761.6
DeepSeek-R1-Direct-Rank64.366.7
DeepSeek-R1-FiRE65.370.1
  • FiRE provides interpretable system-level diagnosis. Beyond producing an overall ranking, FiRE reveals where MT systems gain or lose performance across different quality dimensions.


Fine-grained ranking of six MT systems based on all pairwise data in EN→ZH (left) and RU→ZH (right).

Quick Start

Step 1. Install Dependencies

We recommend using a clean Python environment.

python -m venv .venv
source .venv/bin/activate

Or with conda:

conda create -n fire python=3.12
conda activate fire

Install the required dependencies:

pip install -r requirements.txt

If you are using a mirror source, for example the Tsinghua PyPI mirror, you can run:

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Step 2. Run FiRE in API Mode

Use --mode api when calling an OpenAI-compatible API endpoint.

python run.py \
    --src-language English \
    --tgt-language Chinese \
    --dataset tied \
    --mode api \
    --model-name Qwen/Qwen3.6-35B-A3B \
    --api-key sk-**** \
    --api-url https://***/v1

Step 3. Run FiRE in vLLM Mode

Use --mode vllm when running a local model through vLLM.

Before running vLLM mode, install vLLM separately:

pip install vllm

Then run:

python run.py \
    --src-language Russian \
    --tgt-language Chinese \
    --dataset tied \
    --mode vllm \
    --model-name QwQ-32B \
    --model-path path-to-model

Dataset Options

The --dataset argument controls which subset of the benchmark is evaluated:

Option Meaning
tied Tie cases where human annotators judge two translations as equivalent
ranked Distinguishable cases where one translation is preferred
all All evaluation cases

Main Arguments

Argument Description
--src-language Source language name, e.g.,English, Russian, Japanese
--tgt-language Target language name, e.g.,Chinese
--dataset Evaluation split:all, ranked, or tied
--mode Evaluation backend:api or vllm
--model-name Model name used for API calls and output organization
--api-key API key for API mode
--api-url Base URL for an OpenAI-compatible API endpoint
--model-path Local model path for vLLM mode
--preferences Evaluation criteria; defaults to all supported preferences
--temperature Sampling temperature for LLM generation

Example with selected preferences:

python run.py \
    --src-language English \
    --tgt-language Chinese \
    --dataset ranked \
    --mode api \
    --model-name your-model-name \
    --api-key sk-**** \
    --api-url https://***/v1 \
    --preferences Faithfulness Fluency Overall \
    --temperature 0.6

Repository Structure

.
├── annotation/
│   ├── en-zh/
│   │   ├── en-zh-annotation.json
│   │   ├── en-zh-annotation-all.json
│   │   ├── en-zh-annotation-ranked.json
│   │   └── en-zh-annotation-tied.json
│   ├── ja-zh/
│   ├── ru-zh/
│   └── ...
│
├── outputs/
│
├── scripts/
│   ├── extract.py
│   ├── fire_eval.py
│   ├── get_vllm.py
│   ├── prompt.py
│   └── scripts.py
│
├── run.py
├── requirements.txt
└── README.md

Directory and File Descriptions

  • annotation/Human annotations and benchmark data organized by language pair. Each language-pair directory contains the full annotation file and split files for all, ranked, and tied cases.
  • outputs/Generated evaluation outputs. Model predictions and scoring results are saved here during evaluation.
  • scripts/extract.pyUtility script for extracting and splitting annotation data into different subsets.
  • scripts/fire_eval.pyCore FiRE evaluation logic, including dataset loading, API/vLLM querying, result caching, aggregation, and scoring.
  • scripts/get_vllm.pyvLLM wrapper for loading local models and generating batched responses.
  • scripts/prompt.pyPrompt templates and parsing utilities for criterion-based pairwise evaluation.
  • scripts/scripts.pyAdditional evaluation utilities.
  • run.pyMain command-line entry point for running FiRE evaluation.
  • requirements.txt Python dependencies required for running the repository.

Output Format

Running run.py will evaluate the selected dataset split and save model outputs under outputs/.

A typical output entry contains:

  • source sentence;
  • translation A and translation B;
  • predicted preference label;
  • human gold label;
  • model output content;
  • optional reasoning content, if returned by the evaluator.

The preference labels are:

Label Meaning
A Translation A is better
B Translation B is better
E Translation A and Translation B are equivalent

Contact

For questions, please contact:

gaowenyang@westlake.edu.cn

About

[ICML 2026] The official repository of "FiRE: Fine-Grained Ranking Evaluation for Machine Translation"

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages