PEGRL

Improving Machine Translation by Post-Editing Guided Reinforcement Learning

PEGRL is a two-stage reinforcement learning (RL) framework for machine translation (MT) that uses post-editing as an auxiliary task to stabilize training and guide optimization. While RL methods such as GRPO have shown strong promise in LLM-based MT, translation-oriented RL suffers from noisy Monte Carlo return estimates and a large trajectory space, which tends to favor global over local optimization.

PEGRL addresses these challenges in three ways:

- Post-editing as auxiliary supervision – Translation outputs are sampled to construct post-editing inputs, enabling return estimation conditioned on current translation behavior.
- Balancing exploration and local optimization – The framework supports global exploration while promoting fine-grained improvements.
- Task-specific weighting – Translation and post-editing objectives are combined with a weighting scheme, yielding a biased yet more sample-efficient estimator.
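The first and third points can be illustrated with a minimal Python sketch. This is not the repository's implementation: the prompt template, the function names build_post_edit_prompt and combine_objectives, and the weight value are all hypothetical, chosen only to show how a sampled translation could become a post-editing input and how two task objectives could be mixed with a scalar weight.

```python
def build_post_edit_prompt(source: str, draft: str, src_lang: str, tgt_lang: str) -> str:
    """Turn a sampled translation into a post-editing input (hypothetical template)."""
    return (
        f"Improve the following {src_lang}-to-{tgt_lang} translation.\n"
        f"Source: {source}\n"
        f"Draft translation: {draft}\n"
        f"Improved translation:"
    )


def combine_objectives(translation_loss: float, post_edit_loss: float, weight: float) -> float:
    """Weighted sum of the two task objectives (assumed form, not taken from the paper).

    `weight` scales the post-editing term relative to the translation term.
    """
    if weight < 0:
        raise ValueError("weight must be non-negative")
    return translation_loss + weight * post_edit_loss


# A draft sampled from the current policy becomes the input of the second stage.
prompt = build_post_edit_prompt("Good morning.", "Hyvää iltaa.", "English", "Finnish")

# Illustrative numbers only; they do not come from any experiment.
total = combine_objectives(translation_loss=0.8, post_edit_loss=0.4, weight=0.5)
```

Because the post-editing prompt is built from the policy's own draft, the second stage is conditioned on what the model currently produces rather than on a fixed dataset.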

Results: Experiments on English→Finnish, English→Turkish, and English↔Chinese show consistent improvements over standard RL baselines. Notably, English→Turkish COMET-KIWI performance approaches state-of-the-art LLM systems (DeepSeek-V3.2: 68.14 vs. 68.13).

Environment

Our software environment is based on Python 3.11. The main top-level Python packages are verl, sacrebleu, and unbabel-comet. You can build the environment with the following script:

USE_SGLANG=0 USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install sacrebleu "sacrebleu[ja]" "sacrebleu[ko]" unbabel-comet

We also provide a Conda environment file, env.yaml, for reference only. We recommend using the installation script above for the actual setup.

Model

The project uses both policy models and external reward models. The external reward models are COMETKiwi and XCOMET; we recommend managing COMET models through $HF_HOME. For policy models, we recommend creating symbolic links under the models directory:

ln -sfn path/to/your/local/model models/Qwen3-0.6B
ln -sfn path/to/your/local/model models/Qwen3-4B
ln -sfn path/to/your/local/model models/Qwen3-8B

Then download the COMET models. By default they are stored under the location specified by the HF_HOME environment variable, so make sure you know where it points:

hf download Unbabel/wmt23-cometkiwi-da-xl
hf download Unbabel/XCOMET-XL

If hf download is used without the --local-dir option, the models land in the default cache and you do not need to set the COMET model path manually. Refer to comet.yaml if you:

- want to use additional COMET models, or
- need to specify a custom local path for COMET models.
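To check where the default download location is on your machine, a small Python sketch like the one below may help. It follows the documented Hugging Face defaults (HF_HOME falls back to $XDG_CACHE_HOME/huggingface or ~/.cache/huggingface, and the hub cache is $HF_HOME/hub unless HF_HUB_CACHE is set). The helper names resolve_hf_home and hub_cache_dir are hypothetical and are not part of this repository.

```python
import os
from pathlib import Path


def resolve_hf_home(env: dict) -> Path:
    """Return the Hugging Face home directory implied by the given environment."""
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser()
    # Documented default: $XDG_CACHE_HOME/huggingface, else ~/.cache/huggingface.
    cache_root = env.get("XDG_CACHE_HOME") or "~/.cache"
    return (Path(cache_root) / "huggingface").expanduser()


def hub_cache_dir(env: dict) -> Path:
    """Return the hub cache directory used when --local-dir is not given."""
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    return resolve_hf_home(env) / "hub"


if __name__ == "__main__":
    # Print the cache directory for the current shell environment.
    print(hub_cache_dir(dict(os.environ)))
```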

Tip

In multi-GPU settings, using many COMET models is not recommended. The current implementation loads one COMET model per GPU, which can consume a large amount of memory even before the models are invoked.

Data

Simply place the folder from the link into the project root directory, i.e.:

project-root/
├── data/
│   ├── train/
│   ├── test/
│   └── ...
├── verl/
└── scripts/
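As a quick sanity check before launching a run, the layout above can be verified with a few lines of Python. This is only a sketch: the helper missing_dirs is hypothetical, and it checks just the directories named in the tree above, not the files inside them.

```python
import tempfile
from pathlib import Path

# Directories shown in the layout above, relative to the project root.
EXPECTED_DIRS = ("data/train", "data/test", "verl", "scripts")


def missing_dirs(project_root: str) -> list[str]:
    """Return the expected directories that do not exist under project_root."""
    root = Path(project_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Demonstration on a throwaway directory that only has part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    for d in ("data/train", "verl"):
        (Path(tmp) / d).mkdir(parents=True)
    demo_missing = missing_dirs(tmp)  # the two directories that were not created
```

In practice you would call missing_dirs(".") from the project root and expect an empty list.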

Scripts

All experiment scripts are organized by category under train_scripts. The main subdirectory contains the primary experiments, ablation contains the main ablation studies, and weight contains weight analysis experiments. All scripts should be executed from the project root directory. For example:

bash train_scripts/main/train_0_4B_en2fi_ours.sh

For the experiment scripts, you may adjust two per-GPU batch parameters based on the available GPU memory:

ppo_micro_batch_size_per_gpu:
- 32 for A6000 (48G)
- 64 for A100 (80G)
- 128 for H20 (96G)
forward_micro_batch_size (in comet.yaml):
- 64 for A6000 (48G)
- 128 for A100 (80G), H20 (96G)

Other parameters, such as project/experiment naming and model checkpoint saving frequency, can be modified if necessary, but changes beyond these are not recommended.
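The recommendations above can be written as a small lookup, sketched here in Python. The values come from the list above; the helper name suggest_batch_sizes is hypothetical, and applying a tier to any GPU with at least that much memory is an assumption, since only the A6000, A100, and H20 are listed.

```python
# (minimum GPU memory in GB, ppo_micro_batch_size_per_gpu, forward_micro_batch_size)
# Ordered from largest to smallest so the first match is the best tier.
RECOMMENDED = [
    (96, 128, 128),  # H20
    (80, 64, 128),   # A100
    (48, 32, 64),    # A6000
]


def suggest_batch_sizes(gpu_memory_gb: int) -> dict[str, int]:
    """Pick the listed settings for the largest tier that fits in gpu_memory_gb."""
    for min_gb, ppo, forward in RECOMMENDED:
        if gpu_memory_gb >= min_gb:
            return {
                "ppo_micro_batch_size_per_gpu": ppo,
                "forward_micro_batch_size": forward,
            }
    # No setting is listed for GPUs with less than 48 GB.
    raise ValueError("no recommended setting for GPUs with less than 48 GB")


settings = suggest_batch_sizes(80)
```

The first value belongs in the experiment script and the second in comet.yaml, as noted above.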

Q&A

If you have any questions or run into issues, please feel free to open an issue for discussion. You can also reach me at shenyunzhi@smail.nju.edu.cn.

Citation

If you find this repository useful, please consider citing:

@misc{shen2026pegrlimprovingmachinetranslation,
  title={PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning},
  author={Yunzhi Shen and Hao Zhou and Xin Huang and Xue Han and Junlan Feng and Shujian Huang},
  year={2026},
  eprint={2602.03352},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.03352}
}

Acknowledgments

Our codebase is built upon MT-R1-Zero and verl. We extend our sincere thanks to the authors of both projects.
