Official implementation of "Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training"
Long reasoning models often struggle with multilingual settings: they tend to reason in English for non-English questions, and when forced to reason in the question language, performance drops substantially. TRIT addresses this by integrating translation training directly into multilingual reasoning through a self-improving reinforcement learning framework.
Key Innovation: TRIT creates a closed feedback loop where:
- Translation provides multilingual question data for reasoning
- Reasoning accuracy provides the quality signal for translation (see the sketch after this list)
- No external feedback or additional multilingual data required
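As a rough illustration of this loop, here is a minimal sketch assuming hypothetical `translate` and `reason` callables and a 0/1 exact-match reward; it is not the repository's actual reward code:

```python
# Illustrative sketch of the TRIT feedback loop: the model translates a question,
# reasons over its own translation, and the reasoning outcome is reused as the
# translation reward. Function names and signatures are assumptions.
from typing import Callable

def trit_step(
    question_en: str,
    gold_answer: str,
    translate: Callable[[str, str], str],   # (question, target_lang) -> translation
    reason: Callable[[str], str],            # translated question -> final answer
    target_lang: str = "zh",
) -> dict:
    """One conceptual rollout of the translation-reasoning loop."""
    translated = translate(question_en, target_lang)
    answer = reason(translated)
    reasoning_reward = float(answer.strip() == gold_answer.strip())
    # No external judge: reasoning accuracy doubles as the translation quality signal.
    return {
        "translation": translated,
        "reasoning_reward": reasoning_reward,
        "translation_reward": reasoning_reward,
    }
```

Because the same rollout produces both signals, the translation task needs no external judge and no reference translations.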
Two-Stage Process:
- Stage 1, Cross-Lingual Reasoning: filter questions by an accuracy threshold to ensure reliable feedback
- Stage 2, Translation-Reasoning Integration: train translation and reasoning jointly so that each improves the other
All tasks are optimized using GRPO (Group Relative Policy Optimization).
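The core of GRPO is a group-relative advantage: rewards for a group of rollouts sampled from the same question are normalized by the group mean and standard deviation, so no value network is needed. A minimal sketch (the actual training loop lives in VeRL; the function name here is an assumption):

```python
# Minimal sketch of the group-relative advantage used by GRPO (illustrative only).
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within a group of rollouts sampled for the same question.

    rewards: shape (G,) -- one scalar reward per rollout in the group.
    Returns per-rollout advantages; no learned value function is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts for one question, rewarded 1 if the final answer is correct
# (and, in TRIT, the reasoning stays in the question language), else 0.
advs = group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.0]))
```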
TRIT achieves:
- +7 percentage points average improvement over SLC-RL baseline across three models
- +5 percentage points over M-Thinker on Qwen3 models
- Near-perfect language consistency (>99%) across all settings
TRIT improves translation quality both in-domain and out-of-domain:
- In-domain (MATH500): Up to 3.3:1 win-to-loss ratio vs baseline
- Out-of-domain (FLORES-200): Up to +8.4 COMET points
Translation training induces question-level alignment (one way to probe this is sketched after the list below):
- +15.9 percentage points improvement in final-layer similarity (DeepSeek-Distill-1.5B)
- Substantially higher alignment across all model layers compared to External-Translation baseline
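As a hedged sketch of how such alignment can be probed, the snippet below compares mean-pooled hidden states of a question and its translation with cosine similarity; the checkpoint id, pooling choice, and layer index are assumptions rather than the paper's exact protocol.

```python
# Rough sketch of probing question-level alignment: cosine similarity between
# mean-pooled hidden states of a question and its translation.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

def question_embedding(text: str, layer: int = -1) -> torch.Tensor:
    """Mean-pool token states from one hidden layer (default: final layer)."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

en = question_embedding("What is the sum of the first 10 positive integers?")
zh = question_embedding("前10个正整数的和是多少？")
alignment = torch.cosine_similarity(en, zh, dim=0).item()
```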
TRIT remains effective even when reasoning language is not constrained:
- 52.1% accuracy when models can reason in any language (Qwen3-1.7B)
- +4.1pp improvement over SLC-RL in flexible setting
- Demonstrates TRIT improves question understanding, not just language consistency
Optimal filtering threshold θ = 1/3 balances noise reduction and data retention.
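A minimal sketch of this filter, under the assumption that a question is kept when its sampled-rollout accuracy reaches θ (the helper names and interface are hypothetical):

```python
# Hedged sketch of the Stage-1 accuracy filter: keep a question only if the model's
# rollout accuracy reaches theta, so the reasoning reward on it is a reliable signal.
from typing import Callable, List

THETA = 1 / 3  # filtering threshold reported as optimal in the paper

def filter_questions(
    questions: List[dict],
    rollout_accuracy: Callable[[dict], float],  # assumed: fraction of correct rollouts
) -> List[dict]:
    """Keep questions whose sampled-rollout accuracy is at least THETA."""
    return [q for q in questions if rollout_accuracy(q) >= THETA]
```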
git clone https://github.com/NJUNLP/TRIT.git
cd TRIT
pip install -r requirements.txt
Download training data from Hugging Face.
Stage 1: Cold-start Training
We use LlamaFactory for supervised fine-tuning. Configuration: scripts/sft.yaml
llamafactory-cli train scripts/sft.yaml
Stage 2: TRIT Training
We use VeRL for reinforcement learning. Example script: scripts/example.sh
bash scripts/example.sh
- Translation-Reasoning Integration is the Core Innovation
  - Translation training improves question understanding
  - Reasoning feedback guides translation quality
  - Joint optimization creates a self-improving loop
- Question-Level Alignment Matters
  - TRIT induces aligned cross-lingual representations
  - External translations alone don't achieve this alignment
  - Alignment correlates with reasoning improvements
- Framework is Flexible and Robust
  - Works across models with varying multilingual capabilities
  - Effective in both constrained and flexible reasoning settings
  - Supports iterative training for continual improvement
This work is supported by the National Natural Science Foundation of China (No. 62376116), the research project of the Nanjing University-China Mobile Joint Institute (NJ20250038), and the Fundamental Research Funds for the Central Universities (No. 2024300507).
We thank the authors of DAPO-MATH, M-Thinker, and GRPO for their open-source contributions.
This project is licensed under the MIT License - see the LICENSE file for details.





