Say Goodbye to Scalar Reward Models: Exploring the Aha Moment in General RL
OpenRS (Open Rubric System) is an LLM-as-a-Judge evaluation framework that replaces traditional Reward Models with adaptive, fine-grained rubric-based evaluation. The core idea is to use large language models to evaluate response quality through adaptive, query-type-specific rubrics — enabling multi-dimensional scoring with interpretable verdicts.
Figure 1: OpenRS Evaluation Pipeline — From pairwise responses, through verifiable and adaptive rubric generation, to multi-criteria scoring.
- 🎯 Open Rubric: 50+ query-type-specific rubrics with weighted criteria (critical / core / important / highlight)
- ⚖️ Bi-directional Debiasing: Swaps A/B order to eliminate position bias
- 🔍 Critical Flaw Veto: Fatal errors override all other scoring dimensions
- 📊 4 Benchmarks: JudgeBench, PPE, RewardBench V2, RMBench
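The weighted-rubric scoring, bi-directional debiasing, and critical-flaw veto listed above can be sketched as follows. This is a minimal illustrative sketch, not OpenRS's actual API: the tier weights, function names, and verdict encoding are assumptions made for illustration.

```python
# Illustrative sketch of OpenRS-style judging (names/weights are assumptions).
TIER_WEIGHTS = {"critical": 4.0, "core": 3.0, "important": 2.0, "highlight": 1.0}

def score_response(verdicts):
    """verdicts: list of (tier, passed) pairs for one response.

    A failed 'critical' criterion vetoes the response outright (score 0),
    mirroring the Critical Flaw Veto; otherwise sum the weights of the
    satisfied criteria.
    """
    if any(tier == "critical" and not passed for tier, passed in verdicts):
        return 0.0
    return sum(TIER_WEIGHTS[tier] for tier, passed in verdicts if passed)

def debiased_winner(judge, resp_a, resp_b):
    """Bi-directional debiasing: judge both (A, B) and (B, A) orders and
    declare a winner only when the two orders agree; otherwise a tie."""
    first = judge(resp_a, resp_b)      # returns 'A', 'B', or 'tie'
    swapped = judge(resp_b, resp_a)
    flipped = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    return first if first == flipped else "tie"
```

The veto check runs before any weight accumulation, so a single fatal error zeroes the score regardless of how many other criteria pass.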
We evaluate five judge models across four benchmarks:
Table 1: Accuracy (%) of different judge models across four benchmarks.
```bash
git clone https://github.com/Qwen-Applications/OpenRS.git
cd OpenRS
pip install -r requirements.txt
```

OpenRS is compatible with any OpenAI-compatible inference backend (vLLM, SGLang, Ollama, etc.):
```bash
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="your-api-key"
export OPENAI_MODEL_NAME="your-model-name"
```

- JudgeBench / PPE
```bash
python judgebench_and_ppe.py \
    --input data/judgebench/gpt.jsonl \
    --output-dir results/judgebench \
    --annotation judgebench_gpt \
    --workers 50
```

- RewardBench V2
```bash
python rewardbench_v2.py \
    --input data/rewardbench_v2/rewardbench_v2.jsonl \
    --output-dir results/rewardbench_v2 \
    --annotation rbv2 \
    --workers 10
```

- RMBench
```bash
python rmbench.py \
    --input data/rmbench/rmbench.jsonl \
    --output results/rmbench_results.jsonl \
    --workers 10
```

| Argument | Description | Default |
|---|---|---|
| `--input` | Input data path | required |
| `--output-dir` | Output directory | `./results` |
| `--workers` | Number of concurrent threads | `10` |
| `--temperature` | Generation temperature | `0.0` |
| `--limit` | Max items to process (`0` = all) | `0` |
| `--no-resume` | Disable checkpoint resume | `False` |
| `--stats-only` | Report stats without running evaluation | `False` |
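The checkpoint-resume behavior that `--no-resume` disables can be sketched as below: before re-running, collect the ids already written to the output JSONL and skip them. This is a hypothetical sketch; the `id` field name is an assumption for illustration, not OpenRS's actual record schema.

```python
import json
import os

def completed_ids(output_path, key="id"):
    """Return the set of record ids already present in an output JSONL
    file, so a resumed run can skip them. 'key' names the id field in
    each record (assumed to be "id" here for illustration)."""
    done = set()
    if os.path.exists(output_path):
        with open(output_path) as f:
            for line in f:
                if line.strip():               # skip blank lines
                    done.add(json.loads(line)[key])
    return done
```

A re-run would then filter its input to items whose id is not in `completed_ids(...)`, which is why interrupted evaluations pick up where they left off by default.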
```
OpenRS/
├── tools.py                  # API calls, JSON parsing, file I/O
├── evaluator.py              # Core evaluation interface (evaluate_pair)
├── evaluator_precise_if.py   # Precise IF (Instruction Following) evaluator
├── robust_utils.py           # Robustness utilities (Unicode/JSON tolerance)
│
├── judgebench_and_ppe.py     # JudgeBench / PPE evaluation script
├── rewardbench_v2.py         # RewardBench V2 evaluation script
├── rmbench.py                # RMBench evaluation script
│
├── prompts/
│   ├── pairwise_prompts/     # 50+ category-specific pairwise rubrics (.md)
│   ├── pointwise_prompts/    # Precise IF prompts
│   └── verifiable_prompts/   # Ground-truth verification prompts
│
├── data/                     # Evaluation datasets
│   ├── judgebench/
│   ├── ppe/
│   ├── rewardbench_v2/
│   └── rmbench/
│
├── requirements.txt
└── LICENSE                   # Apache License 2.0
```
After evaluation, results are organized by verdict:
```
results/
├── all_results_{annotation}.jsonl              # All results
├── verifiable_good_cases_{annotation}.jsonl    # Verifiable: chosen wins
├── verifiable_bad_cases_{annotation}.jsonl     # Verifiable: rejected wins
├── pairwise_good_cases_{annotation}.jsonl      # Pairwise: chosen wins
├── pairwise_bad_cases_{annotation}.jsonl       # Pairwise: rejected wins
├── pairwise_same_cases_{annotation}.jsonl      # Pairwise: tie
├── error_cases_{annotation}.jsonl              # Evaluation errors
└── summary_{annotation}.json                   # Aggregated statistics
```
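As a rough sanity check, pairwise accuracy can be recomputed from the verdict-split files in this layout by counting records per file. This is a sketch under stated assumptions: treating `good` wins over all decided-plus-tied pairs as "accuracy" is an assumption for illustration; `summary_{annotation}.json` holds the official aggregated statistics.

```python
import os

def pairwise_accuracy(results_dir, annotation):
    """Estimate pairwise accuracy as good / (good + bad + same), where
    each count is the number of JSONL records in the corresponding
    verdict-split file (filename pattern taken from the layout above)."""
    def count(verdict):
        path = os.path.join(
            results_dir, f"pairwise_{verdict}_cases_{annotation}.jsonl"
        )
        if not os.path.exists(path):
            return 0
        with open(path) as f:
            return sum(1 for line in f if line.strip())

    good, bad, same = count("good"), count("bad"), count("same")
    total = good + bad + same
    return good / total if total else 0.0
```

Whether ties count against the judge (as here) or are split 50/50 changes the headline number, so compare against the official summary before quoting figures.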
If you find this work useful, please cite:
```bibtex
@misc{jia2026openrubricsystemscaling,
  title={Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric},
  author={Ruipeng Jia and Yunyi Yang and Yuxin Wu and Yongbo Gai and Siyuan Tao and Mengyu Zhou and Jianhe Lin and Xiaoxi Jiang and Guanjun Jiang},
  year={2026},
  eprint={2602.14069},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.14069},
}
```

This project is licensed under the Apache License 2.0.