Say Goodbye to Scalar Reward Models: Exploring the Aha Moment in General RL
OpenRS (Open Rubric System) is an LLM-as-a-Judge evaluation framework that replaces traditional Reward Models with adaptive, fine-grained rubric-based evaluation. The core idea is to use large language models to evaluate response quality through adaptive, query-type-specific rubrics — enabling multi-dimensional scoring with interpretable verdicts.
Figure 1: OpenRS Evaluation Pipeline — From pairwise responses, through verifiable and adaptive rubric generation, to multi-criteria scoring.
- 🎯 Open Rubric: 50+ query-type-specific rubrics with weighted criteria (critical / core / important / highlight)
- ⚖️ Bi-directional Debiasing: Swaps A/B order to eliminate position bias
- 🔍 Critical Flaw Veto: Fatal errors override all other scoring dimensions
- 📊 4 Benchmarks: JudgeBench, PPE, RewardBench V2, RMBench
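The weighted-rubric scoring, bi-directional debiasing, and critical-flaw veto listed above can be sketched as follows. This is a minimal illustrative sketch, not OpenRS's actual API: the tier weights, function names, and verdict encoding are assumptions made for illustration.

```python
# Illustrative sketch of OpenRS-style judging (names/weights are assumptions).
TIER_WEIGHTS = {"critical": 4.0, "core": 3.0, "important": 2.0, "highlight": 1.0}

def score_response(verdicts):
    """verdicts: list of (tier, passed) pairs for one response.

    A failed 'critical' criterion vetoes the response outright (score 0),
    mirroring the Critical Flaw Veto; otherwise sum the weights of the
    satisfied criteria.
    """
    if any(tier == "critical" and not passed for tier, passed in verdicts):
        return 0.0
    return sum(TIER_WEIGHTS[tier] for tier, passed in verdicts if passed)

def debiased_winner(judge, resp_a, resp_b):
    """Bi-directional debiasing: judge both (A, B) and (B, A) orders and
    declare a winner only when the two orders agree; otherwise a tie."""
    first = judge(resp_a, resp_b)      # returns 'A', 'B', or 'tie'
    swapped = judge(resp_b, resp_a)
    flipped = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    return first if first == flipped else "tie"
```

The veto check runs before any weight accumulation, so a single fatal error zeroes the score regardless of how many other criteria pass.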
We evaluate five judge models across four benchmarks:
Table 1: Accuracy (%) of different judge models across four benchmarks.
```bash
git clone https://github.com/Qwen-Applications/OpenRS.git
cd OpenRS
pip install -r requirements.txt
```

OpenRS is compatible with any OpenAI-compatible inference backend (vLLM, SGLang, Ollama, etc.):
```bash
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="your-api-key"
export OPENAI_MODEL_NAME="your-model-name"
```

- JudgeBench / PPE
```bash
python judgebench_and_ppe.py \
    --input data/judgebench/gpt.jsonl \
    --output-dir results/judgebench \
    --annotation judgebench_gpt \
    --workers 50
```

- RewardBench V2
```bash
python rewardbench_v2.py \
    --input data/rewardbench_v2/rewardbench_v2.jsonl \
    --output-dir results/rewardbench_v2 \
    --annotation rbv2 \
    --workers 10
```

- RMBench
```bash
python rmbench.py \
    --input data/rmbench/rmbench.jsonl \
    --output results/rmbench_results.jsonl \
    --workers 10
```

| Argument | Description | Default |
|---|---|---|
| `--input` | Input data path | required |
| `--output-dir` | Output directory | `./results` |
| `--workers` | Number of concurrent threads | `10` |
| `--temperature` | Generation temperature | `0.0` |
| `--limit` | Max items to process (`0` = all) | `0` |
| `--no-resume` | Disable checkpoint resume | `False` |
| `--stats-only` | Report stats without running evaluation | `False` |
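The checkpoint-resume behavior that `--no-resume` disables can be sketched as below: before re-running, collect the ids already written to the output JSONL and skip them. This is a hypothetical sketch; the `id` field name is an assumption for illustration, not OpenRS's actual record schema.

```python
import json
import os

def completed_ids(output_path, key="id"):
    """Return the set of record ids already present in an output JSONL
    file, so a resumed run can skip them. 'key' names the id field in
    each record (assumed to be "id" here for illustration)."""
    done = set()
    if os.path.exists(output_path):
        with open(output_path) as f:
            for line in f:
                if line.strip():               # skip blank lines
                    done.add(json.loads(line)[key])
    return done
```

A re-run would then filter its input to items whose id is not in `completed_ids(...)`, which is why interrupted evaluations pick up where they left off by default.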
```
OpenRS/
├── tools.py                  # API calls, JSON parsing, file I/O
├── evaluator.py              # Core evaluation interface (evaluate_pair)
├── evaluator_precise_if.py   # Precise IF (Instruction Following) evaluator
├── robust_utils.py           # Robustness utilities (Unicode/JSON tolerance)
│
├── judgebench_and_ppe.py     # JudgeBench / PPE evaluation script
├── rewardbench_v2.py         # RewardBench V2 evaluation script
├── rmbench.py                # RMBench evaluation script
│
├── prompts/
│   ├── pairwise_prompts/     # 50+ category-specific pairwise rubrics (.md)
│   ├── pointwise_prompts/    # Precise IF prompts
│   └── verifiable_prompts/   # Ground-truth verification prompts
│
├── data/                     # Evaluation datasets
│   ├── judgebench/
│   ├── ppe/
│   ├── rewardbench_v2/
│   └── rmbench/
│
├── requirements.txt
└── LICENSE                   # Apache License 2.0
```
After evaluation, results are organized by verdict:
```
results/
├── all_results_{annotation}.jsonl              # All results
├── verifiable_good_cases_{annotation}.jsonl    # Verifiable: chosen wins
├── verifiable_bad_cases_{annotation}.jsonl     # Verifiable: rejected wins
├── pairwise_good_cases_{annotation}.jsonl      # Pairwise: chosen wins
├── pairwise_bad_cases_{annotation}.jsonl       # Pairwise: rejected wins
├── pairwise_same_cases_{annotation}.jsonl      # Pairwise: tie
├── error_cases_{annotation}.jsonl              # Evaluation errors
└── summary_{annotation}.json                   # Aggregated statistics
```
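As a rough sanity check, pairwise accuracy can be recomputed from the verdict-split files in this layout by counting records per file. This is a sketch under stated assumptions: treating `good` wins over all decided-plus-tied pairs as "accuracy" is an assumption for illustration; `summary_{annotation}.json` holds the official aggregated statistics.

```python
import os

def pairwise_accuracy(results_dir, annotation):
    """Estimate pairwise accuracy as good / (good + bad + same), where
    each count is the number of JSONL records in the corresponding
    verdict-split file (filename pattern taken from the layout above)."""
    def count(verdict):
        path = os.path.join(
            results_dir, f"pairwise_{verdict}_cases_{annotation}.jsonl"
        )
        if not os.path.exists(path):
            return 0
        with open(path) as f:
            return sum(1 for line in f if line.strip())

    good, bad, same = count("good"), count("bad"), count("same")
    total = good + bad + same
    return good / total if total else 0.0
```

Whether ties count against the judge (as here) or are split 50/50 changes the headline number, so compare against the official summary before quoting figures.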
If you find this work useful, please cite:
```bibtex
@misc{jia2026openrubricsystemscaling,
  title={Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric},
  author={Ruipeng Jia and Yunyi Yang and Yuxin Wu and Yongbo Gai and Siyuan Tao and Mengyu Zhou and Jianhe Lin and Xiaoxi Jiang and Guanjun Jiang},
  year={2026},
  eprint={2602.14069},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.14069},
}
```

This project is licensed under the Apache License 2.0.