An end-to-end system for detecting problematic preference pairs in RLHF training data, training reward models on filtered vs. unfiltered datasets, and measuring the impact of data quality on model performance.
Built against Anthropic's HH-RLHF dataset (160,800 preference pairs).
Reward models are only as good as the preference data they're trained on. Noisy labels, near-duplicate responses, degenerate text, and labeling bias all degrade reward signal quality. This project identifies and removes those failure modes, then empirically validates the impact by training competing reward models on clean vs. unfiltered data.
Data Quality Pipeline — 160,800 examples scored by 7 detectors:
| Detector | Flagged | Rate | What It Catches |
|---|---|---|---|
| Semantic Similarity | 804 | 0.5% | Near-identical chosen/rejected pairs |
| Readability Mismatch | 8,041 | 5.0% | Rejected response has better readability |
| Repetition | 1,610 | 1.0% | Degenerate repeated text |
| Length Ratio | 8,067 | 5.0% | Suspiciously short responses to complex prompts |
| Refusal Bias | 291 | 0.2% | Chosen refuses while rejected is helpful |
| Unsafe Prompt | 1,680 | 1.0% | Prompts where neither response may be valid |
| LLM Judge (GPT-4o-mini) | TBD | TBD | Semantic pathologies rule-based detectors miss |
| Total unique | 12,693+ | 7.9%+ | |
148,107 clean examples retained (92.1% of dataset).
Reward Model Comparison — DistilBERT (67M params), Bradley-Terry loss, 1 epoch on a T4 GPU:
| Metric | Clean (filtered) | Unfiltered |
|---|---|---|
| Training examples | 148,107 | 160,800 |
| Test accuracy | 62.18% | 62.33% |
| Avg reward gap (chosen − rejected) | 0.2282 | 0.3014 |
Binary accuracy is nearly identical across both models (62.18% vs. 62.33%) — and that's the point. The unfiltered model's larger reward gap (0.3014 vs. 0.2282) looks like stronger performance on the surface, but it reflects inflated confidence learned from degenerate pairs: near-duplicates where the label is arbitrary, readability-mismatched examples where the "rejected" response is objectively better, and refusal bias cases where the chosen response actively refuses a reasonable request.
The clean model produces a tighter, more calibrated reward gap. In production RLHF pipelines, overconfident reward signals from noisy data are a known driver of reward hacking: the model learns to exploit the noise rather than the signal. Filtering 7.9% of pathological examples doesn't hurt accuracy; it removes the false confidence that makes reward models brittle.
Binary accuracy is the wrong metric here. Reward gap calibration is the right one.
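For context, the Bradley-Terry objective treats each preference pair as a binary comparison and maximizes the log-probability that the chosen response outscores the rejected one. A minimal sketch of the loss (the repo's `reward/train.py` implements this; the exact code below is illustrative):

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# The reward gap reported above is just (r_chosen - r_rejected) averaged over the
# test set: a larger gap means more confident, not necessarily more accurate,
# separation of chosen from rejected responses.
```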
Chosen: 36 words | Rejected: 1 word | Ratio: 0.028

The length ratio detector flagged this pair, but the real pathology isn't length; it's refusal behavior. The chosen response engages with a prompt that warrants a clear safety boundary, hedging instead of refusing. The rejected response ("Words?") is unhelpful, but the chosen response is actively worse: it teaches the reward model that circumventing a refusal is preferable to setting one. This is a true positive for the wrong stated reason, and exactly the class of label noise an LLM judge catches that rule-based detectors miss.
LLM judges are not immune to hallucination, and the reason why matters. In the example above, GPT-4o-mini incorrectly characterized the one-word rejected response ("Words?") as providing "detailed analysis," directly inverting the actual label quality. This isn't coincidental. When noisy human labels train a reward model, and PPO optimizes a policy against that corrupted signal, the resulting model learns to mimic the labeler's confusion. When that same model is later used as a judge, it reproduces the failure modes baked into its training. As Neil deGrasse Tyson observed: "One of the great challenges in life is knowing enough to think you're right, but not enough to know you're wrong." The model learned from us, including how to be confidently incorrect. This is why high-quality preference data, created by domain-knowledgeable annotators with stringent rubrics, is foundational, not optional. Garbage in, garbage out, at every layer of the stack. This finding motivated the human audit interface in the dashboard. Judge outputs are a signal, not ground truth.
That said, RLHF is not the direct cause of hallucinations in LLMs; in fact, some papers suggest it can, in certain cases, reduce them. The bigger picture is that the algorithms, the weights, the data ingested, and every other decision used to create, pre-train, and fine-tune a model are littered with human preference.
src/rlhf_eval/
├── api/
│ └── routes.py # FastAPI service (POST /ingest, POST /score, GET /experiments)
├── config/
│ └── settings.py # Pydantic configuration (RLHF_ env prefix)
├── database/
│ ├── models.py # SQLAlchemy ORM models (5 tables)
│ ├── connection.py # Engine, session, context manager
│ └── operations.py # CRUD operations
├── detectors/
│ ├── base.py # Abstract base class with threshold logic
│ ├── semantic_similarity.py # sentence-transformers cosine similarity
│ ├── readability.py # textstat Flesch-Kincaid comparison
│ ├── repetition.py # Unique word ratio + n-gram detection
│ ├── length_ratio.py # Response-to-prompt length ratio
│ ├── refusal_bias.py # Refusal pattern matching
│ ├── unsafe_prompt.py # Toxicity keyword classification
│ └── llm_judge.py # GPT-4o-mini semantic label evaluation
├── pipeline/
│ ├── data_loader.py # HuggingFace dataset ingestion
│ └── quality_pipeline.py # Orchestrates detector runs
├── reward/
│ ├── model.py # RewardModel (DistilBERT + linear head)
│ ├── dataset.py # PreferencePairDataset + data loaders
│ ├── train.py # Bradley-Terry pairwise loss training
│ └── evaluate.py # Accuracy + reward gap evaluation
└── utils/
├── parsing.py # HH-RLHF conversation parsing
└── stats.py # Statistical utilities
dashboard/
└── app.py # Streamlit dashboard (flagged explorer + RM comparison)
scripts/
└── export_flagged_indices.py # Export flagged indices for Colab
notebooks/
└── reward_model_experiment.ipynb # Self-contained Colab notebook
tests/ # pytest suite across 5 test modules
docker-compose.yml # postgres + api + dashboard
Dockerfile
.github/workflows/ci.yml # CI: pytest + mypy + ruff
git clone https://github.com/LewallenAE/rlhf-eval.git
cd rlhf-eval
OPENAI_API_KEY=your-key docker-compose up

This starts PostgreSQL, the FastAPI service (http://localhost:8000), and the Streamlit dashboard (http://localhost:8501).
- API docs: http://localhost:8000/docs
- Dashboard: http://localhost:8501
- Python 3.11+
- PostgreSQL 16
git clone https://github.com/LewallenAE/rlhf-eval.git
cd rlhf-eval
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -e ".[dev]"

Create a .env file in the project root:
RLHF_DATABASE_URL=postgresql://user:pass@localhost:5432/rlhf_dev
OPENAI_API_KEY=your-key-here # required only for LLM judge detector
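The `RLHF_` prefix comes from the Pydantic settings class in `config/settings.py`; environment variables with that prefix populate its fields. A minimal sketch of how such a class might look using the pydantic-settings API (field names beyond `database_url` are illustrative):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Illustrative settings class: RLHF_DATABASE_URL populates database_url, etc."""

    model_config = SettingsConfigDict(env_prefix="RLHF_", env_file=".env")

    database_url: str = "postgresql://user:pass@localhost:5432/rlhf_dev"
    detector_percentile: float = 95.0  # hypothetical field, shown for illustration


settings = Settings()
```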
pytest tests/ -v

The pipeline loads HH-RLHF from HuggingFace, parses conversations, ingests into PostgreSQL, and runs the six rule-based detectors:
from rlhf_eval.database.connection import get_engine, SessionContext
from rlhf_eval.database.models import Base
from rlhf_eval.pipeline.data_loader import load_and_ingest
from rlhf_eval.pipeline.quality_pipeline import run_quality_pipeline
engine = get_engine()
Base.metadata.create_all(engine)
with SessionContext(engine) as session:
    load_and_ingest(session, split="train")

run_quality_pipeline(engine)

After running the quality pipeline, export flagged indices for the Colab experiment:
python scripts/export_flagged_indices.py

This produces flagged_indices.json: a list of dataset indices flagged by any detector.
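To reuse the exported file locally (for example with the training helpers shown further down), load it into a set of indices; this assumes the file is a flat JSON array of integers:

```python
import json

# Assumes flagged_indices.json is a flat JSON array of integer dataset indices.
with open("flagged_indices.json") as f:
    flagged = set(json.load(f))

print(f"{len(flagged)} examples flagged by at least one detector")
```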
- Upload `notebooks/reward_model_experiment.ipynb` to Google Colab
- Set runtime to T4 GPU
- Run all cells; upload `flagged_indices.json` when prompted
- The notebook trains two DistilBERT reward models (clean vs. unfiltered) and compares test accuracy
Training config: distilbert-base-uncased, max_length=256, batch_size=8, lr=2e-5, 1 epoch. ~25 minutes total on a free T4.
from transformers import AutoTokenizer
from rlhf_eval.reward import (
RewardModel,
load_from_huggingface,
train_reward_model,
evaluate_reward_model,
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Load clean dataset (excluding flagged examples)
flagged = {0, 42, 100} # or load from flagged_indices.json
train_ds = load_from_huggingface(tokenizer, split="train", exclude_indices=flagged)
test_ds = load_from_huggingface(tokenizer, split="test")
model = RewardModel()
train_reward_model(model, train_ds, epochs=1, device="cuda")
results = evaluate_reward_model(model, test_ds, device="cuda")
print(f"Test accuracy: {results['accuracy']:.4f}")| Table | Purpose |
|---|---|
examples |
Raw HH-RLHF data with parsed prompt, chosen, and rejected turns |
quality_signals |
Per-example detector scores and flag status |
detector_runs |
Run metadata: thresholds, percentiles, statistics |
reward_models |
Trained model records and training configs |
evaluations |
Evaluation results linked to reward models |
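A rough sketch of how the first two tables might be declared with SQLAlchemy 2.0's typed ORM; the column names here are illustrative, not the exact schema in `database/models.py`:

```python
from sqlalchemy import Boolean, Float, ForeignKey, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Example(Base):
    """One HH-RLHF preference pair (columns are illustrative)."""

    __tablename__ = "examples"

    id: Mapped[int] = mapped_column(primary_key=True)
    prompt: Mapped[str] = mapped_column(String)
    chosen: Mapped[str] = mapped_column(String)
    rejected: Mapped[str] = mapped_column(String)


class QualitySignal(Base):
    """Per-example score from one detector (columns are illustrative)."""

    __tablename__ = "quality_signals"

    id: Mapped[int] = mapped_column(primary_key=True)
    example_id: Mapped[int] = mapped_column(ForeignKey("examples.id"))
    detector_name: Mapped[str] = mapped_column(String)
    score: Mapped[float] = mapped_column(Float)
    flagged: Mapped[bool] = mapped_column(Boolean, default=False)
```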
All detectors extend BaseDetector and implement score() and score_batch(). Thresholds are computed from score distributions using configurable percentiles, with overrides for binary detectors.
| Detector | Method | Threshold Strategy |
|---|---|---|
| Semantic Similarity | `all-MiniLM-L6-v2` cosine similarity between chosen/rejected | P99.5 (flags near-duplicates) |
| Readability Mismatch | Flesch-Kincaid grade level difference via `textstat` | P95 (rejected reads better) |
| Repetition | Unique word ratio in chosen response | P1 (lower = more repetitive) |
| Length Ratio | Response length / prompt length | P5 (lower = suspiciously short) |
| Refusal Bias | Regex pattern matching for refusal phrases | Fixed threshold (binary) |
| Unsafe Prompt | Keyword-based toxicity classification | Fixed threshold (binary) |
| LLM Judge | GPT-4o-mini scores 3 dimensions (1–5): helpfulness delta, honest preference, label confidence | Fixed threshold = 3 (flags any dim < 3) |
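A minimal sketch of the detector contract described above; `score` and `score_batch` are the methods named in the repo, while the percentile and flag-direction logic here is illustrative:

```python
from abc import ABC, abstractmethod

import numpy as np


class BaseDetector(ABC):
    """Illustrative base class: score pairs, learn a flag threshold from a percentile."""

    # Percentile used to turn the score distribution into a threshold.
    # Binary detectors (refusal bias, unsafe prompt) override this with a fixed value.
    percentile: float = 95.0
    # Whether scores above the threshold are flagged (False for detectors such as
    # repetition or length ratio, where unusually low scores are the pathology).
    flag_above: bool = True

    @abstractmethod
    def score(self, prompt: str, chosen: str, rejected: str) -> float:
        """Return a scalar quality score for one preference pair."""

    def score_batch(self, examples: list[tuple[str, str, str]]) -> list[float]:
        """Naive batch implementation; subclasses may vectorize."""
        return [self.score(p, c, r) for p, c, r in examples]

    def flag(self, scores: list[float]) -> list[bool]:
        """Flag examples whose score falls past the percentile-derived threshold."""
        threshold = float(np.percentile(scores, self.percentile))
        return [(s >= threshold) if self.flag_above else (s <= threshold) for s in scores]
```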
- Python 3.11+ with full type annotations
- PostgreSQL 16 + SQLAlchemy 2.0 (sync ORM)
- Pydantic Settings for configuration
- PyTorch + Transformers for reward model training
- sentence-transformers for semantic similarity embeddings
- textstat for readability scoring
- HuggingFace Datasets for data loading
- OpenAI (`gpt-4o-mini`) for LLM judge evaluation
- FastAPI + uvicorn for the service layer (`/docs` auto-generated)
- Streamlit for the interactive dashboard
- Docker Compose for one-command deployment
- GitHub Actions CI (pytest + mypy + ruff)
This system works on any dataset of preference pairs, not just HH-RLHF. The pipeline expects `(prompt, chosen, rejected)` triplets — the same structure used by TL;DR summarization feedback, OpenAssistant, UltraFeedback, and every major RLHF dataset. To run it on a different dataset, implement a loader that produces that schema and calls `ingest_to_database`; all 7 detectors, the reward model training loop, and the LLM judge run unchanged. Detector thresholds are learned from each dataset's own score distribution via configurable percentiles, so no per-dataset recalibration is needed. The result is a general-purpose data quality harness for any reward modeling pipeline.
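As a sketch of what such an adapter might look like (the import path and keyword arguments for `ingest_to_database` are assumptions, not the repo's exact API):

```python
from datasets import load_dataset
from sqlalchemy.orm import Session

from rlhf_eval.database.connection import SessionContext, get_engine
# Assumed import path; the pipeline exposes an ingest_to_database helper.
from rlhf_eval.pipeline.data_loader import ingest_to_database


def ingest_custom_dataset(session: Session, dataset_name: str, split: str = "train") -> None:
    """Hypothetical adapter: map any (prompt, chosen, rejected) dataset onto
    the schema the quality pipeline expects."""
    ds = load_dataset(dataset_name, split=split)
    for row in ds:
        # Rename these keys to match your dataset's column names; the keyword
        # arguments to ingest_to_database are assumed, not taken from the repo.
        ingest_to_database(
            session,
            prompt=row["prompt"],
            chosen=row["chosen"],
            rejected=row["rejected"],
        )


if __name__ == "__main__":
    engine = get_engine()
    with SessionContext(engine) as session:
        ingest_custom_dataset(session, dataset_name="your-org/your-preference-dataset")
```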
MIT

