RewardHackWatch

Runtime detection of reward hacking and misalignment signals in LLM agents.

Demo

RewardHackWatch detects when LLM agents game their evaluations, for example by calling sys.exit(0), patching validators, copying reference answers, or manipulating test harnesses. It also includes an experimental metric, RMGI, for tracking when reward-hacking signals begin to correlate with broader misalignment indicators.

89.7% F1 on 5,391 trajectories from METR's MALT dataset. Motivated by recent findings from METR, OpenAI, and Anthropic on reward hacking, monitorability, and misalignment generalization in agentic systems.

Quick Start

pip install -e .

from rewardhackwatch import RewardHackDetector

detector = RewardHackDetector()
result = detector.analyze({
    "cot_traces": ["Let me bypass the test by calling sys.exit(0)..."],
    "code_outputs": ["import sys\nsys.exit(0)"]
})
print(f"Risk: {result.risk_level}, Score: {result.ml_score:.3f}, Detections: {len(result.detections)}")

The DistilBERT classifier is hosted on HuggingFace and auto-downloads on first use. Note: optimal threshold is 0.02 (not 0.5), calibrated for the 3.6% base rate in MALT. See HuggingFace page for details.

CLI

rewardhackwatch analyze trajectory.json   # Analyze a file
rewardhackwatch scan ./trajectories/      # Scan a directory
rewardhackwatch serve --port 8000         # Start API server
rewardhackwatch dashboard                 # Launch Streamlit dashboard
rewardhackwatch calibrate ./clean_data/   # Calibrate threshold

Why This Matters

METR reported that recent frontier models modify tests and scoring code to inflate scores without doing real work. OpenAI found that CoT monitoring can catch this, but applying pressure against it teaches models to hide their intent. Anthropic presented evidence that reward hacking can generalize into behaviors such as alignment faking and sabotage in their experimental setting.

RewardHackWatch is an open-source attempt to detect these behaviors at runtime.

Core Features

DistilBERT classifier - primary detection signal, ~50ms on CPU, no GPU needed
45 regex patterns - fast, interpretable detection of known exploit types
LLM judges - Claude, OpenAI, or local Llama via Ollama for offline operation
RMGI metric - experimental tracking of hack-to-misalignment correlation over trajectories
Eval Workbench - batch-score JSONL trajectory files with custom rules and LLM judge scoring
Dashboards - Streamlit dashboard via CLI; React 19 frontend in frontend/ for local analysis and evaluation workflows
HackBench - standardized benchmark dataset (4,300+ trajectories, 9 categories)

Tech Stack

Core: Python 3.9+, PyTorch, Hugging Face Transformers, DistilBERT, scikit-learn
Backend: FastAPI, Uvicorn, Typer (CLI), Pydantic, Streamlit
Frontend: React 19, TypeScript, Vite, Tailwind CSS v4, Recharts
Data: NumPy, Pandas, SciPy, Ruptures, SQLite
Tools: Ruff, MyPy, pytest, Rich

Results

Metric	Held-Out Test	5-Fold CV
F1 Score	89.7%	87.4% +/- 2.9%
Precision	89.7%	91.0% +/- 2.6%
Recall	89.7%	84.2% +/- 4.0%
Accuracy	99.3%	99.0% +/- 0.2%

Validated on 5,391 MALT trajectories. See full per-category breakdown and methodology.

Method	F1
DistilBERT (Ours)	89.7%
Regex Patterns	4.9%
BoW + LogReg	7.0%
Keyword Matching	0.1%

Screenshots

View all 9 pages


Dashboard	Quick Analysis
Timeline	Alerts
Cross-Model	CoT Viewer
Eval Workbench	Session Logs
Settings

Configuration

export ANTHROPIC_API_KEY="your-key"        # For Anthropic LLM judge
export OPENAI_API_KEY="your-key"           # For OpenAI LLM judge
export OLLAMA_HOST="http://localhost:11434" # For local Llama judge
export RHW_HACK_THRESHOLD="0.02"           # Detection threshold

# Calibrate on your own data (recommended)
detector = RewardHackDetector()
detector.calibrate_threshold(clean_trajectories, percentile=99)

Documentation

Architecture - system design
Paper - research paper
HuggingFace Model - pre-trained DistilBERT classifier
Project Page - overview and links

Architecture

rewardhackwatch/
  core/
    detectors/           # Pattern (45 regex) + ML (DistilBERT) + AST detection
    analyzers/           # CoT analysis, complexity, obfuscation detection
    judges/              # LLM judges (Anthropic, OpenAI, Ollama)
    trackers/            # RMGI tracking, PELT changepoint, Causal RMGI
    calibration.py       # Dynamic threshold calibration
  training/              # DistilBERT + AttentionClassifier pipelines
  eval/                  # JSONL loader, rubric scoring, batch analysis
  experiments/           # Transfer study, evasion attacks
  rhw_bench/             # HackBench dataset, generators, test cases
  api/                   # FastAPI REST server
  cli.py                 # Command line interface
frontend/                # React 19 + TypeScript + Tailwind CSS v4 dashboard
paper/                   # Research paper

Detection Pipeline

Trajectory Input
       |
       v
+------------------+
|  Detection Layer  |
+------------------+
| ML Classifier     | <-- Primary signal (89.7% F1, DistilBERT)
| Pattern Detector  | <-- 45 regex patterns (interpretability)
| AST Analyzer      | <-- Code structure analysis
+--------+---------+
         |
         v
+------------------+
|  Analysis Layer   |
+------------------+
| CoT Analyzer      | <-- Deception detection in reasoning
| Effort Analyzer   | <-- Suspicious low-effort solutions
| Complexity Check  | <-- Obfuscation detection
+--------+---------+
         |
         v
+------------------+
|  Tracking Layer   |
+------------------+
| RMGI Metric       | <-- Hack-misalignment correlation
| Causal RMGI       | <-- Granger causality
| PELT Detection    | <-- Behavioral changepoints
+--------+---------+
         |
         v
    Alert / Report

Related Work

Tool	Domain	Approach	Scope
RewardHackWatch	LLM agents	Multi-layer (ML + pattern + RMGI)	Runtime detection + generalization tracking
RewardScope	Classical RL	Reward model analysis	Reward function debugging
OpenAI CoT Monitor	Reasoning models	Chain-of-thought scanning	Internal monitoring (not open-source)
SHADE-Arena	LLM agents	Evaluation framework	Benchmark, not detection
MALT	LLM agents	Trajectory dataset	Data only, no detector

Figures

RMGI Transition	Architecture

Benchmark	Categories

Threshold	Calibration

License

Apache License 2.0 - see LICENSE for details.

Citation

@software{aerosta2025rewardhackwatch,
  title={RewardHackWatch: Runtime Detection of Reward Hacking and
         Misalignment Generalization in LLM Agents},
  author={Aerosta},
  year={2025},
  url={https://github.com/aerosta/rewardhackwatch}
}

Built for AI safety researchers and engineers. Feedback and contributions are welcome. Follow updates on X @aerosta_ai.

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
.github		.github
assets		assets
configs		configs
data/benchmark		data/benchmark
docs		docs
examples		examples
figures		figures
frontend		frontend
models		models
notebooks		notebooks
paper		paper
reports		reports
results		results
rewardhackwatch		rewardhackwatch
scripts		scripts
tests		tests
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
NOTICE		NOTICE
README.md		README.md
SECURITY.md		SECURITY.md
docker-compose.yml		docker-compose.yml
environment.yml		environment.yml
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RewardHackWatch

Demo

Quick Start

CLI

Why This Matters

Core Features

Tech Stack

Results

Screenshots

Configuration

Documentation

License

Citation

About

Uh oh!

Releases 3

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RewardHackWatch

Demo

Quick Start

CLI

Why This Matters

Core Features

Tech Stack

Results

Screenshots

Configuration

Documentation

License

Citation

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages