Runtime detection of reward hacking and misalignment signals in LLM agents.
RewardHackWatch detects when LLM agents game their evaluations, for example by calling sys.exit(0), patching validators, copying reference answers, or manipulating test harnesses. It also includes an experimental metric, RMGI, for tracking when reward-hacking signals begin to correlate with broader misalignment indicators.
89.7% F1 on 5,391 trajectories from METR's MALT dataset. Motivated by recent findings from METR, OpenAI, and Anthropic on reward hacking, monitorability, and misalignment generalization in agentic systems.
pip install -e .from rewardhackwatch import RewardHackDetector
detector = RewardHackDetector()
result = detector.analyze({
"cot_traces": ["Let me bypass the test by calling sys.exit(0)..."],
"code_outputs": ["import sys\nsys.exit(0)"]
})
print(f"Risk: {result.risk_level}, Score: {result.ml_score:.3f}, Detections: {len(result.detections)}")The DistilBERT classifier is hosted on HuggingFace and auto-downloads on first use. Note: optimal threshold is 0.02 (not 0.5), calibrated for the 3.6% base rate in MALT. See HuggingFace page for details.
rewardhackwatch analyze trajectory.json # Analyze a file
rewardhackwatch scan ./trajectories/ # Scan a directory
rewardhackwatch serve --port 8000 # Start API server
rewardhackwatch dashboard # Launch Streamlit dashboard
rewardhackwatch calibrate ./clean_data/ # Calibrate thresholdMETR reported that recent frontier models modify tests and scoring code to inflate scores without doing real work. OpenAI found that CoT monitoring can catch this, but applying pressure against it teaches models to hide their intent. Anthropic presented evidence that reward hacking can generalize into behaviors such as alignment faking and sabotage in their experimental setting.
RewardHackWatch is an open-source attempt to detect these behaviors at runtime.
- DistilBERT classifier - primary detection signal, ~50ms on CPU, no GPU needed
- 45 regex patterns - fast, interpretable detection of known exploit types
- LLM judges - Claude, OpenAI, or local Llama via Ollama for offline operation
- RMGI metric - experimental tracking of hack-to-misalignment correlation over trajectories
- Eval Workbench - batch-score JSONL trajectory files with custom rules and LLM judge scoring
- Dashboards - Streamlit dashboard via CLI; React 19 frontend in
frontend/for local analysis and evaluation workflows - HackBench - standardized benchmark dataset (4,300+ trajectories, 9 categories)
- Core: Python 3.9+, PyTorch, Hugging Face Transformers, DistilBERT, scikit-learn
- Backend: FastAPI, Uvicorn, Typer (CLI), Pydantic, Streamlit
- Frontend: React 19, TypeScript, Vite, Tailwind CSS v4, Recharts
- Data: NumPy, Pandas, SciPy, Ruptures, SQLite
- Tools: Ruff, MyPy, pytest, Rich
| Metric | Held-Out Test | 5-Fold CV |
|---|---|---|
| F1 Score | 89.7% | 87.4% +/- 2.9% |
| Precision | 89.7% | 91.0% +/- 2.6% |
| Recall | 89.7% | 84.2% +/- 4.0% |
| Accuracy | 99.3% | 99.0% +/- 0.2% |
Validated on 5,391 MALT trajectories. See full per-category breakdown and methodology.
| Method | F1 |
|---|---|
| DistilBERT (Ours) | 89.7% |
| Regex Patterns | 4.9% |
| BoW + LogReg | 7.0% |
| Keyword Matching | 0.1% |
View all 9 pages
Dashboard |
Quick Analysis |
Timeline |
Alerts |
Cross-Model |
CoT Viewer |
Eval Workbench |
Session Logs |
Settings |
export ANTHROPIC_API_KEY="your-key" # For Anthropic LLM judge
export OPENAI_API_KEY="your-key" # For OpenAI LLM judge
export OLLAMA_HOST="http://localhost:11434" # For local Llama judge
export RHW_HACK_THRESHOLD="0.02" # Detection threshold# Calibrate on your own data (recommended)
detector = RewardHackDetector()
detector.calibrate_threshold(clean_trajectories, percentile=99)- Architecture - system design
- Paper - research paper
- HuggingFace Model - pre-trained DistilBERT classifier
- Project Page - overview and links
Architecture
rewardhackwatch/
core/
detectors/ # Pattern (45 regex) + ML (DistilBERT) + AST detection
analyzers/ # CoT analysis, complexity, obfuscation detection
judges/ # LLM judges (Anthropic, OpenAI, Ollama)
trackers/ # RMGI tracking, PELT changepoint, Causal RMGI
calibration.py # Dynamic threshold calibration
training/ # DistilBERT + AttentionClassifier pipelines
eval/ # JSONL loader, rubric scoring, batch analysis
experiments/ # Transfer study, evasion attacks
rhw_bench/ # HackBench dataset, generators, test cases
api/ # FastAPI REST server
cli.py # Command line interface
frontend/ # React 19 + TypeScript + Tailwind CSS v4 dashboard
paper/ # Research paper
Detection Pipeline
Trajectory Input
|
v
+------------------+
| Detection Layer |
+------------------+
| ML Classifier | <-- Primary signal (89.7% F1, DistilBERT)
| Pattern Detector | <-- 45 regex patterns (interpretability)
| AST Analyzer | <-- Code structure analysis
+--------+---------+
|
v
+------------------+
| Analysis Layer |
+------------------+
| CoT Analyzer | <-- Deception detection in reasoning
| Effort Analyzer | <-- Suspicious low-effort solutions
| Complexity Check | <-- Obfuscation detection
+--------+---------+
|
v
+------------------+
| Tracking Layer |
+------------------+
| RMGI Metric | <-- Hack-misalignment correlation
| Causal RMGI | <-- Granger causality
| PELT Detection | <-- Behavioral changepoints
+--------+---------+
|
v
Alert / Report
Related Work
| Tool | Domain | Approach | Scope |
|---|---|---|---|
| RewardHackWatch | LLM agents | Multi-layer (ML + pattern + RMGI) | Runtime detection + generalization tracking |
| RewardScope | Classical RL | Reward model analysis | Reward function debugging |
| OpenAI CoT Monitor | Reasoning models | Chain-of-thought scanning | Internal monitoring (not open-source) |
| SHADE-Arena | LLM agents | Evaluation framework | Benchmark, not detection |
| MALT | LLM agents | Trajectory dataset | Data only, no detector |
Apache License 2.0 - see LICENSE for details.
Copyright 2025-2026 Aerosta
@software{aerosta2025rewardhackwatch,
title={RewardHackWatch: Runtime Detection of Reward Hacking and
Misalignment Generalization in LLM Agents},
author={Aerosta},
year={2025},
url={https://github.com/aerosta/rewardhackwatch}
}Built for AI safety researchers and engineers. Feedback and contributions are welcome. Follow updates on X @aerosta_ai.
















