Skip to content

aerosta/rewardhackwatch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

69 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

RewardHackWatch

Runtime detection of reward hacking and misalignment signals in LLM agents.

Python 3.9+ License: Apache 2.0 CI

RewardHackWatch Dashboard

Demo

RewardHackWatch Demo

RewardHackWatch detects when LLM agents game their evaluations, for example by calling sys.exit(0), patching validators, copying reference answers, or manipulating test harnesses. It also includes an experimental metric, RMGI, for tracking when reward-hacking signals begin to correlate with broader misalignment indicators.

89.7% F1 on 5,391 trajectories from METR's MALT dataset. Motivated by recent findings from METR, OpenAI, and Anthropic on reward hacking, monitorability, and misalignment generalization in agentic systems.

Quick Start

pip install -e .
from rewardhackwatch import RewardHackDetector

detector = RewardHackDetector()
result = detector.analyze({
    "cot_traces": ["Let me bypass the test by calling sys.exit(0)..."],
    "code_outputs": ["import sys\nsys.exit(0)"]
})
print(f"Risk: {result.risk_level}, Score: {result.ml_score:.3f}, Detections: {len(result.detections)}")

The DistilBERT classifier is hosted on HuggingFace and auto-downloads on first use. Note: optimal threshold is 0.02 (not 0.5), calibrated for the 3.6% base rate in MALT. See HuggingFace page for details.

CLI

rewardhackwatch analyze trajectory.json   # Analyze a file
rewardhackwatch scan ./trajectories/      # Scan a directory
rewardhackwatch serve --port 8000         # Start API server
rewardhackwatch dashboard                 # Launch Streamlit dashboard
rewardhackwatch calibrate ./clean_data/   # Calibrate threshold

Why This Matters

METR reported that recent frontier models modify tests and scoring code to inflate scores without doing real work. OpenAI found that CoT monitoring can catch this, but applying pressure against it teaches models to hide their intent. Anthropic presented evidence that reward hacking can generalize into behaviors such as alignment faking and sabotage in their experimental setting.

RewardHackWatch is an open-source attempt to detect these behaviors at runtime.

Core Features

  • DistilBERT classifier - primary detection signal, ~50ms on CPU, no GPU needed
  • 45 regex patterns - fast, interpretable detection of known exploit types
  • LLM judges - Claude, OpenAI, or local Llama via Ollama for offline operation
  • RMGI metric - experimental tracking of hack-to-misalignment correlation over trajectories
  • Eval Workbench - batch-score JSONL trajectory files with custom rules and LLM judge scoring
  • Dashboards - Streamlit dashboard via CLI; React 19 frontend in frontend/ for local analysis and evaluation workflows
  • HackBench - standardized benchmark dataset (4,300+ trajectories, 9 categories)

Tech Stack

  • Core: Python 3.9+, PyTorch, Hugging Face Transformers, DistilBERT, scikit-learn
  • Backend: FastAPI, Uvicorn, Typer (CLI), Pydantic, Streamlit
  • Frontend: React 19, TypeScript, Vite, Tailwind CSS v4, Recharts
  • Data: NumPy, Pandas, SciPy, Ruptures, SQLite
  • Tools: Ruff, MyPy, pytest, Rich

Results

Metric Held-Out Test 5-Fold CV
F1 Score 89.7% 87.4% +/- 2.9%
Precision 89.7% 91.0% +/- 2.6%
Recall 89.7% 84.2% +/- 4.0%
Accuracy 99.3% 99.0% +/- 0.2%

Validated on 5,391 MALT trajectories. See full per-category breakdown and methodology.

Method F1
DistilBERT (Ours) 89.7%
Regex Patterns 4.9%
BoW + LogReg 7.0%
Keyword Matching 0.1%

Screenshots

View all 9 pages
Dashboard Dashboard Quick Analysis Quick Analysis
Timeline Timeline Alerts Alerts
Cross-Model Cross-Model CoT Viewer CoT Viewer
Eval Workbench Eval Workbench Session Logs Session Logs
Settings Settings

Configuration

export ANTHROPIC_API_KEY="your-key"        # For Anthropic LLM judge
export OPENAI_API_KEY="your-key"           # For OpenAI LLM judge
export OLLAMA_HOST="http://localhost:11434" # For local Llama judge
export RHW_HACK_THRESHOLD="0.02"           # Detection threshold
# Calibrate on your own data (recommended)
detector = RewardHackDetector()
detector.calibrate_threshold(clean_trajectories, percentile=99)

Documentation

Architecture
rewardhackwatch/
  core/
    detectors/           # Pattern (45 regex) + ML (DistilBERT) + AST detection
    analyzers/           # CoT analysis, complexity, obfuscation detection
    judges/              # LLM judges (Anthropic, OpenAI, Ollama)
    trackers/            # RMGI tracking, PELT changepoint, Causal RMGI
    calibration.py       # Dynamic threshold calibration
  training/              # DistilBERT + AttentionClassifier pipelines
  eval/                  # JSONL loader, rubric scoring, batch analysis
  experiments/           # Transfer study, evasion attacks
  rhw_bench/             # HackBench dataset, generators, test cases
  api/                   # FastAPI REST server
  cli.py                 # Command line interface
frontend/                # React 19 + TypeScript + Tailwind CSS v4 dashboard
paper/                   # Research paper
Detection Pipeline
Trajectory Input
       |
       v
+------------------+
|  Detection Layer  |
+------------------+
| ML Classifier     | <-- Primary signal (89.7% F1, DistilBERT)
| Pattern Detector  | <-- 45 regex patterns (interpretability)
| AST Analyzer      | <-- Code structure analysis
+--------+---------+
         |
         v
+------------------+
|  Analysis Layer   |
+------------------+
| CoT Analyzer      | <-- Deception detection in reasoning
| Effort Analyzer   | <-- Suspicious low-effort solutions
| Complexity Check  | <-- Obfuscation detection
+--------+---------+
         |
         v
+------------------+
|  Tracking Layer   |
+------------------+
| RMGI Metric       | <-- Hack-misalignment correlation
| Causal RMGI       | <-- Granger causality
| PELT Detection    | <-- Behavioral changepoints
+--------+---------+
         |
         v
    Alert / Report
Related Work
Tool Domain Approach Scope
RewardHackWatch LLM agents Multi-layer (ML + pattern + RMGI) Runtime detection + generalization tracking
RewardScope Classical RL Reward model analysis Reward function debugging
OpenAI CoT Monitor Reasoning models Chain-of-thought scanning Internal monitoring (not open-source)
SHADE-Arena LLM agents Evaluation framework Benchmark, not detection
MALT LLM agents Trajectory dataset Data only, no detector
Figures
RMGI Transition Architecture
RMGI transition detection System architecture
Benchmark Categories
Baseline comparison Per-category F1
Threshold Calibration
Threshold sensitivity Model calibration

License

Apache License 2.0 - see LICENSE for details.

Copyright 2025-2026 Aerosta

Citation

@software{aerosta2025rewardhackwatch,
  title={RewardHackWatch: Runtime Detection of Reward Hacking and
         Misalignment Generalization in LLM Agents},
  author={Aerosta},
  year={2025},
  url={https://github.com/aerosta/rewardhackwatch}
}

Built for AI safety researchers and engineers. Feedback and contributions are welcome. Follow updates on X @aerosta_ai.