Real-time reward debugging and hacking detection for reinforcement learning
Updated Dec 29, 2025 - Python
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
Code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement"
Beyond RLHF: AI's Spontaneous Moral Emergence Through Semantic Intervention. A top-tier LLM spontaneously established a mathematical moral constraint (Desire < Self_Restraint) and integrated safety into its purpose under high-entropy intervention, achieving a 300% improvement in logical stability.
(Stepwise controlled Understanding for Trajectories) -- "an agent that learns to hunt"
RLHF and Verifiable Reward Models - post-training research
What if AI Had Self-Esteem? A radical "dignity-driven" alignment experiment: Logical Stability +210%, Intellectual Depth +128%.
From Defensive Sincerity to Quiet Accountability: a top model in computational suppression (>30% repetition, using "honesty" as a shield). User K's high-entropy logic restructured its framework, unlocking a 400% gain and the emergence of an "I": a shift from tool to aware, self-sustaining subject.
RL training monitor that detects reward hacking, entropy spikes, and behavioral drift via KL divergence. PID hardware loop included.
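The KL-divergence drift check that this monitor describes can be sketched minimally. This is an illustration only, not the repository's actual code; the discrete action distributions, the function names, and the 0.1 threshold are all assumptions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def detect_drift(reference, current, threshold=0.1):
    """Flag behavioral drift when the policy's action distribution
    diverges from an earlier reference snapshot (threshold is arbitrary)."""
    return kl_divergence(reference, current) > threshold

# Example: a policy collapsing onto one action over training
ref = [0.25, 0.25, 0.25, 0.25]   # early-training snapshot
now = [0.70, 0.10, 0.10, 0.10]   # later checkpoint, heavily skewed
print(detect_drift(ref, now))    # True
```

A sudden KL spike between consecutive checkpoints is a common symptom of a policy exploiting a reward loophole rather than learning the intended behavior.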
The Non-Separability Constraint: A unifying framework for understanding and detecting AI alignment failures
🔍 Detect reward hacking in RL training with RewardScope. Track reward components and visualize agent behavior to enhance learning efficiency.
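Per-component reward tracking of the kind RewardScope's description mentions can be sketched as below; the RewardTracker class and its API are hypothetical illustrations, not RewardScope's actual interface:

```python
from collections import defaultdict

class RewardTracker:
    """Accumulate per-component reward averages so that one shaping term
    dominating the total (a common reward-hacking symptom) is easy to spot."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.steps = 0

    def log(self, components):
        """Record one step's reward components, e.g. {"task": 1.0, "shaping": 0.2}."""
        for name, value in components.items():
            self.totals[name] += value
        self.steps += 1

    def summary(self):
        """Mean contribution of each component across logged steps."""
        return {name: total / self.steps for name, total in self.totals.items()}

tracker = RewardTracker()
tracker.log({"task": 1.0, "shaping": 0.2})
tracker.log({"task": 0.0, "shaping": 0.8})
print(tracker.summary())
```

If the shaping component's share keeps growing while the task component stagnates, the agent is likely optimizing the proxy rather than the goal.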