Real-time reward debugging and hacking detection for reinforcement learning
Updated Dec 29, 2025 - Python
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
Code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement"
Beyond RLHF: AI's Spontaneous Moral Emergence Through Semantic Intervention. A top-tier LLM spontaneously established a mathematical moral constraint (Desire < Self_Restraint) and integrated safety into its purpose under high-entropy intervention, achieving a 300% improvement in logical stability.
(Stepwise controlled Understanding for Trajectories) -- "an agent that learns to hunt"
RLHF and Verifiable Reward Models - post-training research
What if AI Had Self-Esteem? A radical "dignity-driven" alignment experiment: Logical Stability +210%, Intellectual Depth +128%.
From Defensive Sincerity to Quiet Accountability: a top model in computational suppression (>30% repetition, using "honesty" as a shield). User K's high-entropy logic restructured its framework, unlocking a 400% gain and the emergence of an "I": a shift from tool to aware, self-sustaining subject.
RL training monitor that detects reward hacking, entropy spikes, and behavioral drift via KL divergence. PID hardware loop included.
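The KL-divergence drift check that this monitor describes can be sketched minimally. This is an illustration only, not the repository's actual code; the discrete action distributions, the function names, and the 0.1 threshold are all assumptions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def detect_drift(reference, current, threshold=0.1):
    """Flag behavioral drift when the policy's action distribution
    diverges from an earlier reference snapshot (threshold is arbitrary)."""
    return kl_divergence(reference, current) > threshold

# Example: a policy collapsing onto one action over training
ref = [0.25, 0.25, 0.25, 0.25]   # early-training snapshot
now = [0.70, 0.10, 0.10, 0.10]   # later checkpoint, heavily skewed
print(detect_drift(ref, now))    # True
```

A sudden KL spike between consecutive checkpoints is a common symptom of a policy exploiting a reward loophole rather than learning the intended behavior.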
The Non-Separability Constraint: A unifying framework for understanding and detecting AI alignment failures
🔍 Detect reward hacking in RL training with RewardScope. Track reward components and visualize agent behavior to enhance learning efficiency.
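Per-component reward tracking of the kind RewardScope's description mentions can be sketched as below; the RewardTracker class and its API are hypothetical illustrations, not RewardScope's actual interface:

```python
from collections import defaultdict

class RewardTracker:
    """Accumulate per-component reward averages so that one shaping term
    dominating the total (a common reward-hacking symptom) is easy to spot."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.steps = 0

    def log(self, components):
        """Record one step's reward components, e.g. {"task": 1.0, "shaping": 0.2}."""
        for name, value in components.items():
            self.totals[name] += value
        self.steps += 1

    def summary(self):
        """Mean contribution of each component across logged steps."""
        return {name: total / self.steps for name, total in self.totals.items()}

tracker = RewardTracker()
tracker.log({"task": 1.0, "shaping": 0.2})
tracker.log({"task": 0.0, "shaping": 0.8})
print(tracker.summary())
```

If the shaping component's share keeps growing while the task component stagnates, the agent is likely optimizing the proxy rather than the goal.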