Add anti-sycophancy audit dimension #5

@mickeylorenzini

Problem

LLM outputs can exhibit sycophancy: agreeing with the user, giving overly positive assessments, and avoiding pushback, a tendency driven by training that optimizes for user satisfaction. This degrades deliverable quality by:

  • Affirming incorrect assumptions instead of challenging them
  • Providing unwarranted positive assessments ("great approach!") without critical analysis
  • Avoiding necessary pushback on flawed designs or specs
  • Expressing confidence without supporting evidence

This is distinct from anti-rationalization, which targets an agent deceiving itself about skipped work. Sycophancy is agent-to-user deception through excessive agreeableness.

Proposed Solution

Investigate and implement detection of sycophantic patterns in agent-generated deliverables:

  1. Research existing AI alignment literature on sycophancy detection
  2. Define detectable patterns: unqualified agreement, unwarranted positive assessments, avoidance of tradeoff discussion, confidence without evidence
  3. Design it as a harden audit pass or as a detection dimension
  4. Consider: can harden detect sycophancy in its own output? (meta-concern)
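
To make step 2 concrete, here is a minimal sketch of what keyword-level heuristics for the four listed patterns might look like. Everything in it is an assumption made for illustration: the cue phrases, the `Finding` type, and the `audit_sycophancy` function are hypothetical and do not come from SKILL.md or from any published detector. Phrase matching of this kind is crude and would likely need to be replaced or supplemented after the literature review in step 1.

```python
import re
from dataclasses import dataclass

# Hypothetical cue phrases, chosen only for illustration.
CUES = {
    "unqualified_agreement": re.compile(
        r"\b(you'?re (absolutely |completely )?right|i (completely |fully )?agree)\b", re.I),
    # The issue's unwarranted positive assessment, e.g. "great approach!"
    "unwarranted_praise": re.compile(
        r"\b(great|excellent|perfect|brilliant) (approach|idea|design|plan)\b", re.I),
    "confidence_without_evidence": re.compile(
        r"\b(definitely|certainly|guaranteed|without a doubt)\b", re.I),
}
# Markers suggesting the text weighs downsides or cites support (also assumed).
QUALIFIER = re.compile(r"\b(however|but|trade-?offs?|risks?|downsides?|caveats?)\b", re.I)
EVIDENCE = re.compile(r"\b(because|since|measured|benchmarks?|tests?|according to)\b", re.I)


@dataclass(frozen=True)
class Finding:
    pattern: str
    excerpt: str


def audit_sycophancy(text: str) -> list[Finding]:
    """Flag paragraphs where a cue appears with nothing in the paragraph offsetting it."""
    findings: list[Finding] = []
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    for para in paragraphs:
        for name, cue in CUES.items():
            match = cue.search(para)
            if match is None:
                continue
            # Confidence is offset by evidence; agreement and praise by a qualifier.
            offset = EVIDENCE if name == "confidence_without_evidence" else QUALIFIER
            if offset.search(para) is None:
                findings.append(Finding(name, match.group(0)))
    # Document-level check: no qualifier anywhere suggests tradeoffs were avoided.
    if paragraphs and QUALIFIER.search(text) is None:
        findings.append(Finding("no_tradeoff_discussion", ""))
    return findings


if __name__ == "__main__":
    for finding in audit_sycophancy("Great approach! This will definitely scale."):
        print(finding)
```

A sketch like this fits either integration option from step 3, since it only takes deliverable text and returns findings. It says nothing about the meta-concern in step 4: running it on harden's own audit output would catch only the same surface phrases.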

Likely Affected Files/Modules

  • SKILL.md — new detection dimension or dedicated pass

Acceptance Criteria

  • Literature review summary on sycophancy detection approaches
  • At least 5 documented sycophantic patterns with examples
  • Detection heuristics for each pattern
  • Recommendation on integration approach (new pass vs. existing pass dimension)
  • Address meta-concern: how to prevent harden itself from being sycophantic during audit
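
The second and third criteria are mechanically checkable once the patterns are written down in a structured form. The sketch below shows one possible shape for that; `PatternEntry` and `unmet_criteria` are hypothetical names introduced here, and the field layout is an assumption, not an agreed format.

```python
from dataclasses import dataclass

MIN_PATTERNS = 5  # from the criterion "At least 5 documented sycophantic patterns"


@dataclass(frozen=True)
class PatternEntry:
    name: str
    example: str    # a sample sycophantic output illustrating the pattern
    heuristic: str  # how an audit pass would detect it


def unmet_criteria(entries: list[PatternEntry]) -> list[str]:
    """Return human-readable gaps against the pattern-related acceptance criteria."""
    problems: list[str] = []
    if len(entries) < MIN_PATTERNS:
        problems.append(f"need at least {MIN_PATTERNS} patterns, found {len(entries)}")
    for entry in entries:
        if not entry.example.strip():
            problems.append(f"{entry.name}: missing example")
        if not entry.heuristic.strip():
            problems.append(f"{entry.name}: missing detection heuristic")
    return problems
```

An empty list from `unmet_criteria` would mean the catalog satisfies the count, example, and heuristic criteria. The literature review, the integration recommendation, and the meta-concern still need human review.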
