Skip to content

API auth failure detection and auto-recovery #19

@FrederikHandberg

Description

@FrederikHandberg

Problem

Two authentication_failed errors hit on Mar 21, 10 seconds apart. Root cause was accepted as 'transient' without evidence. If auth fails mid-job, the job fails silently — no notification to Frederik, no retry mechanism.

Auth failures during running jobs → status: failed, no alert.

Fix

  1. Pattern detection: If 2+ auth failures occur within 1 hour, automatically:

    • Create a GH issue with diagnostics (time, job IDs, error text)
    • Send ntfy notification
    • Stop queueing new Claude jobs until manual clearance
  2. Auth preflight: Before every job, verify Claude API auth:

    # Quick API health check
    curl -s -o /dev/null -w '%{http_code}'      -H 'x-api-key: $ANTHROPIC_API_KEY'      'https://api.anthropic.com/v1/models' | grep -q '200'
  3. Retry with backoff: On single auth failure mid-job, wait 30 seconds and retry once before marking failed.

  4. Morning report: Include API auth status (last successful call, any recent failures)

Acceptance Criteria

  • Auth failure count tracked per hour
  • 2+ failures in 1h → GH issue auto-created
  • Single failure → retry once with 30s backoff
  • Morning report shows auth health
  • Preflight check before every job

Context

From Jensen limitation audit (2026-03-22). Workstream: WS-001.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions