-
Notifications
You must be signed in to change notification settings - Fork 69
API auth failure detection and auto-recovery #19
Copy link
Copy link
Open
Description
Problem
Two authentication_failed errors hit on Mar 21, 10 seconds apart. Root cause was accepted as 'transient' without evidence. If auth fails mid-job, the job fails silently — no notification to Frederik, no retry mechanism.
Auth failures during running jobs → status: failed, no alert.
Fix
-
Pattern detection: If 2+ auth failures occur within 1 hour, automatically:
- Create a GH issue with diagnostics (time, job IDs, error text)
- Send ntfy notification
- Stop queueing new Claude jobs until manual clearance
-
Auth preflight: Before every job, verify Claude API auth:
# Quick API health check curl -s -o /dev/null -w '%{http_code}' -H 'x-api-key: $ANTHROPIC_API_KEY' 'https://api.anthropic.com/v1/models' | grep -q '200'
-
Retry with backoff: On single auth failure mid-job, wait 30 seconds and retry once before marking failed.
-
Morning report: Include API auth status (last successful call, any recent failures)
Acceptance Criteria
- Auth failure count tracked per hour
- 2+ failures in 1h → GH issue auto-created
- Single failure → retry once with 30s backoff
- Morning report shows auth health
- Preflight check before every job
Context
From Jensen limitation audit (2026-03-22). Workstream: WS-001.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels