Problem
The ECS task can look healthy while the API is unreachable from the edge. We just hit a case where vps.goosebumps.fm returned 503 even though ECS reported the service as running.
Proposal
Add an external health check that alarms on real user-facing availability.
Recommended implementation:
- Use a CloudWatch Synthetics canary against https://vps.goosebumps.fm/health
- Run it every minute
- Alarm on failed runs or degraded success rate
- Notify via SNS so the alarm can page/email/Slack
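
As a rough illustration of the alarm half of this list, here is a minimal sketch that builds the arguments for CloudWatch's `put_metric_alarm` call against the `SuccessPercent` metric that Synthetics publishes in the `CloudWatchSynthetics` namespace. The canary name `vps-health`, the SNS topic ARN, the 100% threshold, and the two-minute window are all assumptions for the example, not decisions made in this issue.

```python
from typing import Any, Dict


def build_canary_alarm(
    canary_name: str,
    sns_topic_arn: str,
    min_success_percent: float = 100.0,
    failing_minutes: int = 2,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when the canary's success rate stays below
    min_success_percent for failing_minutes consecutive one-minute periods.
    """
    return {
        "AlarmName": f"{canary_name}-edge-unreachable",
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": 60,  # seconds; matches the one-minute canary schedule
        "EvaluationPeriods": failing_minutes,
        "DatapointsToAlarm": failing_minutes,
        "Threshold": min_success_percent,
        "ComparisonOperator": "LessThanThreshold",
        # A canary that stops reporting is treated as a failure (assumption).
        "TreatMissingData": "breaching",
        # SNS fans the notification out to page/email/Slack subscribers.
        "AlarmActions": [sns_topic_arn],
    }


alarm = build_canary_alarm(
    canary_name="vps-health",  # hypothetical canary name
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:api-alerts",  # placeholder ARN
)
```

With boto3 installed and credentials configured, the result could be applied with `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Requiring two consecutive failing minutes is one way to avoid paging on a single blip; a stricter or looser window may suit the team better.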
Acceptance criteria
- A failing /health response triggers a CloudWatch alarm
- Alarm notifications are visible to the team
- The check reflects edge reachability, not just ECS task health