
Cap health check backoff and add Lambda deadline guard#93

Merged
pcholakov merged 1 commit into main from pavel/healthcheck-deadline-guard
May 6, 2026

@pcholakov
Collaborator

Summary

Closes #91. ServiceDeployer's exponential-backoff health check could run for ~35 minutes when the Restate admin endpoint was unreachable, well past Lambda's 15-minute hard limit. The Lambda would be killed mid-sleep before it could send a CloudFormation response, leaving CFN to wait out its 60-minute step timeout before failing the stack.

The schedule is 2 ** attempt * 1_000 ms plus 0-2 s jitter per attempt, bounded only by the attempt count, MAX_HEALTH_CHECK_ATTEMPTS = 10 (bumped from 5 in 1.5.0). The last two sleeps alone were ~8.5 min and ~17 min.
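
The ~35-minute figure can be checked by summing the schedule. A minimal sketch, assuming attempts are numbered 1 through 10 and each one is followed by a sleep (the handler's exact indexing is not shown here); uncappedBackoffMs is a hypothetical helper and jitter is left out:

```typescript
// Hypothetical helper: base delay for a given attempt, without jitter.
// Mirrors the formula quoted above: 2 ** attempt * 1_000 ms.
function uncappedBackoffMs(attempt: number): number {
  return 2 ** attempt * 1_000;
}

const MAX_HEALTH_CHECK_ATTEMPTS = 10;

// Assumption: attempts are 1-based and each one is followed by a sleep.
const sleepsMs: number[] = [];
for (let attempt = 1; attempt <= MAX_HEALTH_CHECK_ATTEMPTS; attempt++) {
  sleepsMs.push(uncappedBackoffMs(attempt));
}

const totalMs = sleepsMs.reduce((sum, ms) => sum + ms, 0);

// Last two sleeps and the total, converted to minutes.
console.log(sleepsMs.slice(-2).map((ms) => ms / 60_000));
console.log(totalMs / 60_000);
```

Under these assumptions the last two sleeps are 512 s and 1,024 s, and the total is 2,046 s (about 34 minutes) before jitter is added.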

Changes

  • Cap each iteration's backoff at 20 s so the worst-case loop stays under the deployer Lambda's default 5-minute timeout (~216 s vs 300 s budget) without changing the deployer Lambda's timeout or anyone's CFN template.
  • Add a deadline guard using context.getRemainingTimeInMillis() that aborts the loop with a clear error if the remaining budget can't cover the next request plus 60 s reserved for the registration loop, optional pruning, and the CFN response submission.
  • Expose healthCheckRetryAttempts and healthCheckMaxBackoff on ServiceRegistrationProps for users who need to tune the loop. The properties are emitted only when the caller sets them explicitly, so existing CFN templates see no property diff.
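
The cap and the deadline guard described in the list above can be sketched roughly as follows. This is an illustration, not the handler's actual code: the constant names, the DeadlineContext interface, the requestTimeoutMs parameter, and the choice to apply the cap after adding jitter are all assumptions. Only the 20 s cap, the 60 s reserve, and context.getRemainingTimeInMillis() come from the description.

```typescript
// Values taken from the description above; the constant names are hypothetical.
const MAX_BACKOFF_MS = 20_000; // per-iteration cap
const RESERVE_MS = 60_000; // reserved for registration, pruning, and the CFN response

// Minimal stand-in for the part of the Lambda context object used here.
interface DeadlineContext {
  getRemainingTimeInMillis(): number;
}

// Exponential backoff with 0-2 s jitter, capped at MAX_BACKOFF_MS.
// Assumption: the cap applies to the sum of base delay and jitter.
function cappedBackoffMs(attempt: number, jitterMs: number = Math.random() * 2_000): number {
  return Math.min(2 ** attempt * 1_000 + jitterMs, MAX_BACKOFF_MS);
}

// Throws if the remaining Lambda budget cannot cover the next request plus the reserve.
// requestTimeoutMs is a hypothetical parameter for the per-request timeout.
function assertDeadline(context: DeadlineContext, requestTimeoutMs: number): void {
  const remainingMs = context.getRemainingTimeInMillis();
  const neededMs = requestTimeoutMs + RESERVE_MS;
  if (remainingMs < neededMs) {
    throw new Error(
      `Health check aborted: ${remainingMs} ms remaining, ${neededMs} ms needed ` +
        `for the next request and cleanup`,
    );
  }
}

// Usage with a fake context: a fresh 5-minute budget passes the guard.
const fresh: DeadlineContext = { getRemainingTimeInMillis: () => 300_000 };
assertDeadline(fresh, 10_000);
console.log(cappedBackoffMs(3, 0), cappedBackoffMs(10, 0));
```

Throwing from the guard lets the surrounding handler catch the error and still submit a failure response to CloudFormation while the Lambda has time left, which is the point of the reserve.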

Test plan

  • npm run build: clean
  • npm run test: all 11 pass (1 snapshot updated for the registration handler asset hash, expected since the Lambda code changed)
  • New tests verify the new props are forwarded when set and absent when unset
  • Existing snapshots unchanged (template stability)
  • Manual: deploy with an unreachable adminUrl and verify the deployer Lambda errors inside its own timeout instead of CFN sitting at the resource for 60 minutes
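
The "forwarded when set and absent when unset" behaviour in the test plan can be illustrated with a small sketch. The HealthCheckProps shape, the toResourceProperties helper, the base property name, and the use of plain numbers are hypothetical; only the two property names come from this PR, and the real construct may use different types and property keys.

```typescript
// Hypothetical shape; only the two property names come from the PR description.
interface HealthCheckProps {
  healthCheckRetryAttempts?: number;
  healthCheckMaxBackoff?: number; // unit and type are assumptions
}

// Builds the custom resource properties, adding the health check keys only
// when the caller set them, so an untouched stack produces the same output.
function toResourceProperties(
  base: Record<string, unknown>,
  props: HealthCheckProps,
): Record<string, unknown> {
  return {
    ...base,
    ...(props.healthCheckRetryAttempts !== undefined
      ? { healthCheckRetryAttempts: props.healthCheckRetryAttempts }
      : {}),
    ...(props.healthCheckMaxBackoff !== undefined
      ? { healthCheckMaxBackoff: props.healthCheckMaxBackoff }
      : {}),
  };
}

// "adminUrl" here is only a placeholder for whatever the resource already carries.
const base = { adminUrl: "https://restate.example.internal:9070" };

const unset = toResourceProperties(base, {});
const tuned = toResourceProperties(base, { healthCheckRetryAttempts: 5 });

console.log(Object.keys(unset)); // only the base keys
console.log(Object.keys(tuned)); // base keys plus healthCheckRetryAttempts
```

Omitting the keys entirely, rather than writing default values, is what keeps previously synthesized templates byte-for-byte stable.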

Notes

The registration loop (MAX_REGISTRATION_ATTEMPTS = 3, max ~56 s of sleeps) is unchanged - it's already short and the new 60 s reserve in the deadline guard accounts for it.

@pcholakov pcholakov force-pushed the pavel/healthcheck-deadline-guard branch from c03a7ef to a35daa6 Compare May 6, 2026 17:46
@pcholakov pcholakov marked this pull request as ready for review May 6, 2026 17:53
@pcholakov pcholakov force-pushed the pavel/healthcheck-deadline-guard branch 3 times, most recently from b4aa5a4 to 799cc37 Compare May 6, 2026 18:08
The exponential backoff in the ServiceDeployer custom resource handler
grew without a cap, so a sustained admin reachability failure could keep
the loop running for ~35 minutes - well past Lambda's 15-minute hard
limit. The Lambda was killed mid-sleep without writing a CloudFormation
response, leaving the stack to wait for its 60-minute step timeout.

Cap each iteration's backoff at 20 seconds so the worst case stays under
the deployer Lambda's default 5-minute timeout, and add a deadline guard
based on context.getRemainingTimeInMillis() that aborts the loop early
if the remaining budget cannot cover the next request plus reserve for
registration retries, optional pruning, and the CFN response submission.

Expose healthCheckRetryAttempts and healthCheckMaxBackoff on
ServiceRegistrationProps for users who need to tune the loop. The
properties are emitted only when explicitly set, so existing
CloudFormation templates do not see a property diff.
@pcholakov pcholakov force-pushed the pavel/healthcheck-deadline-guard branch from 799cc37 to 34f58ef Compare May 6, 2026 19:14
@pcholakov pcholakov merged commit 79d00ec into main May 6, 2026
2 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 6, 2026


Development

Successfully merging this pull request may close these issues.

Expose health check retry configuration on ServiceDeployer
