Cap health check backoff and add Lambda deadline guard by pcholakov · Pull Request #93 · restatedev/cdk

pcholakov · 2026-05-05T19:44:22Z

Summary

Closes #91. ServiceDeployer's exponential-backoff health check could run for ~35 minutes when the Restate admin endpoint was unreachable - well past Lambda's 15-minute hard limit. The Lambda would be killed mid-sleep without sending a CloudFormation response, leaving CFN to wait out its 60-minute step timeout before failing the stack.

The schedule is 2 ** attempt * 1_000 ms plus 0-2 s jitter, capped only by MAX_HEALTH_CHECK_ATTEMPTS = 10 (which was bumped from 5 in 1.5.0). The last two sleeps alone were ~8.5 min and ~17 min.
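The arithmetic behind those figures can be checked with a short sketch. It assumes attempts are numbered 1 through 10 and that the loop sleeps after every failed attempt, which is consistent with the ~8.5 min and ~17 min figures above but is not confirmed by the handler source; jitter and per-request time are left out, and the helper names are illustrative only.

```typescript
// Assumption: attempts run 1..MAX_HEALTH_CHECK_ATTEMPTS and each one is followed by a sleep.
const MAX_HEALTH_CHECK_ATTEMPTS = 10;

// Base delay for one attempt, without jitter; capMs is optional (no cap by default).
function baseDelayMs(attempt: number, capMs: number = Number.POSITIVE_INFINITY): number {
  return Math.min(2 ** attempt * 1_000, capMs);
}

// Sum of the base delays over all attempts.
function totalSleepMs(attempts: number, capMs?: number): number {
  let total = 0;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    total += baseDelayMs(attempt, capMs);
  }
  return total;
}

// Uncapped: 2 s + 4 s + ... + 1024 s = 2046 s, roughly 34 minutes before jitter.
const uncappedMs = totalSleepMs(MAX_HEALTH_CHECK_ATTEMPTS);

// With a 20 s cap: 2 + 4 + 8 + 16 + 6 * 20 = 150 s before jitter and request time.
const cappedMs = totalSleepMs(MAX_HEALTH_CHECK_ATTEMPTS, 20_000);

console.log({ uncappedMinutes: uncappedMs / 60_000, cappedSeconds: cappedMs / 1_000 });
```

The sleeps alone already exceed Lambda's 15-minute limit in the uncapped case; jitter and request time only push the total higher.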

Changes

Cap each iteration's backoff at 20 s so the worst-case loop stays under the deployer Lambda's default 5-minute timeout (~216 s vs 300 s budget) without changing that timeout or anyone's CFN template.
Add a deadline guard using context.getRemainingTimeInMillis() that aborts the loop with a clear error if the remaining budget can't cover the next request plus 60 s reserved for the registration loop, optional pruning, and the CFN response submission.
Expose healthCheckRetryAttempts and healthCheckMaxBackoff on ServiceRegistrationProps for users who need to tune the loop. The properties are emitted into the template only when the caller sets them explicitly, so existing CFN templates see no property diff.
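A minimal sketch of how the cap and the deadline guard could fit together, not the handler's actual code. `DeadlineContext` stands in for the Lambda context object, `REQUEST_TIMEOUT_MS` is a hypothetical per-request budget, and applying the cap after jitter is an assumption; only the 20 s cap, the 60 s reserve, and `getRemainingTimeInMillis()` come from this PR.

```typescript
// Minimal stand-in for the part of the Lambda context object used here.
interface DeadlineContext {
  getRemainingTimeInMillis(): number;
}

const RESERVE_MS = 60_000; // reserve named in the PR description
const MAX_BACKOFF_MS = 20_000; // per-iteration cap named in the PR description
const REQUEST_TIMEOUT_MS = 5_000; // hypothetical per-request budget, not from the PR

// Exponential delay plus jitter, clamped to the cap (assumption: cap applies after jitter).
function backoffMs(attempt: number, jitterMs: number, maxBackoffMs: number = MAX_BACKOFF_MS): number {
  return Math.min(2 ** attempt * 1_000 + jitterMs, maxBackoffMs);
}

// True when the remaining Lambda time covers one more request plus the reserve.
function hasBudgetForNextRequest(
  context: DeadlineContext,
  requestTimeoutMs: number = REQUEST_TIMEOUT_MS,
): boolean {
  return context.getRemainingTimeInMillis() >= requestTimeoutMs + RESERVE_MS;
}

// Polls isHealthy until it succeeds, the attempts run out, or the deadline guard trips.
async function waitUntilHealthy(
  context: DeadlineContext,
  isHealthy: () => Promise<boolean>,
  attempts: number = 10,
): Promise<void> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    if (!hasBudgetForNextRequest(context)) {
      throw new Error(`Health check aborted before attempt ${attempt}: not enough Lambda time left`);
    }
    if (await isHealthy()) return;
    if (attempt < attempts) {
      const delay = backoffMs(attempt, Math.random() * 2_000);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error(`Health check failed after ${attempts} attempts`);
}
```

Throwing from the guard, rather than letting the runtime kill the function, is what gives the handler a chance to report a failure to CloudFormation itself.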

Test plan

npm run build: clean
npm run test: all 11 pass (1 snapshot updated for the registration handler asset hash, expected since the Lambda code changed)
New tests verify the new props are forwarded when set and absent when unset
Existing snapshots unchanged (template stability)
Manual: deploy with an unreachable adminUrl and verify the deployer Lambda errors inside its own timeout instead of CFN sitting at the resource for 60 minutes
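
The forwarded-when-set, absent-when-unset behaviour that the new tests check can be illustrated with a small sketch. The `HealthCheckOptions` shape, the numeric types, and the output key names are hypothetical; this PR does not show the real prop types or the custom resource's property names.

```typescript
// Hypothetical shape: the real ServiceRegistrationProps types are not shown in this PR.
interface HealthCheckOptions {
  healthCheckRetryAttempts?: number;
  healthCheckMaxBackoff?: number; // assumed here to be a plain number of seconds
}

// Copies only explicitly set options into the resource properties, so a caller
// who sets nothing produces the same properties object as before the change.
function toResourceProperties(props: HealthCheckOptions): Record<string, number> {
  const out: Record<string, number> = {};
  if (props.healthCheckRetryAttempts !== undefined) {
    out.HealthCheckRetryAttempts = props.healthCheckRetryAttempts; // hypothetical key name
  }
  if (props.healthCheckMaxBackoff !== undefined) {
    out.HealthCheckMaxBackoff = props.healthCheckMaxBackoff; // hypothetical key name
  }
  return out;
}

console.log(toResourceProperties({})); // no health check keys
console.log(toResourceProperties({ healthCheckRetryAttempts: 5 })); // only the retry key
```

Leaving unset options out of the properties object, instead of writing default values into it, is what keeps previously synthesized templates byte-for-byte stable.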

Notes

The registration loop (MAX_REGISTRATION_ATTEMPTS = 3, max ~56 s of sleeps) is unchanged - it's already short and the new 60 s reserve in the deadline guard accounts for it.

The exponential backoff in the ServiceDeployer custom resource handler grew without a cap, so a sustained admin reachability failure could keep the loop running for ~35 minutes - well past Lambda's 15-minute hard limit. The Lambda was killed mid-sleep without writing a CloudFormation response, leaving the stack to wait for its 60-minute step timeout.

Cap each iteration's backoff at 20 seconds so the worst case stays under the deployer Lambda's default 5-minute timeout, and add a deadline guard based on context.getRemainingTimeInMillis() that aborts the loop early if the remaining budget cannot cover the next request plus a reserve for registration retries, optional pruning, and the CFN response submission.

Expose healthCheckRetryAttempts and healthCheckMaxBackoff on ServiceRegistrationProps for users who need to tune the loop. The properties are emitted only when explicitly set, so existing CloudFormation templates do not see a property diff.

pcholakov force-pushed the pavel/healthcheck-deadline-guard branch from c03a7ef to a35daa6 Compare May 6, 2026 17:46

pcholakov marked this pull request as ready for review May 6, 2026 17:53

pcholakov force-pushed the pavel/healthcheck-deadline-guard branch 3 times, most recently from b4aa5a4 to 799cc37 Compare May 6, 2026 18:08

pcholakov force-pushed the pavel/healthcheck-deadline-guard branch from 799cc37 to 34f58ef Compare May 6, 2026 19:14

pcholakov merged commit 79d00ec into main May 6, 2026
2 checks passed

github-actions Bot locked and limited conversation to collaborators May 6, 2026

pcholakov merged 1 commit into main from pavel/healthcheck-deadline-guard
