Cap health check backoff and add Lambda deadline guard #93
Merged
The exponential backoff in the ServiceDeployer custom resource handler grew without a cap, so a sustained admin reachability failure could keep the loop running for ~35 minutes - well past Lambda's 15-minute hard limit. The Lambda was killed mid-sleep without writing a CloudFormation response, leaving the stack to wait for its 60-minute step timeout.
Cap each iteration's backoff at 20 seconds so the worst case stays under the deployer Lambda's default 5-minute timeout, and add a deadline guard based on context.getRemainingTimeInMillis() that aborts the loop early if the remaining budget cannot cover the next request plus reserve for registration retries, optional pruning, and the CFN response submission.
Expose healthCheckRetryAttempts and healthCheckMaxBackoff on ServiceRegistrationProps for users who need to tune the loop. Defaults are emitted only when explicitly set so existing CloudFormation templates do not see a property diff.
Summary
Closes #91. ServiceDeployer's exponential-backoff health check could run for ~35 minutes when the Restate admin endpoint was unreachable - well past Lambda's 15-minute hard limit. The Lambda would be killed mid-sleep without sending a CloudFormation response, leaving CFN to wait its 60-minute step timeout before failing the stack.
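As a rough check of the ~35-minute figure, the sketch below sums the sleeps of an uncapped doubling schedule and compares it with a 20-second cap. It is not the handler's code: the function names are hypothetical, and it assumes attempts are numbered 1 through 10, that every attempt is followed by a sleep, and it ignores the 0-2 s jitter and the time spent on each request.

```typescript
// Hypothetical helper names; only the base delay, attempt count, and cap
// values are taken from the PR description.
const BASE_MS = 1_000;
const ATTEMPTS = 10;

// Sleep after a given attempt, excluding jitter. With no cap the delay
// doubles on every attempt.
function backoffMs(attempt: number, capMs: number = Infinity): number {
  return Math.min(2 ** attempt * BASE_MS, capMs);
}

// Total time spent sleeping across a full run (assumes attempts 1..attempts).
function totalSleepMs(attempts: number, capMs: number = Infinity): number {
  let total = 0;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    total += backoffMs(attempt, capMs);
  }
  return total;
}

const uncappedMs = totalSleepMs(ATTEMPTS); // 2_046_000 ms, about 34 min
const cappedMs = totalSleepMs(ATTEMPTS, 20_000); // 150_000 ms, 2.5 min

console.log(`uncapped: ${(uncappedMs / 60_000).toFixed(1)} min`);
console.log(`capped:   ${(cappedMs / 60_000).toFixed(1)} min`);
```

Under these assumptions the uncapped run sleeps for about 34 minutes before jitter, while the capped run sleeps for 2.5 minutes, which leaves room under a 5-minute Lambda timeout even after adding up to 2 s of jitter per attempt.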
The schedule is 2 ** attempt * 1_000 ms plus 0-2 s jitter, capped only by MAX_HEALTH_CHECK_ATTEMPTS = 10 (which was bumped from 5 in 1.5.0). The last two sleeps alone were ~8.5 min and ~17 min.
Changes
- Cap each iteration's backoff at 20 seconds so the worst case stays under the deployer Lambda's default 5-minute timeout.
- Add a deadline guard based on context.getRemainingTimeInMillis() that aborts the loop with a clear error if the remaining budget can't cover the next request plus 60 s reserved for the registration loop, optional pruning, and the CFN response submission.
- Expose healthCheckRetryAttempts and healthCheckMaxBackoff on ServiceRegistrationProps for users who need to tune the loop. Defaults are emitted only when the caller sets them explicitly, so existing CFN templates see no property diff.
Test plan
- npm run build: clean
- npm run test: all 11 pass (1 snapshot updated for the registration handler asset hash, expected since the Lambda code changed)
- [...] adminUrl and verify the deployer Lambda errors inside its own timeout instead of CFN sitting at the resource for 60 minutes
Notes
The registration loop (MAX_REGISTRATION_ATTEMPTS = 3, max ~56 s of sleeps) is unchanged - it's already short and the new 60 s reserve in the deadline guard accounts for it.
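The deadline guard described in this PR could look roughly like the sketch below. Only getRemainingTimeInMillis() (the standard Lambda context method) and the 60 s reserve come from the description; the interface, function names, and the per-request timeout parameter are hypothetical and chosen for illustration.

```typescript
// Minimal slice of the Lambda context object; getRemainingTimeInMillis()
// is the real method, the interface name is hypothetical.
interface DeadlineContext {
  getRemainingTimeInMillis(): number;
}

// 60 s held back for the registration loop, optional pruning, and the
// CloudFormation response submission (value from the PR description).
const RESERVE_MS = 60_000;

// Returns true when the remaining budget covers one more health check
// request plus the reserve. requestTimeoutMs is an assumed parameter.
function hasBudgetForNextRequest(
  context: DeadlineContext,
  requestTimeoutMs: number,
  reserveMs: number = RESERVE_MS,
): boolean {
  return context.getRemainingTimeInMillis() >= requestTimeoutMs + reserveMs;
}

// Throws a descriptive error so the caller can still report a failure
// response before the Lambda is killed.
function assertBudgetForNextRequest(
  context: DeadlineContext,
  requestTimeoutMs: number,
): void {
  if (!hasBudgetForNextRequest(context, requestTimeoutMs)) {
    const remaining = context.getRemainingTimeInMillis();
    throw new Error(
      `Health check aborted: ${remaining} ms remaining is less than ` +
        `${requestTimeoutMs} ms request timeout + ${RESERVE_MS} ms reserve`,
    );
  }
}
```

A retry loop would call assertBudgetForNextRequest before each attempt. Failing early with an error, rather than sleeping into the Lambda timeout, is what lets the handler still send a response to CloudFormation.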