[CRCR] Add EventBridge sweeper configuration and interval variable #806
Draft
can-gaa-hou wants to merge 1 commit into
atalman (Contributor) reviewed on Jun 23, 2026 and left a comment:
Potential bug with rate() singular/plural grammar
If sweeper_interval_minutes is set to 1, the schedule expression becomes rate(1 minutes), which is invalid: AWS requires rate(1 minute) (singular). Consider either adding a validation constraint:

    variable "sweeper_interval_minutes" {
      ...
      validation {
        condition     = var.sweeper_interval_minutes >= 2
        error_message = "Must be >= 2 to avoid rate() singular/plural grammar issue."
      }
    }

Or handling it with a conditional in the expression (quotes nested inside ${ } are not escaped in HCL):

    schedule_expression = "rate(${var.sweeper_interval_minutes} ${var.sweeper_interval_minutes == 1 ? "minute" : "minutes"})"
Overview
Adds an EventBridge scheduled rule that triggers the existing CRCR callback lambda every 10 minutes (configurable via the new sweeper_interval_minutes variable) to drive the active "zombie job" sweeper.
- aws_cloudwatch_event_rule.sweeper: rate(10 minutes) schedule.
- aws_cloudwatch_event_target.sweeper: targets the callback lambda with a fixed payload {"source": "crcr.sweeper"}, which the handler routes on to enter the cleanup branch.
- aws_lambda_permission.sweeper_invoke: allows events.amazonaws.com to invoke the callback lambda, scoped to this rule's ARN.
- sweeper_interval_minutes variable (default 10).
Background
Redis keyspace expiration events aren't well-suited for stateless Lambdas (they are delivered over Pub/Sub, which needs a long-lived subscriber), so we use an EventBridge scheduled rule plus a Redis ZSET (scored by expected timeout) instead. In-progress jobs are tracked in the ZSET and removed on normal completion; anything left past its expiry is a zombie. On each tick the sweeper reaps zombies: it marks them as timed out in HUD and on GitHub, then clears them from the cache.
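To make the sweep concrete, here is a minimal Python sketch of what one tick might do. It is not the handler from pytorch/test-infra#8198, which may be structured differently. The key name `crcr:in_progress`, the `sweep` and `route` functions, and the `mark_timed_out` callback (standing in for the HUD and GitHub updates) are all hypothetical. `InMemoryZSet` is a stand-in that mimics the redis-py `zadd`, `zrangebyscore`, and `zrem` calls so the example runs without a Redis server.

```python
import time
from typing import Callable, Dict, List, Optional

ZSET_KEY = "crcr:in_progress"  # hypothetical key name, not taken from the PR


class InMemoryZSet:
    """Stand-in for the few Redis sorted-set commands the sweep needs.

    It holds a single sorted set and ignores the key argument; a real
    deployment would pass a redis-py client instead.
    """

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}

    def zadd(self, key: str, mapping: Dict[str, float]) -> None:
        self._scores.update(mapping)

    def zrangebyscore(self, key: str, lo: float, hi: float) -> List[str]:
        # Like Redis: members with lo <= score <= hi, ordered by score.
        hits = [(s, m) for m, s in self._scores.items() if lo <= s <= hi]
        return [m for _, m in sorted(hits)]

    def zrem(self, key: str, *members: str) -> int:
        removed = 0
        for member in members:
            if self._scores.pop(member, None) is not None:
                removed += 1
        return removed


def sweep(store, mark_timed_out: Callable[[str], None],
          now: Optional[float] = None) -> List[str]:
    """Reap every job whose expected-timeout score is at or before `now`."""
    now = time.time() if now is None else now
    zombies = store.zrangebyscore(ZSET_KEY, float("-inf"), now)
    for job_id in zombies:
        mark_timed_out(job_id)        # assumed hook: report timeout to HUD/GitHub
        store.zrem(ZSET_KEY, job_id)  # then clear the job from the cache
    return zombies


def route(event: dict, store, mark_timed_out: Callable[[str], None],
          now: Optional[float] = None) -> dict:
    """Enter the cleanup branch only for the EventBridge sweeper payload."""
    if event.get("source") == "crcr.sweeper":
        return {"swept": sweep(store, mark_timed_out, now)}
    return {"swept": None}  # normal callback handling would go here
```

Jobs that finish normally would be removed with `zrem` by the regular callback path, so only entries that outlive their score are ever seen by `sweep`.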
Dependency / deploy order
This is the infra half of the feature. The callback lambda's routing + cleanup logic lives in pytorch/test-infra#8198 (if event.get("source") == "crcr.sweeper"). The input payload here matches that branch.
Refs: