
[CRCR] Add EventBridge sweeper configuration and interval variable #806

Draft: can-gaa-hou wants to merge 1 commit into main from eventbridge
can-gaa-hou (Collaborator) commented Jun 23, 2026

Overview

Adds an EventBridge scheduled rule that triggers the existing CRCR callback lambda every 10 minutes (configurable via the new sweeper_interval_minutes variable) to drive the active "zombie job" sweeper.

  • aws_cloudwatch_event_rule.sweeper: rate(10 minutes) schedule by default.
  • aws_cloudwatch_event_target.sweeper: targets the callback lambda with a fixed payload {"source": "crcr.sweeper"}, which the handler uses to route into the cleanup branch.
  • aws_lambda_permission.sweeper_invoke: allows events.amazonaws.com to invoke the callback lambda, scoped to this rule's ARN.
  • New sweeper_interval_minutes variable (default 10).

Background

Redis keyspace expiration events aren't well-suited for stateless Lambdas, because they are delivered over pub/sub and need a subscriber that stays connected. Instead, we use an EventBridge scheduled rule plus a Redis ZSET scored by each job's expected timeout. In-progress jobs are tracked in the ZSET and removed on normal completion; anything left past its expiry is a zombie. On each tick the sweeper reaps zombies: it marks them as timed out in HUD and on GitHub, then clears them from the cache.
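The ZSET bookkeeping described above can be sketched roughly as follows. This is an illustration only, not the code from the companion PR: the key name `crcr:in_progress`, the helpers `track_job`, `complete_job` and `sweep`, and the `mark_timed_out` callback are all hypothetical, and `InMemoryZSet` is a small stand-in so the example runs without a Redis server. It mirrors only the `zadd`, `zrangebyscore` and `zrem` calls that a redis-py client also exposes.

```python
import time
from typing import Callable, Dict, List, Optional

ZSET_KEY = "crcr:in_progress"  # hypothetical key name, not taken from the PR


class InMemoryZSet:
    """Stand-in for the three sorted-set commands used below."""

    def __init__(self) -> None:
        self._sets: Dict[str, Dict[str, float]] = {}

    def zadd(self, name: str, mapping: Dict[str, float]) -> int:
        zset = self._sets.setdefault(name, {})
        added = sum(1 for member in mapping if member not in zset)
        zset.update(mapping)
        return added

    def zrangebyscore(self, name: str, lo: float, hi: float) -> List[str]:
        zset = self._sets.get(name, {})
        hits = [(score, member) for member, score in zset.items() if lo <= score <= hi]
        return [member for _, member in sorted(hits)]

    def zrem(self, name: str, *values: str) -> int:
        zset = self._sets.get(name, {})
        return sum(1 for value in values if zset.pop(value, None) is not None)


def track_job(store, job_id: str, timeout_seconds: float, now: Optional[float] = None) -> None:
    """Record an in-progress job, scored by the time at which it is expected to expire."""
    now = time.time() if now is None else now
    store.zadd(ZSET_KEY, {job_id: now + timeout_seconds})


def complete_job(store, job_id: str) -> None:
    """Normal completion: the job leaves the ZSET and is never seen by the sweeper."""
    store.zrem(ZSET_KEY, job_id)


def sweep(store, mark_timed_out: Callable[[str], None], now: Optional[float] = None) -> List[str]:
    """Reap every job whose expiry score is at or before `now`."""
    now = time.time() if now is None else now
    zombies = store.zrangebyscore(ZSET_KEY, 0, now)
    for job_id in zombies:
        mark_timed_out(job_id)  # e.g. report the timeout to HUD and GitHub
        store.zrem(ZSET_KEY, job_id)
    return zombies
```

With a real redis-py client in place of `InMemoryZSet`, members come back as bytes unless the client is created with `decode_responses=True`. A second sweep over the same state is a no-op, since reaped jobs have already been removed from the ZSET.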

Dependency / deploy order

This is the infra half of the feature. The callback lambda's routing + cleanup logic lives in pytorch/test-infra#8198 (if event.get("source") == "crcr.sweeper"). The input payload here matches that branch.
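For orientation, the routing that this payload relies on might look roughly like the sketch below. Only the `event.get("source") == "crcr.sweeper"` check comes from the description above; `make_handler`, `run_sweeper` and `handle_callback` are hypothetical names used for illustration, and the actual handler lives in the companion PR.

```python
from typing import Any, Callable, Dict

SWEEPER_SOURCE = "crcr.sweeper"  # must match the fixed payload set on the EventBridge target


def make_handler(
    run_sweeper: Callable[[], Any],
    handle_callback: Callable[[Dict[str, Any]], Any],
) -> Callable[..., Any]:
    """Build a Lambda-style handler that routes sweeper ticks away from normal callbacks."""

    def handler(event: Dict[str, Any], context: Any = None) -> Any:
        # Scheduled ticks carry the fixed payload; everything else is a regular callback.
        if event.get("source") == SWEEPER_SOURCE:
            return run_sweeper()
        return handle_callback(event)

    return handler
```

Because the branch keys on a single field, deploying the rule before the handler change would presumably send sweeper ticks down the regular callback path, which is why the deploy order matters here.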

Refs:


atalman (Contributor) left a comment

Potential bug with rate() singular/plural grammar

If sweeper_interval_minutes is set to 1, the schedule expression becomes rate(1 minutes), which is invalid — AWS requires rate(1 minute) (singular). Consider either adding a validation constraint:

variable "sweeper_interval_minutes" {
  ...
  validation {
    condition     = var.sweeper_interval_minutes >= 2
    error_message = "Must be >= 2 to avoid rate() singular/plural grammar issue."
  }
}

Or handling it with a conditional in the expression:

schedule_expression = "rate(${var.sweeper_interval_minutes} ${var.sweeper_interval_minutes == 1 ? "minute" : "minutes"})"
