[1/4 Add AWS integration layer #74

Closed
ericc59 wants to merge 5 commits into main from add-aws-integration

Conversation

@ericc59 ericc59 commented Jan 27, 2026

Summary

Adds foundational AWS integration components for distributed benchmark execution:

  • DynamoDBProgressManager: Distributed task coordination with atomic claim/complete/fail operations, FAILED_PERMANENT status for exhausted retries, run-level cost/token aggregate rollups
  • DistributedRateLimiter: DynamoDB-backed token bucket with rate/period interface matching provider_config.yml, exponential backoff on contention
  • ExecutionContext: Environment detection for LOCAL/EC2/FARGATE/BATCH (handles IMDSv2 401/403 responses)
  • Dependencies: boto3 moved to [aws] extras, moto to [test] extras
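For the ExecutionContext bullet, a minimal sketch of how environment detection along these lines could work. This is not the code in this PR: the enum, the function names, and the choice of environment variables are assumptions for illustration. The probe result is passed in as a plain status code so the logic stays testable without network access.

```python
from enum import Enum
from typing import Mapping, Optional


class ExecutionEnvironment(Enum):
    LOCAL = "local"
    EC2 = "ec2"
    FARGATE = "fargate"
    BATCH = "batch"


def imds_reachable(status_code: Optional[int]) -> bool:
    """Interpret the result of probing the instance metadata service.

    A 401 or 403 still means the metadata service answered (for example,
    IMDSv2 rejects requests that lack a session token), so the host is
    treated as EC2. None means the probe timed out or failed to connect.
    """
    return status_code in (200, 401, 403)


def detect_environment(
    env: Mapping[str, str], imds_status: Optional[int]
) -> ExecutionEnvironment:
    # Assumption: Batch is checked first because a Batch job can itself
    # run on Fargate or EC2 capacity.
    if env.get("AWS_BATCH_JOB_ID"):
        return ExecutionEnvironment.BATCH
    # Assumption: Fargate tasks are identified via AWS_EXECUTION_ENV.
    if env.get("AWS_EXECUTION_ENV") == "AWS_ECS_FARGATE":
        return ExecutionEnvironment.FARGATE
    if imds_reachable(imds_status):
        return ExecutionEnvironment.EC2
    return ExecutionEnvironment.LOCAL
```

In real use the caller would pass `os.environ` and the HTTP status from a short-timeout request to the metadata endpoint, or `None` if that request failed.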

Part of

This is PR 1/3 for AWS distributed execution:

  1. This PR: AWS integration layer
  2. Batch worker + Dockerfile (depends on 1)
  3. Lambda functions (depends on 1)

Test plan

  • All 20 AWS integration tests pass
  • Manual verification with mocked DynamoDB

…iting, environment detection

- DynamoDBProgressManager for distributed task coordination with atomic claim/complete/fail operations
- DistributedRateLimiter using DynamoDB with rate/period interface and exponential backoff on contention
- ExecutionContext detection for LOCAL/EC2/FARGATE/BATCH environments (handles IMDSv2)
- FAILED_PERMANENT status prevents exhausted-retry tasks from re-queuing
- Run-level cost/token aggregate rollups
- boto3 moved to [aws] extras, moto to [test] extras
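
To illustrate the rate/period interface and the backoff mentioned above, here is a rough in-memory sketch of token bucket arithmetic. It is an assumption-laden stand-in, not the DistributedRateLimiter itself: the class, field names, and backoff constants are hypothetical, and the real limiter keeps this state in DynamoDB rather than in a Python object.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: int           # requests allowed per period
    period: float       # period length in seconds
    tokens: float       # tokens currently available
    last_refill: float  # timestamp of the last refill, in seconds

    @classmethod
    def full(cls, rate: int, period: float, now: float) -> "TokenBucket":
        """Start with a full bucket."""
        return cls(rate=rate, period=period, tokens=float(rate), last_refill=now)

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = max(0.0, now - self.last_refill)
        refill = elapsed * self.rate / self.period
        self.tokens = min(float(self.rate), self.tokens + refill)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 2.0) -> float:
    """Exponential backoff (no jitter) for retrying after a contended write."""
    return min(cap, base * (2 ** attempt))
```

With `rate=2, period=60`, two calls succeed immediately, a third is refused, and one more token becomes available after 30 seconds. In a distributed version, the read-modify-write in `try_acquire` would need to be a conditional write, with `backoff_delay` spacing out retries when that write loses a race.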
@ericc59 ericc59 force-pushed the add-aws-integration branch from b6c2492 to fdaddb8 January 27, 2026 23:42
@ericc59 ericc59 force-pushed the add-aws-integration branch from b851227 to cffe810 January 27, 2026 23:46
@ericc59 ericc59 changed the title Add AWS integration layer (PR 1/3) Add AWS integration layer (PR 1/4) Jan 29, 2026
@ericc59 ericc59 changed the title Add AWS integration layer (PR 1/4) [1/4 Add AWS integration layer Jan 29, 2026
@ericc59 ericc59 force-pushed the add-aws-integration branch from ab53802 to abeecff January 29, 2026 17:59

hlfshell commented Jan 29, 2026

I could be missing something here, but: we lock the task in claim_task so that no other worker picks it up, and use a conditional update to prevent a race condition. This is a great pattern. But I don't see anywhere that we periodically update this claimed task while it is being worked on, which would form a watchdog timer pattern. Without this, a task can be claimed, our daemon process can die, and the task is never unlocked.

get_pending_tasks correctly filters for PENDING and FAILED, but not for IN_PROGRESS paired with a "last updated" timeout, which would get around this. If we were not confident in the execution time of our daemonized process, we would need the watchdog timer pattern; otherwise we can get away with adding a simple "IN_PROGRESS and not updated in 15 minutes" condition to that query to catch these stray tasks.

...I could just be missing a task timeout somewhere too.
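
For reference, a sketch of the two conditional writes under discussion: the claim, and a heartbeat that would implement the watchdog pattern. The key schema (`run_id`, `task_id`) and the `worker_id` and `updated_at` attribute names are assumptions, not taken from the PR. The functions only build keyword arguments for a boto3 `Table.update_item` call, so the expressions can be inspected without AWS access.

```python
from typing import Any, Dict


def build_claim_request(
    run_id: str, task_id: str, worker_id: str, now: int
) -> Dict[str, Any]:
    """Kwargs for update_item that claim a task only if it is still claimable."""
    return {
        "Key": {"run_id": run_id, "task_id": task_id},
        # "status" is a DynamoDB reserved word, so it is aliased as #s.
        "UpdateExpression": (
            "SET #s = :in_progress, worker_id = :worker, updated_at = :now"
        ),
        # The write fails if another worker already moved the task on.
        "ConditionExpression": "#s IN (:pending, :failed)",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {
            ":in_progress": "IN_PROGRESS",
            ":pending": "PENDING",
            ":failed": "FAILED",
            ":worker": worker_id,
            ":now": now,
        },
    }


def build_heartbeat_request(
    run_id: str, task_id: str, worker_id: str, now: int
) -> Dict[str, Any]:
    """Kwargs for update_item that refresh updated_at while a worker holds a task."""
    return {
        "Key": {"run_id": run_id, "task_id": task_id},
        "UpdateExpression": "SET updated_at = :now",
        # Only the worker that owns the claim may refresh it.
        "ConditionExpression": "#s = :in_progress AND worker_id = :worker",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {
            ":in_progress": "IN_PROGRESS",
            ":worker": worker_id,
            ":now": now,
        },
    }
```

Passing either dict as `table.update_item(**request)` raises `botocore.exceptions.ClientError` with error code `ConditionalCheckFailedException` when the condition does not hold, which is how a worker learns it lost the race or no longer owns the task.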

When a worker dies mid-task, the task stays IN_PROGRESS forever.
This adds stale task detection to prevent task lockout:

- get_pending_tasks() now returns IN_PROGRESS tasks that haven't
  been updated within stale_timeout_seconds (default 15 minutes)
- claim_task() allows reclaiming stale IN_PROGRESS tasks
- Timeout configurable, set to 0 to disable
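
The stale-task rule described in this commit message can be sketched as a pure function over task records. This is an illustration under assumptions, not the committed code: the record shape (`status`, `updated_at` as epoch seconds) and the helper names are hypothetical, and the condition string shows one way the reclaim check might be expressed for DynamoDB.

```python
from typing import Any, Dict, Iterable, List

DEFAULT_STALE_TIMEOUT_SECONDS = 15 * 60

# Hypothetical ConditionExpression for reclaiming; :cutoff would be
# now - stale_timeout_seconds and #s an alias for the "status" attribute.
RECLAIM_CONDITION = (
    "#s IN (:pending, :failed) "
    "OR (#s = :in_progress AND updated_at < :cutoff)"
)


def is_claimable(
    task: Dict[str, Any],
    now: int,
    stale_timeout_seconds: int = DEFAULT_STALE_TIMEOUT_SECONDS,
) -> bool:
    status = task["status"]
    if status in ("PENDING", "FAILED"):
        return True
    # A timeout of 0 disables stale detection entirely.
    if status == "IN_PROGRESS" and stale_timeout_seconds > 0:
        return now - task["updated_at"] > stale_timeout_seconds
    # COMPLETED, FAILED_PERMANENT, and fresh IN_PROGRESS tasks are skipped.
    return False


def filter_claimable(
    tasks: Iterable[Dict[str, Any]],
    now: int,
    stale_timeout_seconds: int = DEFAULT_STALE_TIMEOUT_SECONDS,
) -> List[Dict[str, Any]]:
    return [t for t in tasks if is_claimable(t, now, stale_timeout_seconds)]
```

The timeout should comfortably exceed the longest expected task duration; otherwise a slow but healthy worker can have its task reclaimed and executed twice.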

ericc59 commented Jan 30, 2026

Good catch. Pushed a commit to address that.

@ericc59 ericc59 closed this Jan 30, 2026
@ericc59 ericc59 deleted the add-aws-integration branch January 30, 2026 16:49