feat(debug): gpu-dev debug --logs (on-demand lambda logs) + expiry log retention by wdvr · Pull Request #217 · wdvr/osdc

wdvr · 2026-06-16T19:15:11Z

Approach A: on-demand lambda logs (no S3 archival)

Gives users the real reservation and expiry Lambda logs for their own reservations without needing CloudWatch or kubectl access. The logs are fetched on demand by running a CloudWatch Logs Insights query through the processor Function URL.

Backend (`reservation_processor`)

The handler's Function URL branch now routes action == "get_logs" to handle_get_logs:
- Ownership is enforced: find_reservation_by_prefix(reservation_id, user_id=...) only ever resolves the caller's own reservations (404 otherwise).
- Runs the Logs Insights query `fields @timestamp,@message | filter @message like "<id8>" | sort @timestamp asc | limit 1000` over the /aws/lambda/<prefix>-reservation-processor and /aws/lambda/<prefix>-reservation-expiry log groups, scoped to the reservation's lifetime (created→ended, capped to 14d retention) so the scan is cheap.
- Returns {"lines":[{timestamp,message}...]}; falls back to the processor-only group if one is missing.

IAM: the processor role gains logs:StartQuery/GetLogEvents/FilterLogEvents (scoped to the two log groups) and logs:GetQueryResults/StopQuery.
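
For reviewers who want the query flow in one place, here is a minimal sketch of the window clamping and the start/poll cycle. It is not the code in this PR: `build_query`, `query_window`, `fetch_lines` and their parameters are hypothetical names, `<id8>` is assumed to be the first 8 characters of the reservation id, and `logs` is assumed to behave like a boto3 CloudWatch Logs client (`start_query`, `get_query_results`, `stop_query`). The processor-only fallback is left out.

```python
import time
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 14  # the 14d cap from the description above


def build_query(reservation_id):
    """Logs Insights query filtered to the reservation id prefix."""
    id8 = reservation_id[:8]  # assumption: <id8> is the first 8 characters
    return (
        "fields @timestamp,@message"
        f' | filter @message like "{id8}"'
        " | sort @timestamp asc"
        " | limit 1000"
    )


def query_window(created, ended, now):
    """Epoch-second (start, end) covering created→ended, clamped to retention."""
    end = min(ended or now, now)  # still-active reservations end at "now"
    start = max(created, now - timedelta(days=RETENTION_DAYS))
    start = min(start, end)  # never produce an inverted window
    return int(start.timestamp()), int(end.timestamp())


def to_line(row):
    """Flatten one Insights result row ([{field, value}, ...]) into a dict."""
    fields = {cell["field"]: cell["value"] for cell in row}
    return {"timestamp": fields.get("@timestamp"), "message": fields.get("@message")}


def fetch_lines(logs, group_names, reservation_id, created, ended,
                now=None, poll_s=1.0, max_polls=60):
    """Start the query, poll until it completes, return {"lines": [...]}."""
    now = now or datetime.now(timezone.utc)
    start, end = query_window(created, ended, now)
    query_id = logs.start_query(
        logGroupNames=list(group_names),
        startTime=start,
        endTime=end,
        queryString=build_query(reservation_id),
    )["queryId"]
    for _ in range(max_polls):
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] == "Complete":
            return {"lines": [to_line(row) for row in resp["results"]]}
        if resp["status"] in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"Insights query ended with status {resp['status']}")
        time.sleep(poll_s)
    logs.stop_query(queryId=query_id)  # do not leave the query running
    raise TimeoutError("Insights query did not complete in time")
```

With the defaults in this sketch, polling gives up after roughly 60 seconds, which would stay under the 70s client timeout mentioned in the CLI section. How the real handler reports an incomplete query may differ.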

CLI

`gpu-dev debug --logs [id]` calls ReservationManager.get_reservation_logs, which posts {action:"get_logs"} over the existing SigV4-signed Function URL (the same path --direct uses; _signed_post gained a timeout argument, set to 70s for Insights queries). The returned lines are rendered in a panel. No new Function URL or invoke permission is needed.
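
As a rough illustration of the client side, the sketch below builds the request body and turns the {"lines": [...]} response into printable text. It is not the CLI code from this PR: `build_get_logs_payload` and `render_lines` are hypothetical helpers, the `reservation_id` key name is an assumption (only `action` is confirmed by the description), and SigV4 signing and the panel rendering are omitted.

```python
import json


def build_get_logs_payload(reservation_id):
    """JSON body for the Function URL request.

    Only the "action" key comes from the PR description; the
    "reservation_id" key name is an assumption for illustration.
    """
    return json.dumps({"action": "get_logs", "reservation_id": reservation_id})


def render_lines(response_body):
    """Turn the {"lines": [{timestamp, message}, ...]} response into text."""
    lines = json.loads(response_body).get("lines", [])
    if not lines:
        return "(no log lines found)"
    return "\n".join(
        f'{line.get("timestamp", "")}  {line.get("message", "").rstrip()}'
        for line in lines
    )
```

Stripping trailing whitespace from each message keeps Lambda's trailing newlines from producing blank rows in the output.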

Also: expiry log retention

The reservation-expiry log group was auto-created with no retention policy and had grown unbounded (multi-GB). It is now a managed resource with retention_in_days = 30.

Deploy

- tofu apply (processor lambda code + IAM) in prod and east1.
  One-time per workspace: import the pre-existing expiry log group before the apply (otherwise it fails with "already exists"):
  tofu import aws_cloudwatch_log_group.reservation_expiry_log_group /aws/lambda/pytorch-gpu-dev-reservation-expiry
- Release the CLI (gpu-dev debug --logs).

gpu-dev debug (no --logs, merged in #216) already works today with zero infra; this PR adds the raw-logs deep-dive on top.

Tests: test_get_logs.py (4) + test_debug.py --logs cases. Full suite 1199 passed.

…Watch + expiry log retention

Users can't reach CloudWatch/lambda logs. Add an on-demand path:
- processor Function URL gains an action='get_logs' branch -> handle_get_logs(): verifies ownership (find_reservation_by_prefix with user_id), runs a CloudWatch Logs Insights query filtered to the reservation id across the processor + expiry log groups (scoped to the reservation's lifetime), returns the lines.
- IAM: processor role gets logs:StartQuery/GetLogEvents/FilterLogEvents (scoped to the two log groups) + logs:GetQueryResults/StopQuery.
- CLI: 'gpu-dev debug --logs' calls it via the existing SigV4 Function URL path (ReservationManager.get_reservation_logs; _signed_post gains a timeout arg, 70s). No new infra/invoke perms; rides the same Function URL as --direct claims.

Also: manage the expiry lambda's CloudWatch log group with retention_in_days=30 (it was auto-created with NO retention and had grown unbounded, multi-GB). Needs a one-time 'tofu import' per workspace (command in the resource comment).

Tests: tests/unit/lambda_fn/test_get_logs.py (4: fields/ownership/happy/incomplete) + tests/unit/cli/test_debug.py --logs cases. Full suite 1199 passed.

wdvr merged commit b80556d into main Jun 16, 2026
3 checks passed

wdvr deleted the feat/gpu-dev-debug-logs branch June 16, 2026 19:16

wdvr merged 1 commit into main from feat/gpu-dev-debug-logs
