feat(debug): gpu-dev debug --logs (on-demand lambda logs) + expiry log retention#217

Merged: wdvr merged 1 commit into main from feat/gpu-dev-debug-logs on Jun 16, 2026

wdvr (Owner) commented on Jun 16, 2026

Approach A: on-demand lambda logs (no S3 archival)

Gives users the real reservation/expiry lambda logs for their own reservation without CloudWatch/kubectl access — by querying CloudWatch Logs Insights on demand through the processor Function URL.

Backend (reservation_processor)

  • handler Function-URL branch now routes action == "get_logs" -> handle_get_logs:
    • Ownership enforced: find_reservation_by_prefix(reservation_id, user_id=...) only ever resolves the caller's own reservations (404 otherwise).
    • Runs the Logs Insights query fields @timestamp,@message | filter @message like "<id8>" | sort @timestamp asc | limit 1000 over /aws/lambda/<prefix>-reservation-processor and /aws/lambda/<prefix>-reservation-expiry, scoped to the reservation's lifetime (created→ended, capped to 14d retention) so the scan stays cheap.
    • Returns {"lines":[{timestamp,message}...]}; falls back to the processor log group alone if the expiry group is missing.
  • IAM: processor role gains logs:StartQuery/GetLogEvents/FilterLogEvents (scoped to the two log groups) and logs:GetQueryResults/StopQuery.
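
For reviewers who want the query flow spelled out, here is a minimal sketch of what the backend bullets describe. It is not the code in this PR: build_query, query_window and run_query are hypothetical names, "<id8>" is assumed to mean the first 8 characters of the reservation id, and `logs` is assumed to be a boto3 CloudWatch Logs client (only start_query, get_query_results and stop_query are called). What the real handler does on failure or timeout may differ.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Sequence, Tuple

RETENTION = timedelta(days=14)  # retention cap quoted in the PR description
FAILED_STATES = {"Failed", "Cancelled", "Timeout"}


def build_query(reservation_id: str) -> str:
    """Build the Insights query string from the PR description."""
    # ASSUMPTION: "<id8>" is the first 8 characters of an already validated id.
    id8 = reservation_id[:8]
    return (
        "fields @timestamp,@message"
        f' | filter @message like "{id8}"'
        " | sort @timestamp asc"
        " | limit 1000"
    )


def query_window(created: datetime, ended: Optional[datetime],
                 now: Optional[datetime] = None) -> Tuple[int, int]:
    """Clamp created..ended to the retention window; returns epoch seconds."""
    now = now or datetime.now(timezone.utc)
    end = min(ended or now, now)  # a still-active reservation ends "now"
    start = min(max(created, now - RETENTION), end)
    return int(start.timestamp()), int(end.timestamp())


def run_query(logs, groups: Sequence[str], query: str, start: int, end: int,
              timeout_s: float = 60.0, poll_s: float = 1.0,
              sleep=time.sleep) -> Dict[str, List[Dict]]:
    """Start the query, poll until it completes, and shape the response."""
    query_id = logs.start_query(
        logGroupNames=list(groups),
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]
    deadline = time.monotonic() + timeout_s
    while True:
        resp = logs.get_query_results(queryId=query_id)
        status = resp["status"]
        if status == "Complete":
            break
        if status in FAILED_STATES:
            raise RuntimeError(f"Insights query ended with status {status}")
        if time.monotonic() >= deadline:
            logs.stop_query(queryId=query_id)  # stop paying for the scan
            raise TimeoutError("Insights query did not complete in time")
        sleep(poll_s)
    lines = []
    for row in resp["results"]:
        # Each result row is a list of {"field": ..., "value": ...} cells.
        cells = {cell["field"]: cell["value"] for cell in row}
        lines.append({"timestamp": cells.get("@timestamp"),
                      "message": cells.get("@message")})
    return {"lines": lines}
```

Passing the client and the sleep function in as arguments keeps the sketch testable without AWS access. The polling loop is also why the IAM change needs logs:GetQueryResults and logs:StopQuery alongside logs:StartQuery.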

CLI

  • gpu-dev debug --logs [id] -> ReservationManager.get_reservation_logs posts {action:"get_logs"} over the existing SigV4 Function URL (the same path --direct uses; _signed_post gained a timeout arg, set to 70s for Insights). Renders the lines in a panel. No new Function URL or invoke permission is needed.
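
A rough sketch of the client side, under stated assumptions: `post` stands in for _signed_post, whose real signature is not shown in this PR description, the "reservation_id" payload key is a guess, and plain-text formatting replaces the panel the CLI actually renders.

```python
from typing import Callable, Dict, List

INSIGHTS_TIMEOUT_S = 70  # timeout quoted in the PR description


def get_reservation_logs(post: Callable[..., Dict], reservation_id: str) -> List[Dict]:
    """POST the get_logs action and return the list of log lines."""
    # ASSUMPTION: the handler reads the id from a "reservation_id" key.
    payload = {"action": "get_logs", "reservation_id": reservation_id}
    body = post(payload, timeout=INSIGHTS_TIMEOUT_S)
    return body.get("lines", [])


def format_lines(lines: List[Dict]) -> str:
    """Render {timestamp, message} entries as one line of text each."""
    if not lines:
        return "No log lines found for this reservation."
    return "\n".join(
        f"{line.get('timestamp', '?')}  {str(line.get('message', '')).rstrip()}"
        for line in lines
    )
```

The longer timeout matters because the request blocks while the Insights query is polled to completion on the lambda side.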

Also: expiry log retention

The reservation-expiry log group was auto-created with no retention and had grown unbounded (multi-GB). It is now managed in tofu with retention_in_days = 30.
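
Not part of this PR, but a small audit helper along these lines could catch other auto-created log groups with the same problem. It is a sketch that assumes `logs` is a boto3 CloudWatch Logs client; groups_without_retention is a hypothetical name. It relies on describe_log_groups omitting retentionInDays for groups that never expire.

```python
from typing import List, Tuple


def groups_without_retention(logs, prefix: str) -> List[Tuple[str, int]]:
    """Return (name, storedBytes) for log groups that never expire."""
    missing = []
    kwargs = {"logGroupNamePrefix": prefix}
    while True:
        page = logs.describe_log_groups(**kwargs)
        for group in page.get("logGroups", []):
            # No retentionInDays key means the group keeps events forever.
            if "retentionInDays" not in group:
                missing.append((group["logGroupName"], group.get("storedBytes", 0)))
        token = page.get("nextToken")
        if not token:
            return missing
        kwargs["nextToken"] = token  # fetch the next page
```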

Deploy

  1. tofu apply (processor lambda code + IAM) — prod + east1.
  2. One-time per workspace: import the pre-existing expiry log group before running apply, otherwise apply fails with "already exists":
    tofu import aws_cloudwatch_log_group.reservation_expiry_log_group /aws/lambda/pytorch-gpu-dev-reservation-expiry
  3. Release the CLI (gpu-dev debug --logs).

gpu-dev debug (no --logs, merged in #216) already works today with zero infra — this PR adds the raw-logs deep-dive on top.

Tests: test_get_logs.py (4) + test_debug.py --logs cases. Full suite 1199 passed.

…Watch + expiry log retention

Users can't reach CloudWatch/lambda logs. Add an on-demand path:
- processor Function URL gains an action='get_logs' branch -> handle_get_logs():
  verifies ownership (find_reservation_by_prefix with user_id), runs a CloudWatch
  Logs Insights query filtered to the reservation id across the processor + expiry
  log groups (scoped to the reservation's lifetime), returns the lines.
- IAM: processor role gets logs:StartQuery/GetLogEvents/FilterLogEvents (scoped to
  the two log groups) + logs:GetQueryResults/StopQuery.
- CLI: 'gpu-dev debug --logs' calls it via the existing SigV4 Function URL path
  (ReservationManager.get_reservation_logs; _signed_post gains a timeout arg, 70s).
  No new infra/invoke perms — rides the same Function URL as --direct claims.

Also: manage the expiry lambda's CloudWatch log group with retention_in_days=30
(it was auto-created with NO retention and had grown unbounded, multi-GB). Needs a
one-time 'tofu import' per workspace (command in the resource comment).

Tests: tests/unit/lambda_fn/test_get_logs.py (4: fields/ownership/happy/incomplete)
+ tests/unit/cli/test_debug.py --logs cases. Full suite 1199 passed.

wdvr merged commit b80556d into main on Jun 16, 2026
3 checks passed
wdvr deleted the feat/gpu-dev-debug-logs branch on June 16, 2026 19:16