feat(debug): gpu-dev debug --logs (on-demand lambda logs) + expiry log retention #217
Merged
…Watch + expiry log retention

Users can't reach CloudWatch/lambda logs. Add an on-demand path:
- processor Function URL gains an action='get_logs' branch -> handle_get_logs(): verifies ownership (find_reservation_by_prefix with user_id), runs a CloudWatch Logs Insights query filtered to the reservation id across the processor + expiry log groups (scoped to the reservation's lifetime), returns the lines.
- IAM: processor role gets logs:StartQuery/GetLogEvents/FilterLogEvents (scoped to the two log groups) + logs:GetQueryResults/StopQuery.
- CLI: 'gpu-dev debug --logs' calls it via the existing SigV4 Function URL path (ReservationManager.get_reservation_logs; _signed_post gains a timeout arg, 70s). No new infra/invoke perms; it rides the same Function URL as --direct claims.

Also: manage the expiry lambda's CloudWatch log group with retention_in_days=30 (it was auto-created with NO retention and had grown unbounded, multi-GB). Needs a one-time 'tofu import' per workspace (command in the resource comment).

Tests: tests/unit/lambda_fn/test_get_logs.py (4: fields/ownership/happy/incomplete) + tests/unit/cli/test_debug.py --logs cases. Full suite 1199 passed.
Approach A: on-demand lambda logs (no S3 archival)
Gives users the real reservation/expiry lambda logs for their own reservation without CloudWatch/kubectl access — by querying CloudWatch Logs Insights on demand through the processor Function URL.
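As a rough illustration of the on-demand query (not the code in this PR), the sketch below runs a Logs Insights query through a boto3-style `logs` client and reshapes the rows into `{"lines": [...]}`. The helper names (`build_query`, `query_window`, `run_insights_query`), the reading of `<id8>` as the first 8 characters of the reservation id, the empty-window shortcut, and the polling and timeout values are all assumptions made for illustration.

```python
import time
from datetime import datetime, timedelta, timezone

# From the PR description: the window is capped to the 14d log retention.
MAX_LOOKBACK = timedelta(days=14)


def build_query(reservation_id: str) -> str:
    """Insights query from the PR; "<id8>" is assumed to be the first 8 chars."""
    id8 = reservation_id[:8]  # assumes the id was validated upstream (no quotes)
    return (
        "fields @timestamp, @message"
        f' | filter @message like "{id8}"'
        " | sort @timestamp asc | limit 1000"
    )


def query_window(created: datetime, ended, now: datetime):
    """Clamp the scan to the reservation's lifetime (created -> ended or now)."""
    end = ended or now
    start = max(created, now - MAX_LOOKBACK)
    return int(start.timestamp()), int(end.timestamp())


def run_insights_query(logs, log_groups, reservation_id, created, ended=None,
                       now=None, timeout_s=60.0, poll_s=1.0):
    """Start the query, poll until it completes, return {"lines": [...]}.

    `logs` is expected to behave like boto3.client("logs").
    """
    now = now or datetime.now(timezone.utc)
    start_ts, end_ts = query_window(created, ended, now)
    if end_ts <= start_ts:
        # Reservation ended before the retained window: nothing to scan.
        return {"lines": []}

    query_id = logs.start_query(
        logGroupNames=list(log_groups),
        startTime=start_ts,
        endTime=end_ts,
        queryString=build_query(reservation_id),
    )["queryId"]

    deadline = time.monotonic() + timeout_s
    while True:
        resp = logs.get_query_results(queryId=query_id)
        status = resp["status"]
        if status == "Complete":
            break
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"Logs Insights query ended with status {status}")
        if time.monotonic() >= deadline:
            logs.stop_query(queryId=query_id)  # do not leave the query running
            raise TimeoutError("Logs Insights query did not complete in time")
        time.sleep(poll_s)

    lines = []
    for row in resp["results"]:
        # Each row is a list of {"field": ..., "value": ...} cells.
        cells = {cell["field"]: cell["value"] for cell in row}
        lines.append({"timestamp": cells.get("@timestamp"),
                      "message": cells.get("@message")})
    return {"lines": lines}
```

Passing the client in as an argument keeps the sketch testable with a stub. In a Lambda it would presumably be created once at module load with `boto3.client("logs")`.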
Backend (`reservation_processor`)
`handler` Function-URL branch now routes `action == "get_logs"` → `handle_get_logs`:
- Ownership: `find_reservation_by_prefix(reservation_id, user_id=...)` only ever resolves the caller's own reservations (404 otherwise).
- Query: `fields @timestamp,@message | filter @message like "<id8>" | sort @timestamp asc | limit 1000` over `/aws/lambda/<prefix>-reservation-processor` + `-reservation-expiry`, scoped to the reservation's lifetime (created→ended, capped to 14d retention) so the scan is cheap.
- Response: `{"lines":[{timestamp,message}...]}`; falls back to the processor-only group if one is missing.
- IAM: `logs:StartQuery/GetLogEvents/FilterLogEvents` (scoped to the two log groups) and `logs:GetQueryResults/StopQuery`.

CLI
`gpu-dev debug --logs [id]` → `ReservationManager.get_reservation_logs` posts `{action:"get_logs"}` over the existing SigV4 Function URL (the same path `--direct` uses; `_signed_post` got a `timeout` arg, 70s for Insights). Renders the lines in a panel. No new Function URL or invoke permission.

Also: expiry log retention
The `reservation-expiry` log group was auto-created with no retention and had grown unbounded (multi-GB). Now managed with `retention_in_days = 30`.

Deploy
- `tofu apply` (processor lambda code + IAM): prod + east1.
- One-time per workspace: `tofu import aws_cloudwatch_log_group.reservation_expiry_log_group /aws/lambda/pytorch-gpu-dev-reservation-expiry`
- … `gpu-dev debug --logs`).

`gpu-dev debug` (no `--logs`, merged in #216) already works today with zero infra; this PR adds the raw-logs deep-dive on top.

Tests: `test_get_logs.py` (4) + `test_debug.py` `--logs` cases. Full suite 1199 passed.
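For readers who want a picture of the request handling described under Backend, here is a minimal sketch of the routing and ownership check. It is not the PR's implementation: the `reservation_id` body field, the 400 responses, the error payloads, and the injected `find_reservation` / `fetch_lines` callables are hypothetical, and `user_id` is passed in because the PR text does not show how the real handler derives the caller's identity. Only the `action == "get_logs"` route, the 404 for reservations that are not the caller's, and the `{"lines": [...]}` shape come from the description.

```python
import json


def _response(status: int, payload: dict) -> dict:
    """Lambda Function URL style response (statusCode + JSON body)."""
    return {"statusCode": status, "body": json.dumps(payload)}


def handle_get_logs(body: dict, user_id, find_reservation, fetch_lines) -> dict:
    reservation_id = body.get("reservation_id")  # hypothetical field name
    if not reservation_id or not user_id:
        return _response(400, {"error": "reservation_id and user_id are required"})

    # The lookup is always scoped to the caller, so another user's
    # reservation is indistinguishable from a missing one.
    reservation = find_reservation(reservation_id, user_id=user_id)
    if reservation is None:
        return _response(404, {"error": "reservation not found"})

    return _response(200, {"lines": fetch_lines(reservation)})


def handler(event: dict, user_id, find_reservation, fetch_lines) -> dict:
    """Route a Function URL event; only the get_logs branch is sketched."""
    # Base64-encoded bodies are ignored here for brevity.
    body = json.loads(event.get("body") or "{}")
    if body.get("action") == "get_logs":
        return handle_get_logs(body, user_id, find_reservation, fetch_lines)
    return _response(400, {"error": "unsupported action"})
```

Injecting the lookup and the log fetcher keeps the branch easy to unit-test for missing fields, ownership failures, and the happy path without touching AWS.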