feat(cli): gpu-dev debug - self-serve reservation diagnostics #216
Merged
Users couldn't tell why a reservation died or how to recover without lambda/kubectl access.

'gpu-dev debug [id]' renders, from DynamoDB only (no cluster access):
- failure reason (for ANY status, not just 'failed' like 'show')
- OOM events (count / last / container)
- full status-history timeline
- captured pod-logs snapshot
- recovery hints (cancel to free a dead box; disk unlock; re-reserve --disk)

Also adds oom_count/last_oom_at/oom_container/node_name/disk_name to get_connection_info (the OOM banner in _show_single_reservation was dead code since those keys were never populated).

Tests: tests/unit/cli/test_debug.py (6).
Adds `gpu-dev debug [reservation_id]` so users can diagnose why their own box died or won't connect, without lambda/CloudWatch or kubectl access.

Renders from DynamoDB (data the reservation + expiry lambdas already write):

- `failure_reason` for any status (the existing `show` only surfaces it on `failed`; an active-but-dead pod is exactly when you need it)
- OOM events (count / last / container)
- full `status_history` timeline
- captured pod-logs snapshot
- recovery hints: `gpu-dev cancel` a dead active box to free it + the disk, `gpu-dev disk unlock <name>`, re-reserve `--disk <name>`

Also fixes a latent bug: `oom_count`/`last_oom_at`/`oom_container` (plus `node_name`/`disk_name`) are now included in `get_connection_info`. The OOM banner in `_show_single_reservation` was dead code because those keys were never populated.

Resolves the ezyang thread ("gpu-dev says active but I can't ssh / I don't have lambda logs access"): with #211 deployed, a dead pod flips to `failed` + reason within ~1 min, and `gpu-dev debug` shows it.

Tests: `tests/unit/cli/test_debug.py` (6). Full suite green (1192 passed).

Note: this is the no-infra layer. Full raw lambda-log access is a separate follow-up (see discussion; leaning toward an on-demand CloudWatch Logs Insights query over S3 archival).
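For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the two pieces described above: passing the OOM and placement keys through to the rendering layer, and printing diagnostics for any status. It is not the code in this PR. The function names `build_connection_info` and `render_debug`, the layout of `status_history` entries (`timestamp` / `status`), and the output strings are all assumptions made for illustration; only the DynamoDB key names and the `gpu-dev` commands in the hints come from the description.

```python
from typing import Any

# Key names taken from the PR description; before the fix they were never
# copied into the connection info, so the OOM banner could not fire.
DIAGNOSTIC_KEYS = ("oom_count", "last_oom_at", "oom_container", "node_name", "disk_name")


def build_connection_info(item: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for get_connection_info: copy the rendered fields."""
    info = {
        "reservation_id": item.get("reservation_id"),
        "status": item.get("status"),
        "failure_reason": item.get("failure_reason"),
        "status_history": item.get("status_history", []),
    }
    for key in DIAGNOSTIC_KEYS:
        info[key] = item.get(key)  # missing keys become None, not KeyError
    return info


def render_debug(info: dict[str, Any]) -> list[str]:
    """Hypothetical renderer: return the lines a debug command might print."""
    lines = [f"Reservation {info['reservation_id']}: {info['status']}"]

    # Show the failure reason for any status, not only 'failed'.
    if info.get("failure_reason"):
        lines.append(f"Failure reason: {info['failure_reason']}")

    # DynamoDB numbers usually arrive as Decimal; int() normalises them.
    oom_count = int(info.get("oom_count") or 0)
    if oom_count > 0:
        lines.append(
            f"OOM events: {oom_count} "
            f"(last {info.get('last_oom_at')}, container {info.get('oom_container')})"
        )

    # Assumed entry layout: {"timestamp": ..., "status": ...}.
    for entry in info.get("status_history") or []:
        lines.append(f"  {entry.get('timestamp')}  {entry.get('status')}")

    # Recovery hints for a box that reports active but carries a failure reason.
    if info.get("status") == "active" and info.get("failure_reason"):
        lines.append("Hint: gpu-dev cancel the dead box to free it and its disk")
    if info.get("disk_name") and info.get("failure_reason"):
        lines.append(f"Hint: gpu-dev disk unlock {info['disk_name']}")
    return lines


if __name__ == "__main__":
    sample = {
        "reservation_id": "abc123",
        "status": "active",
        "failure_reason": "pod died",
        "oom_count": 2,
        "last_oom_at": "2024-01-01T00:00:00Z",
        "oom_container": "main",
        "disk_name": "scratch",
        "status_history": [{"timestamp": "t0", "status": "active"}],
    }
    print("\n".join(render_debug(build_connection_info(sample))))
```

Returning a list of lines rather than printing directly keeps the renderer easy to unit-test without capturing stdout, which is one plausible way the tests in `tests/unit/cli/test_debug.py` could be structured.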