feat(cli): gpu-dev debug — self-serve reservation diagnostics#216

Merged: wdvr merged 1 commit into main from feat/gpu-dev-debug-command on Jun 16, 2026

@wdvr (Owner) commented Jun 16, 2026

Adds gpu-dev debug [reservation_id] so users can diagnose why their own box died or won't connect — without lambda/CloudWatch or kubectl access.

Renders from DynamoDB (data the reservation + expiry lambdas already write):

  • Why it ended: failure_reason, shown for any status (the existing show only surfaces it on failed; an active-but-dead pod is exactly when you need it)
  • OOM events: count / last time / container
  • Status timeline: the full status_history
  • Captured pod logs: the lambda's snapshot
  • Recovery hints: gpu-dev cancel on a dead active box to free it and its disk, gpu-dev disk unlock <name>, re-reserve with --disk <name>
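
As a rough illustration of the rendering described in the list above, here is a minimal sketch that turns one reservation record into a text report. It is not the code from this PR: `render_debug`, the `pod_logs` and `reservation_id` keys, and the shape of `status_history` entries (`status` / `timestamp`) are assumptions made for the example; only `failure_reason`, `oom_count`, `last_oom_at`, `oom_container`, `status_history` and `disk_name` are named in the description.

```python
from typing import Any


def render_debug(item: dict[str, Any]) -> str:
    """Build a plain-text diagnostic report from one reservation record.

    Hypothetical helper: the real command may structure its output differently.
    """
    rid = item.get("reservation_id", "?")  # key name is an assumption
    status = item.get("status", "unknown")
    lines = [f"Reservation {rid} (status: {status})"]

    # Show the failure reason for ANY status, not only 'failed'.
    reason = item.get("failure_reason")
    if reason:
        lines.append(f"Why it ended: {reason}")

    # OOM summary; int() also accepts the Decimal values DynamoDB returns.
    oom_count = int(item.get("oom_count") or 0)
    if oom_count > 0:
        lines.append(
            f"OOM events: {oom_count} "
            f"(last: {item.get('last_oom_at', 'unknown')}, "
            f"container: {item.get('oom_container', 'unknown')})"
        )

    # Timeline; the entry shape {"status", "timestamp"} is an assumption.
    history = item.get("status_history") or []
    if history:
        lines.append("Status timeline:")
        for entry in history:
            lines.append(f"  {entry.get('timestamp', '?')}  {entry.get('status', '?')}")

    # Captured pod logs; the 'pod_logs' key is an assumption.
    logs = item.get("pod_logs")
    if logs:
        lines.append("Captured pod logs:")
        lines.extend(f"  {ln}" for ln in logs.splitlines())

    # Recovery hints, mirroring the ones listed in the description.
    hints = []
    if status == "active" and reason:
        hints.append("run gpu-dev cancel to free the dead box and its disk")
    disk = item.get("disk_name")
    if disk:
        hints.append(f"gpu-dev disk unlock {disk}")
        hints.append(f"re-reserve with --disk {disk}")
    if hints:
        lines.append("Recovery hints:")
        lines.extend(f"  - {h}" for h in hints)

    return "\n".join(lines)
```

Because the function only reads a dictionary, it can be exercised in unit tests without any cluster or AWS access, which matches the "DynamoDB only" goal of the command.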

Also fixes a latent bug: oom_count/last_oom_at/oom_container (plus node_name/disk_name) are now included in get_connection_info — the OOM banner in _show_single_reservation was dead code because those keys were never populated.
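
To make the latent bug concrete, the sketch below shows how a banner keyed on `oom_count` stays silent when the connection-info dictionary never carries that key, and fires once the keys are copied through. `build_connection_info` and `oom_banner` are hypothetical stand-ins for `get_connection_info` and the banner logic in `_show_single_reservation`; the real functions are not shown in this PR description.

```python
from typing import Any, Optional

# Keys the description says were missing from the connection info.
EXTRA_KEYS = ("oom_count", "last_oom_at", "oom_container", "node_name", "disk_name")


def build_connection_info(item: dict[str, Any], include_extra: bool = True) -> dict[str, Any]:
    """Hypothetical stand-in for get_connection_info.

    include_extra=False imitates the old behaviour, where the OOM keys
    were never copied from the stored record.
    """
    info = {
        "reservation_id": item.get("reservation_id"),  # key name is an assumption
        "status": item.get("status"),
    }
    if include_extra:
        for key in EXTRA_KEYS:
            info[key] = item.get(key)
    return info


def oom_banner(info: dict[str, Any]) -> Optional[str]:
    """Return a banner string if any OOM was recorded, else None."""
    count = int(info.get("oom_count") or 0)
    if count <= 0:
        return None
    return (
        f"OOM detected {count} time(s); last at {info.get('last_oom_at')} "
        f"in container {info.get('oom_container')}"
    )
```

With `include_extra=False` the banner function always returns `None` regardless of what the record holds, which is the "dead code" situation the description refers to.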

Resolves the ezyang thread ("gpu-dev says active but I can't ssh / I don't have lambda logs access"): with #211 deployed, a dead pod flips to failed + reason within ~1 min, and gpu-dev debug shows it.

Tests: tests/unit/cli/test_debug.py (6 tests). Full suite green (1192 passed).

Note: this is the no-infra layer. Full raw lambda-log access is a separate follow-up (see discussion — leaning toward an on-demand CloudWatch Logs Insights query over S3 archival).
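
For readers unfamiliar with the follow-up option mentioned in the note, an on-demand CloudWatch Logs Insights query could look roughly like the sketch below. Nothing here is decided or implemented by this PR: the function name, the log group, the 24-hour window, and the idea of filtering on the reservation ID in `@message` are all assumptions. The returned dictionary uses the parameter names of the CloudWatch Logs `StartQuery` API.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


def build_insights_query(
    reservation_id: str,
    log_group: str,
    hours: int = 24,
    now: Optional[datetime] = None,
) -> dict[str, Any]:
    """Build StartQuery parameters for log lines mentioning one reservation.

    Hypothetical helper. Assumes reservation_id contains no double quotes.
    """
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    query = (
        "fields @timestamp, @message"
        f' | filter @message like "{reservation_id}"'
        " | sort @timestamp asc"
        " | limit 200"
    )
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": query,
    }
```

With boto3 the result would be passed as `boto3.client("logs").start_query(**params)`, after which the caller polls `get_query_results(queryId=...)` until the query completes.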

Users couldn't tell why a reservation died or how to recover without lambda/kubectl
access. 'gpu-dev debug [id]' renders, from DynamoDB only (no cluster access):
- failure reason (for ANY status, not just 'failed' like 'show')
- OOM events (count / last / container)
- full status-history timeline
- captured pod-logs snapshot
- recovery hints (cancel to free a dead box; disk unlock; re-reserve --disk)

Also adds oom_count/last_oom_at/oom_container/node_name/disk_name to
get_connection_info (the OOM banner in _show_single_reservation was dead code since
those keys were never populated). Tests: tests/unit/cli/test_debug.py (6).

@wdvr merged commit 3b862fb into main on Jun 16, 2026
3 checks passed
@wdvr deleted the feat/gpu-dev-debug-command branch on June 16, 2026 19:11