feat(cli): gpu-dev debug — self-serve reservation diagnostics by wdvr · Pull Request #216 · wdvr/osdc

wdvr · 2026-06-16T19:06:06Z

Adds gpu-dev debug [reservation_id] so users can diagnose why their own box died or won't connect — without lambda/CloudWatch or kubectl access.

Renders from DynamoDB (data the reservation + expiry lambdas already write):

- Why it ended: failure_reason for any status (the existing show only surfaces it on failed; an active-but-dead pod is exactly when you need it)
- OOM events: count / last time / container
- Status timeline: the full status_history
- Captured pod logs: the lambda's snapshot
- Recovery hints: gpu-dev cancel a dead active box to free it + the disk, gpu-dev disk unlock <name>, re-reserve --disk <name>
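
As a rough illustration of the sections listed above, the report could be assembled from a single reservation record along these lines. This is a minimal sketch, not the code in this PR: `render_debug_report`, the `pod_logs` key, and the `timestamp`/`status` shape of each `status_history` entry are assumptions made for the example. Only `failure_reason`, `oom_count`, `last_oom_at`, `oom_container`, and `status_history` are named in the description.

```python
from typing import Any


def render_debug_report(reservation: dict[str, Any]) -> str:
    """Build a plain-text diagnostic report from one reservation record.

    Hypothetical helper: the record is assumed to be a plain dict as read
    from DynamoDB; no cluster or CloudWatch access is needed.
    """
    status = reservation.get("status", "unknown")
    lines = [f"Reservation {reservation.get('reservation_id', '?')} (status: {status})"]

    # Show the failure reason for ANY status, not only 'failed'.
    reason = reservation.get("failure_reason")
    if reason:
        lines.append(f"Why it ended: {reason}")

    # DynamoDB numbers may arrive as Decimal; int() handles both.
    oom_count = int(reservation.get("oom_count", 0) or 0)
    if oom_count > 0:
        lines.append(
            f"OOM events: {oom_count} "
            f"(last: {reservation.get('last_oom_at', 'unknown')}, "
            f"container: {reservation.get('oom_container', 'unknown')})"
        )

    # Assumed entry shape: {"timestamp": ..., "status": ...}.
    history = reservation.get("status_history") or []
    if history:
        lines.append("Status timeline:")
        for entry in history:
            lines.append(f"  {entry.get('timestamp', '?')}  {entry.get('status', '?')}")

    # 'pod_logs' is an assumed key for the lambda's captured snapshot.
    logs = reservation.get("pod_logs")
    if logs:
        lines.append("Captured pod logs:")
        lines.extend(f"  {log_line}" for log_line in logs.splitlines())

    # An active reservation with a failure reason or OOMs is likely dead.
    if status == "active" and (reason or oom_count > 0):
        lines.append("Hint: run gpu-dev cancel to free the dead box and its disk.")

    return "\n".join(lines)
```

Rendering from the stored record alone is what keeps the command usable by people without lambda or kubectl access; the trade-off is that the report can only be as fresh as what the lambdas last wrote.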

Also fixes a latent bug: oom_count/last_oom_at/oom_container (plus node_name/disk_name) are now included in get_connection_info — the OOM banner in _show_single_reservation was dead code because those keys were never populated.
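
The failure mode described above can be reproduced with a toy model. The sketch below is an assumption-laden stand-in, not the repository's `get_connection_info` or `_show_single_reservation`: the function names, `BASE_KEYS`, and the `include_extra` switch are hypothetical, and the switch exists only to contrast the behaviour before and after the fix.

```python
from typing import Any

# Keys named in the PR description as newly included.
OOM_KEYS = ("oom_count", "last_oom_at", "oom_container")
EXTRA_KEYS = OOM_KEYS + ("node_name", "disk_name")

# Hypothetical subset of keys the function already returned before the fix.
BASE_KEYS = ("reservation_id", "status", "failure_reason")


def connection_info_sketch(item: dict[str, Any], include_extra: bool = True) -> dict[str, Any]:
    """Copy selected keys from a stored reservation item.

    include_extra=False models the pre-fix behaviour, where the OOM keys
    were never copied into the returned dict.
    """
    keys = BASE_KEYS + (EXTRA_KEYS if include_extra else ())
    return {key: item[key] for key in keys if key in item}


def should_show_oom_banner(info: dict[str, Any]) -> bool:
    """Stand-in for the banner check: it can only fire if oom_count is present."""
    return int(info.get("oom_count", 0) or 0) > 0


stored_item = {"reservation_id": "abc", "status": "active", "oom_count": 3}

before = connection_info_sketch(stored_item, include_extra=False)
after = connection_info_sketch(stored_item)
```

With this toy item, `should_show_oom_banner(before)` is false even though the stored record has three OOM events, while `should_show_oom_banner(after)` is true, which is the sense in which the banner was dead code.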

Resolves the ezyang thread ("gpu-dev says active but I can't ssh / I don't have lambda logs access"): with #211 deployed, a dead pod flips to failed + reason within ~1 min, and gpu-dev debug shows it.

Tests: tests/unit/cli/test_debug.py (6). Full suite green (1192 passed).

Note: this is the no-infra layer. Full raw lambda-log access is a separate follow-up (see discussion — leaning toward an on-demand CloudWatch Logs Insights query over S3 archival).

Users couldn't tell why a reservation died or how to recover without lambda/kubectl access.

'gpu-dev debug [id]' renders, from DynamoDB only (no cluster access):
- failure reason (for ANY status, not just 'failed' like 'show')
- OOM events (count / last / container)
- full status-history timeline
- captured pod-logs snapshot
- recovery hints (cancel to free a dead box; disk unlock; re-reserve --disk)

Also adds oom_count/last_oom_at/oom_container/node_name/disk_name to get_connection_info (the OOM banner in _show_single_reservation was dead code since those keys were never populated).

Tests: tests/unit/cli/test_debug.py (6).

wdvr merged commit 3b862fb into main Jun 16, 2026
3 checks passed

wdvr deleted the feat/gpu-dev-debug-command branch June 16, 2026 19:11

wdvr mentioned this pull request Jun 16, 2026

feat(debug): gpu-dev debug --logs (on-demand lambda logs) + expiry log retention #217

Merged

wdvr merged 1 commit into main from feat/gpu-dev-debug-command
