Pruned 2026-04-29. Pre-v0.3.0 sections (A-E, Open #-1..#6) were resolved by the v0.2/v0.3 lifecycle/manifest/upload PRs and removed. Items below are the ones still hitting us in real runs after the v0.3.0 release.
Fixed 2026-05-01 in commit 79e8adb (fix(poller): drain metrics + stdout before terminal-status return). The events-tail block now latches a
terminal_after_drain: Option<RunStatus> instead of returning; the
metrics + stdout tails always run, and the loop finalises (destroy +
update_run_status) only after drain. Same wiring covers the failed
branch. Regression test in crates/xrun-poller/tests/loop_progress.rs
asserts that a tick which sees done:ok and 3 metrics in the same window
lands all 3 in the store.
Fixed 2026-04-29 in loop_runner.rs:
progress_this_tick = truenow set on every heartbeat tick and on every stdout byte block — both resetlast_active_atviaupdate_instance_usage.- Idle-kill event message now includes
threshold=Nmin, last_active=HH:MMZ(oranchor=created_at HH:MMZwhen no activity was ever observed).
Remaining open sub-item:
2. Anchor on FIXED 2026-05-01 in train_start event timestamp, not created_at, when no
metric activity has been observed yet.budget::idle_anchor:
priority is last_active_at → train_started_at → created_at. Poller
caches the ts and rehydrates from the events table on restart.
On 01KQCT7FP8FS68478Y5K4160VE, host OOM killer SIGKILL'd the python child
of bash launch_*.sh. The actual failure is stage failure but the poller
saw bash+sshd still alive and fell through to the idle timer.
Fixed 2026-04-29:
- When
on_stage_failed: stop_instancefires, the poller now emits an explicitinstance.auto_destroyedevent withauto-destroyed: stage_failed (on_stage_failed=stop_instance)before destroying. policy.on_idle_minutesfrom manifest already applied toidle_timeout_secscaps (confirmed inlaunch.rs); was not a real bug.
Still open:
Poll the actual launched PIDFIXED 2026-05-01: vast'sbuild_launch_commandnow writes$!to/workspace/run/run.pid. The poller calls a newVendorAdapter::process_alive(handle)method each tick and, oncetrain_starthas fired, marks the run failed when the PID is gone but nodone:okwas emitted. Syntheticstage_failed:failevent records the cause. Same wiring onxrun-ssh. Regression tests incrates/xrun-poller/tests/loop_progress.rs::test_poller_pid_dead_*.
Fixed 2026-04-29 in run_detail.py: _load_logs now only calls log.clear()
when services.get_logs() returns non-empty content. An empty result (SSH
gone after instance destroyed) preserves the last displayed snapshot instead
of wiping the panel.
FIXED 2026-05-01 in docs/MANIFEST.md — expanded the existing exclude
section with tar --exclude framing, common-mistake mappings, and a
local-verify recipe. The 6 GB incident is documented as the cautionary
tale.
Still open as a bonus ask (separate from the doc fix):
xrun doctor --manifest <path> --dry-run-uploadmode that prints the list of files that would be uploaded (or top 20 by size). Lets the user verify the pattern before paying for the bytes.
Auto-install pigz in setup whenFIXED 2026-05-01 incompress: gzipis selected on a remote that doesn't have it.xrun-vast/upload.rs: before the gzip tar pipe runs we issuecommand -v pigz || apt-get install -y pigzonce over SSH. Idempotent; no-op on apt-less base images.- Overlap upload+extract: today
upload: okonly fires after extract completes. Stream pipe should let them run concurrently. - Cost / time accounting: with heartbeat metrics now in DB, dashboard
could show
idle_time_$ / total_$(provision+upload spend vs train spend). On 2026-04-29 a 30-min training cycle had 8% overhead from provision+upload time at $0.15/hr — worth surfacing.