fix: flush tunnel logger before agent shutdown#762
Conversation
The agent-side tunnelLogger forwards log lines over the tunnel asynchronously via a background worker. On shutdown (e.g. a failing lifecycle hook), the agent process exited and cancelled the context before the worker drained the channel, silently dropping the last queued line(s). Add a Flush() that drains the channel and is deferred in the container setup Run path, send with a cancellation-detached context, and flush before Fatal exits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThe PR fixes dropped final log lines during lifecycle hook failures by introducing a queued, flushable logger. The tunnel logger now routes messages through a buffered channel, implements a ChangesTunnel Logger Flush Mechanism
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the inline anonymous interface{ Flush() } assertion with a named
Flusher interface, plus a compile-time assertion that *tunnelLogger
implements it, so the contract is explicit and rename-safe.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f3d7db0 to
7e0944a
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/agent/tunnelserver/logger.go`:
- Line 30: The deferred Flush currently waits up to 5s per queued logEntry in
handle(), which with logChan (buffer size 1000) can hang shutdown for ~5000s;
change Flush/handle() to use a single overall deadline (e.g., create a context
with timeout or a single timer before draining) and loop draining logChan until
either the channel is empty or the overall deadline expires, then discard
remaining entries; update the logic referenced by logChan (buffer
initialization) and the drain/flush code in handle() and the Flush
implementation (also the code around lines 345-346) to stop per-entry timeouts
and enforce one capped timeout for the entire drain operation.
- Around line 86-97: Flush currently aborts when s.ctx is canceled because it
bails out of sending the logEntry flush marker and waiting for ack via
s.ctx.Done(); change Flush in tunnelLogger so it always enqueues a flush marker
and waits for the worker to acknowledge it even if s.ctx is canceled.
Concretely: stop using s.ctx.Done() to short-circuit the send and the ack wait;
instead ensure the flush marker (logEntry{flush: ack}) is sent (block until
successful or until an independent timeout) and then wait on ack or on a
dedicated worker-completion channel (e.g., a workerDone or closed done channel
set when the logger worker exits) or an explicit deadline; update Flush to wait
on ack || workerDone || timeout so deferred calls in SetupContainerCmd.Run and
Fatal/Fatalf truly block until queued lines are processed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7fffa07f-3b81-401c-aaf4-357e28973409
📒 Files selected for processing (2)
cmd/agent/container/setup.gopkg/agent/tunnelserver/logger.go
|
A test added to |
Add an up-handle-errors e2e spec: a postCreateCommand prints a unique marker to stderr then exits non-zero. The agent forwards lifecycle hook output over the tunnel asynchronously, so on failure it must flush the final queued line before tearing down. Asserts devpod up fails and the marker still appears in the captured log output. Regression guard for the bug fixed in this PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
e2e/tests/up/handles_errors.go (1)
89-95: ⚡ Quick winTighten the assertion to the JSON stderr stream.
Right now this still passes if the marker shows up as plain text or gets rerouted to
stdout, even though the contract under test is--log-output=jsonpreserving the final lifecycle stderr line. Validate the JSON stream here and searchstderrdirectly.Proposed change
stdout, stderr, err := f.DevPodUpStreams(ctx, tempDir, "--log-output=json") framework.ExpectError(err, "expected lifecycle hook failure") + framework.ExpectNoError(verifyLogStream(strings.NewReader(stdout))) + framework.ExpectNoError(verifyLogStream(strings.NewReader(stderr))) framework.ExpectNoError( findMessage( - strings.NewReader(stdout+stderr), + strings.NewReader(stderr), "DEVPOD_LIFECYCLE_FLUSH_MARKER", ), )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/tests/up/handles_errors.go` around lines 89 - 95, The test currently concatenates stdout+stderr and searches for the DEVPOD_LIFECYCLE_FLUSH_MARKER, which can falsely pass if the marker appears in stdout; instead call f.DevPodUpStreams(...) as before but assert against stderr only and validate it as JSON (e.g., parse/ensure it's valid JSON and contains the expected final lifecycle stderr line) when using "--log-output=json"; update the call sites using findMessage to read from strings.NewReader(stderr) (or a JSON-aware helper) and replace the current ExpectNoError(findMessage(strings.NewReader(stdout+stderr), "DEVPOD_LIFECYCLE_FLUSH_MARKER")) with an assertion that parses stderr JSON and verifies the marker is present in the stderr log entry produced by DevPodUpStreams.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@e2e/tests/up/handles_errors.go`:
- Around line 89-95: The test currently concatenates stdout+stderr and searches
for the DEVPOD_LIFECYCLE_FLUSH_MARKER, which can falsely pass if the marker
appears in stdout; instead call f.DevPodUpStreams(...) as before but assert
against stderr only and validate it as JSON (e.g., parse/ensure it's valid JSON
and contains the expected final lifecycle stderr line) when using
"--log-output=json"; update the call sites using findMessage to read from
strings.NewReader(stderr) (or a JSON-aware helper) and replace the current
ExpectNoError(findMessage(strings.NewReader(stdout+stderr),
"DEVPOD_LIFECYCLE_FLUSH_MARKER")) with an assertion that parses stderr JSON and
verifies the marker is present in the stderr log entry produced by
DevPodUpStreams.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0424563a-a1f0-4746-93b8-d6ac89dd9b7a
📒 Files selected for processing (2)
e2e/tests/up/handles_errors.goe2e/tests/up/testdata/docker-lifecycle-stderr/.devcontainer.json
✅ Files skipped from review due to trivial changes (1)
- e2e/tests/up/testdata/docker-lifecycle-stderr/.devcontainer.json
A lone final log line drains before teardown (the async worker has ~ms to send it), so the original single-line fixture passed even without the flush fix. Emit a burst of stderr lines ending with the marker so the sender is still busy when the hook fails, making the dropped-last-line reliably reproduce without the fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
devpod's "failed to run <command>" error echoes the command verbatim, so an inline marker matched the error text even when the hook's output line was actually dropped (false positive). Move the burst+marker into .devcontainer/postCreateCommand.sh so the command echo contains only the script path; the marker can now only match a genuinely forwarded line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Added a regression test under
EvidenceWithout the fix (agent's (only 4 of 300 burst lines were delivered before teardown; the marker was dropped) With the fix (this PR) — spec PASSES: |
Flush previously short-circuited on s.ctx.Done(), so once the context was cancelled it could skip enqueueing the flush marker or return before the worker drained it — defeating the barrier on exactly the shutdown path it guards (deferred flush in SetupContainerCmd.Run, Fatal/Fatalf). Add a done channel closed when worker() exits and wait on that instead. This keeps Flush a real barrier under cancellation while still avoiding a deadlock when the worker has already stopped.
handle() applies a 5s timeout per message, and Flush waits for the whole queue to drain, so a stalled-but-open tunnel with a full buffer could block shutdown for buffer-size × 5s (~5000s). Cap the entire flush with one 30s deadline; the worker (and the process) tear down afterward.
Fixes #763
Problem
The last stdout/stderr line of a lifecycle hook (e.g. a failing
postStartCommand.sh) is missing from thedevpod upoutput.The agent-side
tunnelLoggerforwards log lines over the tunnel asynchronously via a background worker goroutine. On shutdown — such as when a lifecycle hook exits non-zero — the agent process returns the error and the context is cancelled before the worker drains the channel, so the last queued line(s) are silently dropped (_, _ = s.client.Log(...)).Reproduce
The
test-devpod-stderrbranch has apostStartCommand.shthat prints two stdout/stderr lines, sleeps, prints two more, thenexit 10:Actual (before fix): the final stderr line
TEST STDERR MESSAGE then exit immediatelynever appears — the output jumps from the stdout line straight to theexit status 10failure:Expected (after fix): the final stderr line is forwarded before teardown:
(The
exit status 10failure itself is expected — the branch exists solely to demonstrate the dropped last line.)Fix
Flush()method that drains the channel and blocks until queued messages are sent.logger.Flush()in the container setupRunpath (runs whilectxis still alive, before teardown).context.WithoutCancel) so in-flight messages aren't dropped at shutdown.Fatal/Fatalfcallos.Exit.enqueue()helper to cut duplication.Test plan
Reproduce against the
test-devpod-stderrbranch with a locally-built CLI and agent.Because a dev build reports version
v0.0.0, DevPod injects the agent over stdin via theBinaryManager, which checks${TMPDIR}/devpod-cache/devpod-linux-<arch>(FileCacheSource) before falling back to the GitHub release. Dropping the local build there makes DevPod inject it instead of the published binary — no HTTP server or release needed.TEST STDERR MESSAGE then exit immediatelyline is missing.warn TEST STDERR MESSAGE then exit immediately) before theexit status 10failure.🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests