Skip to content

fix: flush tunnel logger before agent shutdown#762

Merged
skevetter merged 9 commits into
skevetter:mainfrom
terenho-jobber:fix/tunnel-logger-flush-on-shutdown
Jun 3, 2026
Merged

fix: flush tunnel logger before agent shutdown#762
skevetter merged 9 commits into
skevetter:mainfrom
terenho-jobber:fix/tunnel-logger-flush-on-shutdown

Conversation

@terenho-jobber

@terenho-jobber terenho-jobber commented Jun 2, 2026

Copy link
Copy Markdown

Fixes #763

Problem

The last stdout/stderr line of a lifecycle hook (e.g. a failing postStartCommand.sh) is missing from the devpod up output.

The agent-side tunnelLogger forwards log lines over the tunnel asynchronously via a background worker goroutine. On shutdown — such as when a lifecycle hook exits non-zero — the agent process returns the error and the context is cancelled before the worker drains the channel, so the last queued line(s) are silently dropped (_, _ = s.client.Log(...)).

Reproduce

devpod up github.com/terenho-jobber/vscode-remote-try-python@test-devpod-stderr

The test-devpod-stderr branch has a postStartCommand.sh that prints two stdout/stderr lines, sleeps, prints two more, then exit 10:

echo "TEST STDOUT MESSAGE"
echo "TEST STDERR MESSAGE" >&2
sleep 20
echo "TEST STDOUT MESSAGE then exit immediately"
echo "TEST STDERR MESSAGE then exit immediately" >&2   # <-- last line, dropped
exit 10

Actual (before fix): the final stderr line TEST STDERR MESSAGE then exit immediately never appears — the output jumps from the stdout line straight to the exit status 10 failure:

info  running postStartCommands lifecycle hook:  /bin/bash .devcontainer/postStartCommand.sh
warn  TEST STDERR MESSAGE
info  TEST STDOUT MESSAGE
info  TEST STDOUT MESSAGE then exit immediately
error Try using the --debug flag to see a more verbose output
fatal run agent command failed: exit status 1: Process exited with status 1
lifecycle hooks: failed to run: /bin/bash .devcontainer/postStartCommand.sh, error: exit status 10

Expected (after fix): the final stderr line is forwarded before teardown:

info  running postStartCommands lifecycle hook:  /bin/bash .devcontainer/postStartCommand.sh
warn  TEST STDERR MESSAGE
info  TEST STDOUT MESSAGE
info  TEST STDOUT MESSAGE then exit immediately
warn  TEST STDERR MESSAGE then exit immediately
error Try using the --debug flag to see a more verbose output
fatal run agent command failed: exit status 1: Process exited with status 1
lifecycle hooks: failed to run: /bin/bash .devcontainer/postStartCommand.sh, error: exit status 10

(The exit status 10 failure itself is expected — the branch exists solely to demonstrate the dropped last line.)

Fix

  • Add a Flush() method that drains the channel and blocks until queued messages are sent.
  • Defer logger.Flush() in the container setup Run path (runs while ctx is still alive, before teardown).
  • Send with a cancellation-detached context (context.WithoutCancel) so in-flight messages aren't dropped at shutdown.
  • Flush before Fatal/Fatalf call os.Exit.
  • Route all level methods through a small enqueue() helper to cut duplication.

Test plan

Reproduce against the test-devpod-stderr branch with a locally-built CLI and agent.

Because a dev build reports version v0.0.0, DevPod injects the agent over stdin via the BinaryManager, which checks ${TMPDIR}/devpod-cache/devpod-linux-<arch> (FileCacheSource) before falling back to the GitHub release. Dropping the local build there makes DevPod inject it instead of the published binary — no HTTP server or release needed.

# 1. Build the local CLI
go build -o devpod .

# 2. Build the agent for the container's OS/arch (arm64 shown; use amd64 on Intel/most Linux hosts)
DEST="${TMPDIR:-/tmp}/devpod-cache"
mkdir -p "$DEST"
GOOS=linux GOARCH=arm64 CGO_ENABLED=0 go build -o "$DEST/devpod-linux-arm64" .

# 3. Recreate the workspace so the agent is re-injected and postStartCommand re-fires
./devpod up github.com/terenho-jobber/vscode-remote-try-python@test-devpod-stderr \
  --recreate --ide none
  • Before fix: the final TEST STDERR MESSAGE then exit immediately line is missing.
  • After fix: that line appears (warn TEST STDERR MESSAGE then exit immediately) before the exit status 10 failure.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability of log delivery during shutdown so asynchronously forwarded tunnel logs are flushed to the client before connections close, preventing dropped final log lines.
  • Tests

    • Added an end-to-end test that exercises a failing provider lifecycle hook and verifies queued logs are flushed before teardown.
    • Added a test fixture (devcontainer and script) that reproduces and verifies the flush behavior.

The agent-side tunnelLogger forwards log lines over the tunnel
asynchronously via a background worker. On shutdown (e.g. a failing
lifecycle hook), the agent process exited and cancelled the context
before the worker drained the channel, silently dropping the last
queued line(s).

Add a Flush() that drains the channel and is deferred in the container
setup Run path, send with a cancellation-detached context, and flush
before Fatal exits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6969ba27-1c4f-48ee-bb1e-0f55bc6450f9

📥 Commits

Reviewing files that changed from the base of the PR and between 1672e8b and d25ac66.

📒 Files selected for processing (2)
  • e2e/tests/up/testdata/docker-lifecycle-stderr/.devcontainer/postCreateCommand.sh
  • pkg/agent/tunnelserver/logger.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • e2e/tests/up/testdata/docker-lifecycle-stderr/.devcontainer/postCreateCommand.sh
  • pkg/agent/tunnelserver/logger.go

📝 Walkthrough

Walkthrough

The PR fixes dropped final log lines during lifecycle hook failures by introducing a queued, flushable logger. The tunnel logger now routes messages through a buffered channel, implements a Flusher interface to drain queued entries, and sends messages with a cancellation-detached context. Container setup defers Flush() before teardown, and a regression test validates the fix.

Changes

Tunnel Logger Flush Mechanism

Layer / File(s) Summary
Logger buffering and flush infrastructure
pkg/agent/tunnelserver/logger.go
Introduces buffered logEntry queue replacing direct tunnel.LogMessage writes; worker now handles log entries and flush acknowledgements with selective draining on context cancellation; exports Flusher interface and implements Flush() that blocks until queued entries are processed; new internal handle() sends messages via client.Log using a detached context with 5s timeout.
Logging methods enqueue integration
pkg/agent/tunnelserver/logger.go
All logging methods (Debug/Info/Warn/Error/Done/Fatal and f variants) now use internal enqueue() helper instead of direct channel writes; Fatal/Fatalf explicitly call Flush() before os.Exit(1); Done/Donef route through enqueue with DONE level.
Container setup flush deferral
cmd/agent/container/setup.go
SetupContainerCmd.Run conditionally defers Flush() when tunnel logger implements Flusher, ensuring queued messages reach the client before tunnel connection teardown.
E2E regression test and fixture
e2e/tests/up/handles_errors.go, e2e/tests/up/testdata/docker-lifecycle-stderr/.devcontainer/devcontainer.json, e2e/tests/up/testdata/docker-lifecycle-stderr/.devcontainer/postCreateCommand.sh
Ginkgo test exercises Docker provider workspace failing during lifecycle hook; provisioned workspace runs a fixture that emits stderr lines and exits non-zero; test asserts DEVPOD_LIFECYCLE_FLUSH_MARKER presence in combined output before cleanup, validating that final output is flushed before shutdown.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • skevetter/devpod#389: Both PRs modify cmd/agent/container/setup.go's tunnel setup/log handling flow and the integrated deferred flush relies on tunnel initialization patterns introduced nearby.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: flush tunnel logger before agent shutdown' clearly summarizes the main change: adding a flush mechanism to drain the tunnel logger before shutdown to prevent dropped log messages.
Linked Issues check ✅ Passed All coding requirements from issue #763 are met: Flush() method drains the logger channel, setup.go defers Flush() while context is alive, log sends use context.WithoutCancel to avoid cancellation drops, and a reproducible e2e test validates the fix.
Out of Scope Changes check ✅ Passed All changes directly address the root cause in issue #763: buffering via logChan with Flusher interface, context-detached sends, deferred flush on shutdown, and a regression test—no unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the size/m label Jun 2, 2026
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the inline anonymous interface{ Flush() } assertion with a named
Flusher interface, plus a compile-time assertion that *tunnelLogger
implements it, so the contract is explicit and rename-safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@terenho-jobber terenho-jobber force-pushed the fix/tunnel-logger-flush-on-shutdown branch from f3d7db0 to 7e0944a Compare June 2, 2026 04:18
@terenho-jobber terenho-jobber marked this pull request as ready for review June 2, 2026 04:42

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/agent/tunnelserver/logger.go`:
- Line 30: The deferred Flush currently waits up to 5s per queued logEntry in
handle(), which with logChan (buffer size 1000) can hang shutdown for ~5000s;
change Flush/handle() to use a single overall deadline (e.g., create a context
with timeout or a single timer before draining) and loop draining logChan until
either the channel is empty or the overall deadline expires, then discard
remaining entries; update the logic referenced by logChan (buffer
initialization) and the drain/flush code in handle() and the Flush
implementation (also the code around lines 345-346) to stop per-entry timeouts
and enforce one capped timeout for the entire drain operation.
- Around line 86-97: Flush currently aborts when s.ctx is canceled because it
bails out of sending the logEntry flush marker and waiting for ack via
s.ctx.Done(); change Flush in tunnelLogger so it always enqueues a flush marker
and waits for the worker to acknowledge it even if s.ctx is canceled.
Concretely: stop using s.ctx.Done() to short-circuit the send and the ack wait;
instead ensure the flush marker (logEntry{flush: ack}) is sent (block until
successful or until an independent timeout) and then wait on ack or on a
dedicated worker-completion channel (e.g., a workerDone or closed done channel
set when the logger worker exits) or an explicit deadline; update Flush to wait
on ack || workerDone || timeout so deferred calls in SetupContainerCmd.Run and
Fatal/Fatalf truly block until queued lines are processed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7fffa07f-3b81-401c-aaf4-357e28973409

📥 Commits

Reviewing files that changed from the base of the PR and between 51a9828 and 7e0944a.

📒 Files selected for processing (2)
  • cmd/agent/container/setup.go
  • pkg/agent/tunnelserver/logger.go

Comment thread pkg/agent/tunnelserver/logger.go
Comment thread pkg/agent/tunnelserver/logger.go
@skevetter

Copy link
Copy Markdown
Owner

A test added to e2e/ capturing the issue would be helpful for detecting regressions.

Add an up-handle-errors e2e spec: a postCreateCommand prints a unique
marker to stderr then exits non-zero. The agent forwards lifecycle hook
output over the tunnel asynchronously, so on failure it must flush the
final queued line before tearing down. Asserts devpod up fails and the
marker still appears in the captured log output. Regression guard for
the bug fixed in this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@terenho-jobber terenho-jobber marked this pull request as draft June 2, 2026 16:48

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
e2e/tests/up/handles_errors.go (1)

89-95: ⚡ Quick win

Tighten the assertion to the JSON stderr stream.

Right now this still passes if the marker shows up as plain text or gets rerouted to stdout, even though the contract under test is --log-output=json preserving the final lifecycle stderr line. Validate the JSON stream here and search stderr directly.

Proposed change
 				stdout, stderr, err := f.DevPodUpStreams(ctx, tempDir, "--log-output=json")
 				framework.ExpectError(err, "expected lifecycle hook failure")
+				framework.ExpectNoError(verifyLogStream(strings.NewReader(stdout)))
+				framework.ExpectNoError(verifyLogStream(strings.NewReader(stderr)))
 				framework.ExpectNoError(
 					findMessage(
-						strings.NewReader(stdout+stderr),
+						strings.NewReader(stderr),
 						"DEVPOD_LIFECYCLE_FLUSH_MARKER",
 					),
 				)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/tests/up/handles_errors.go` around lines 89 - 95, The test currently
concatenates stdout+stderr and searches for the DEVPOD_LIFECYCLE_FLUSH_MARKER,
which can falsely pass if the marker appears in stdout; instead call
f.DevPodUpStreams(...) as before but assert against stderr only and validate it
as JSON (e.g., parse/ensure it's valid JSON and contains the expected final
lifecycle stderr line) when using "--log-output=json"; update the call sites
using findMessage to read from strings.NewReader(stderr) (or a JSON-aware
helper) and replace the current
ExpectNoError(findMessage(strings.NewReader(stdout+stderr),
"DEVPOD_LIFECYCLE_FLUSH_MARKER")) with an assertion that parses stderr JSON and
verifies the marker is present in the stderr log entry produced by
DevPodUpStreams.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@e2e/tests/up/handles_errors.go`:
- Around line 89-95: The test currently concatenates stdout+stderr and searches
for the DEVPOD_LIFECYCLE_FLUSH_MARKER, which can falsely pass if the marker
appears in stdout; instead call f.DevPodUpStreams(...) as before but assert
against stderr only and validate it as JSON (e.g., parse/ensure it's valid JSON
and contains the expected final lifecycle stderr line) when using
"--log-output=json"; update the call sites using findMessage to read from
strings.NewReader(stderr) (or a JSON-aware helper) and replace the current
ExpectNoError(findMessage(strings.NewReader(stdout+stderr),
"DEVPOD_LIFECYCLE_FLUSH_MARKER")) with an assertion that parses stderr JSON and
verifies the marker is present in the stderr log entry produced by
DevPodUpStreams.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0424563a-a1f0-4746-93b8-d6ac89dd9b7a

📥 Commits

Reviewing files that changed from the base of the PR and between 7e0944a and b2a189a.

📒 Files selected for processing (2)
  • e2e/tests/up/handles_errors.go
  • e2e/tests/up/testdata/docker-lifecycle-stderr/.devcontainer.json
✅ Files skipped from review due to trivial changes (1)
  • e2e/tests/up/testdata/docker-lifecycle-stderr/.devcontainer.json

terenho-jobber and others added 2 commits June 2, 2026 11:07
A lone final log line drains before teardown (the async worker has
~ms to send it), so the original single-line fixture passed even
without the flush fix. Emit a burst of stderr lines ending with the
marker so the sender is still busy when the hook fails, making the
dropped-last-line reliably reproduce without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
devpod's "failed to run <command>" error echoes the command verbatim,
so an inline marker matched the error text even when the hook's output
line was actually dropped (false positive). Move the burst+marker into
.devcontainer/postCreateCommand.sh so the command echo contains only the
script path; the marker can now only match a genuinely forwarded line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@terenho-jobber

terenho-jobber commented Jun 2, 2026

Copy link
Copy Markdown
Author

Added a regression test under e2e/tests/up (label up-handle-errors):

  • e2e/tests/up/testdata/docker-lifecycle-stderr/.devcontainer/ — a postCreateCommand script that prints a burst of stderr lines ending with a marker, then exit 1.
  • New spec in e2e/tests/up/handles_errors.go: runs devpod up --log-output=json, expects failure, and asserts the marker still appears in the forwarded output.

Evidence

Without the fix (agent's logger.go/setup.go reverted to main, test kept) — spec FAILS:
Test up-handle-errors run

[FAILED] forwards the final lifecycle hook log line before the agent exits [up-handle-errors]
  couldn't find message 'DEVPOD_LIFECYCLE_FLUSH_MARKER' in log
Ran 9 of 132 Specs — FAIL! -- 8 Passed | 1 Failed

(only 4 of 300 burst lines were delivered before teardown; the marker was dropped)

With the fix (this PR) — spec PASSES:
Test up-handle-errors run

forwards the final lifecycle hook log line before the agent exits [up-handle-errors]
Ran 9 of 132 Specs — SUCCESS! -- 9 Passed | 0 Failed

@terenho-jobber terenho-jobber marked this pull request as ready for review June 2, 2026 19:45
@terenho-jobber terenho-jobber requested a review from skevetter June 2, 2026 19:45
@terenho-jobber terenho-jobber marked this pull request as draft June 2, 2026 20:08
Flush previously short-circuited on s.ctx.Done(), so once the context
was cancelled it could skip enqueueing the flush marker or return before
the worker drained it — defeating the barrier on exactly the shutdown
path it guards (deferred flush in SetupContainerCmd.Run, Fatal/Fatalf).

Add a done channel closed when worker() exits and wait on that instead.
This keeps Flush a real barrier under cancellation while still avoiding
a deadlock when the worker has already stopped.
handle() applies a 5s timeout per message, and Flush waits for the whole
queue to drain, so a stalled-but-open tunnel with a full buffer could
block shutdown for buffer-size × 5s (~5000s). Cap the entire flush with
one 30s deadline; the worker (and the process) tear down afterward.
@terenho-jobber terenho-jobber marked this pull request as ready for review June 2, 2026 20:49
@skevetter skevetter merged commit 9ca6a35 into skevetter:main Jun 3, 2026
41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Last stdout/stderr line of a lifecycle hook is dropped from devpod up output

2 participants