Follow-up hardening surfaced while shipping the #749 supervisor-readiness fixes. None of these are required for the cold-boot mention-loop fix (that landed in #755 and #761), but each closes a residual failure or test gap. Grouping them so they can be picked up together or à la carte.
Attach-path hang windows (same class as the readiness-probe bug)
The #749 fix added a per-probe timeout to the workspace readiness probe because a fetch that is accepted but never answered hangs forever. The same pattern exists on two other paths:
Gateway SDK calls don't carry the run abort signal. packages/gateway/src/execute/run-core.ts receives a signal (used only for event-stream iteration), but client.session.create(), client.event.subscribe(), and client.session.promptAsync() are called without it. If the workspace proxy accepts the connection but stalls, a mention run can hang before the iteration loop ever observes signal.aborted. Thread the run signal (and/or a bounded AbortSignal.timeout) into those calls.
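A minimal sketch of one way to bound those calls, assuming nothing about the SDK beyond "it returns a promise". The helper names `deriveCallSignal` and `boundedCall`, and the idea of passing the derived signal into the SDK call, are hypothetical. Whether `client.session.create()` and friends accept a signal option is not confirmed here, so the sketch also races the call against the signal so the caller stops waiting either way.

```typescript
type DerivedSignal = { signal: AbortSignal; dispose: () => void };

// Derive a per-call signal that aborts when the run signal aborts
// or when timeoutMs elapses, whichever happens first.
function deriveCallSignal(runSignal: AbortSignal, timeoutMs: number): DerivedSignal {
  const controller = new AbortController();
  const onRunAbort = () => controller.abort(runSignal.reason);
  const timer = setTimeout(
    () => controller.abort(new Error(`call exceeded ${timeoutMs}ms`)),
    timeoutMs,
  );
  // dispose() releases the timer and the listener once the call has settled.
  const dispose = () => {
    clearTimeout(timer);
    runSignal.removeEventListener("abort", onRunAbort);
  };
  if (runSignal.aborted) {
    onRunAbort();
    dispose();
  } else {
    runSignal.addEventListener("abort", onRunAbort, { once: true });
  }
  return { signal: controller.signal, dispose };
}

// Run one SDK call under the derived signal. The call receives the signal
// (useful if the SDK can forward it to fetch), and the race guarantees the
// caller is released even if the SDK ignores it.
async function boundedCall<T>(
  call: (signal: AbortSignal) => Promise<T>,
  runSignal: AbortSignal,
  timeoutMs: number,
): Promise<T> {
  const { signal, dispose } = deriveCallSignal(runSignal, timeoutMs);
  try {
    return await Promise.race([
      call(signal),
      new Promise<never>((_, reject) => {
        if (signal.aborted) reject(signal.reason);
        else signal.addEventListener("abort", () => reject(signal.reason), { once: true });
      }),
    ]);
  } finally {
    dispose();
  }
}

// Hypothetical usage inside run-core.ts (the option shape is an assumption):
//   const session = await boundedCall((s) => client.session.create({ signal: s }), signal, 10_000);
```

Note that racing only releases the caller; the underlying request is cancelled only if the SDK actually honors the signal it is given.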
The workspace OpenCode proxy has no upstream request timeout. apps/workspace-agent/src/opencode-proxy.ts forwards via http.request() and only handles the error event, so a stalled upstream leaves the gateway call hanging. Add an upstream timeout for non-streaming requests. Careful: the proxy also carries the SSE event stream, so a short total timeout must NOT be applied to event streams, only to ordinary request/response calls.
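The shape of such a timeout could look roughly like the sketch below. It is not the actual opencode-proxy.ts: the `forward` function, the `UPSTREAM_IDLE_TIMEOUT_MS` value, and detecting SSE through the request's `Accept` header are all assumptions made for illustration. Note that `request.setTimeout()` in Node is a socket inactivity timeout rather than a total deadline, which is enough to catch an upstream that accepts and then goes silent.

```typescript
import * as http from "node:http";

// Hypothetical value; the real proxy would likely make this configurable.
const UPSTREAM_IDLE_TIMEOUT_MS = 10_000;

// Assumption: SSE subscribers announce themselves via the Accept header.
// A path-based check would work equally well if the event route is fixed.
function wantsEventStream(headers: http.IncomingHttpHeaders): boolean {
  return (headers.accept ?? "").includes("text/event-stream");
}

function forward(
  clientReq: http.IncomingMessage,
  clientRes: http.ServerResponse,
  upstream: { host: string; port: number },
): void {
  let timedOut = false;
  const upstreamReq = http.request(
    {
      host: upstream.host,
      port: upstream.port,
      method: clientReq.method,
      path: clientReq.url,
      headers: clientReq.headers,
    },
    (upstreamRes) => {
      clientRes.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
      upstreamRes.pipe(clientRes);
    },
  );

  // Only ordinary request/response calls get the timeout; event streams
  // are long-lived and may legitimately stay quiet.
  if (!wantsEventStream(clientReq.headers)) {
    upstreamReq.setTimeout(UPSTREAM_IDLE_TIMEOUT_MS, () => {
      timedOut = true;
      upstreamReq.destroy(new Error("upstream timed out"));
    });
  }

  upstreamReq.on("error", () => {
    if (clientRes.headersSent) {
      clientRes.destroy(); // response already started; just tear it down
      return;
    }
    clientRes.writeHead(timedOut ? 504 : 502);
    clientRes.end();
  });

  clientReq.pipe(upstreamReq);
}
```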
The overall readiness timeout is not a hard cap. A single probe can overshoot the configured WORKSPACE_OPENCODE_READY_TIMEOUT_MS by up to one per-probe timeout (~3s). Immaterial at the 60s default; only matters if an operator sets a very low overall timeout. Cap each probe to the remaining deadline if a hard bound is wanted.
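Capping each probe to the remaining deadline comes down to a `min` against the time left. The sketch below assumes a probe function that takes its own timeout; `nextProbeTimeoutMs`, `waitUntilReady`, the 250 ms retry delay, and the injected clock are hypothetical names and values, not the supervisor's actual code.

```typescript
// Budget for the next probe: never more than the per-probe timeout,
// never more than what is left of the overall deadline, never negative.
function nextProbeTimeoutMs(
  deadlineAtMs: number,
  nowMs: number,
  perProbeTimeoutMs: number,
): number {
  return Math.max(0, Math.min(perProbeTimeoutMs, deadlineAtMs - nowMs));
}

// Poll until ready or until the overall timeout is spent. With the cap in
// place, total wall time cannot exceed overallTimeoutMs by a probe's length.
async function waitUntilReady(
  probe: (timeoutMs: number) => Promise<boolean>,
  overallTimeoutMs: number,
  perProbeTimeoutMs = 3_000,
  retryDelayMs = 250,
  now: () => number = Date.now,
): Promise<boolean> {
  const deadlineAtMs = now() + overallTimeoutMs;
  for (;;) {
    const budgetMs = nextProbeTimeoutMs(deadlineAtMs, now(), perProbeTimeoutMs);
    if (budgetMs === 0) return false;
    // A probe that throws counts as "not ready yet".
    if (await probe(budgetMs).catch(() => false)) return true;
    const remainingMs = deadlineAtMs - now();
    if (remainingMs <= 0) return false;
    await new Promise((resolve) => setTimeout(resolve, Math.min(retryDelayMs, remainingMs)));
  }
}
```

With a 60 s overall timeout and 58.5 s already elapsed, the next probe gets 1.5 s instead of the full 3 s.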
Test gaps
No cross-package type-mirror test. apps/workspace-agent/src/types.ts ReadyzResponse (flat) and packages/gateway/src/workspace-api/types.ts (discriminated union) are kept wire-compatible by hand. A compile-time equivalence/assignability test would catch future drift before runtime.
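As an illustration of what such a compile-time check might look like: the two type shapes below are invented stand-ins, since the real ReadyzResponse fields are not shown here, and `AgentReadyz`, `GatewayReadyz`, and the helper types are hypothetical names. A flat type is generally not assignable to a discriminated union, so the sketch checks the union-to-flat direction plus key parity.

```typescript
// Hypothetical stand-in for the agent's flat response type.
type AgentReadyz = {
  ready: boolean;
  status: "starting" | "ready" | "failed";
  error?: string;
};

// Hypothetical stand-in for the gateway's discriminated union.
type GatewayReadyz =
  | { ready: true; status: "ready" }
  | { ready: false; status: "starting" | "failed"; error?: string };

// Type-level helpers: these evaluate to the literal type true or false.
type IsAssignable<From, To> = [From] extends [To] ? true : false;
type SameUnion<A, B> = [A] extends [B] ? ([B] extends [A] ? true : false) : false;
// Distribute keyof over union members so every member's keys are counted.
type KeysOfUnion<T> = T extends unknown ? keyof T : never;

type GatewayFitsAgent = IsAssignable<GatewayReadyz, AgentReadyz>;
type SameKeys = SameUnion<keyof AgentReadyz, KeysOfUnion<GatewayReadyz>>;

// If either mirror drifts, one of these resolves to false and the
// assignment below stops compiling.
const typeChecks: [GatewayFitsAgent, SameKeys] = [true, true];

// A second guard: this identity function only compiles while every
// gateway variant is a valid agent response.
function toAgentView(response: GatewayReadyz): AgentReadyz {
  return response;
}
```

In a real repo the two types would be imported from their packages into a test file that is only type-checked, so drift fails the build rather than a runtime test.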
No entrypoint-wiring test for the readiness timeout. apps/workspace-agent/src/main.ts is side-effectful (binds ports at import), so there's no clean seam to assert the resolved WORKSPACE_OPENCODE_READY_TIMEOUT_MS reaches startOpencodeServer. Extracting a small startWorkspaceAgent(deps) would make the wiring testable.
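One possible shape for that seam, as a sketch. The `AgentDeps` type, the options object passed to `startOpencodeServer`, and the parsing rules in `resolveReadyTimeoutMs` are assumptions; only the environment variable name, the function names, and the 60s default come from this issue.

```typescript
// Hypothetical dependency bundle: everything main.ts currently reaches for
// at import time is passed in instead.
type AgentDeps = {
  env: Record<string, string | undefined>;
  startOpencodeServer: (opts: { readyTimeoutMs: number }) => Promise<void>;
};

const DEFAULT_READY_TIMEOUT_MS = 60_000;

// Assumed parsing rule: fall back to the default for missing, empty,
// non-numeric, or non-positive values.
function resolveReadyTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env.WORKSPACE_OPENCODE_READY_TIMEOUT_MS;
  const parsed = raw === undefined ? Number.NaN : Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_READY_TIMEOUT_MS;
}

// The extracted entrypoint: no ports bound at import, so a test can call it
// with stubs and observe what reaches startOpencodeServer.
async function startWorkspaceAgent(deps: AgentDeps): Promise<void> {
  await deps.startOpencodeServer({
    readyTimeoutMs: resolveReadyTimeoutMs(deps.env),
  });
}

// main.ts would then shrink to something like (hypothetical):
//   void startWorkspaceAgent({ env: process.env, startOpencodeServer });
```

A wiring test then passes a stub `startOpencodeServer` and asserts on the options it received.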
Readiness depth
/readyz reflects only the loopback OpenCode status, not the path the gateway actually attaches through. Readiness currently mirrors the supervisor's opencodeStatus (loopback :54321). The gateway attaches via the bearer proxy on :9200. Consider gating readiness on the proxy listening/healthy as well, so /readyz means "the attach path is usable" rather than "OpenCode booted." (The post-ready-exit case, where OpenCode dies after reaching ready, is handled by the supervisor respawn work tracked under #749.)
Priority is low: these are resilience and coverage improvements on top of a working fix.