Workspace/gateway reliability hardening: attach-path timeouts + readiness depth #763

Description

@marcusrbrown

Follow-up hardening surfaced while shipping the #749 supervisor-readiness fixes. None of these are required for the cold-boot mention-loop fix (that landed in #755 and #761), but each closes a residual failure or test gap. Grouping them so they can be picked up together or à la carte.

Attach-path hang windows (same class as the readiness-probe bug)

The #749 fix added a per-probe timeout to the workspace readiness probe because a fetch that is accepted but never answered hangs forever. The same pattern exists on two other paths:

  • Gateway SDK calls don't carry the run abort signal. packages/gateway/src/execute/run-core.ts receives a signal (used only for event-stream iteration), but client.session.create(), client.event.subscribe(), and client.session.promptAsync() are called without it. If the workspace proxy accepts the connection but stalls, a mention run can hang before the iteration loop ever observes signal.aborted. Thread the run signal (and/or a bounded AbortSignal.timeout) into those calls.
  • The workspace OpenCode proxy has no upstream request timeout. apps/workspace-agent/src/opencode-proxy.ts forwards via http.request() and only handles the error event — a stalled upstream leaves the gateway call hanging. Add an upstream timeout for non-streaming requests. Careful: the proxy also carries the SSE event stream, so a short total timeout must NOT be applied to event streams — only to ordinary request/response calls.
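
A minimal sketch of both attach-path items, assuming TypeScript on a runtime with a global AbortController. Every name in it (withDeadline, callBounded, upstreamTimeoutMs) is hypothetical, and whether the SDK calls accept a signal at all is an assumption; where they do not, the race in callBounded still bounds the caller. For the proxy, the sketch assumes event-stream requests can be recognized by an Accept: text/event-stream header. The real proxy may need to key on the route or the upstream Content-Type instead.

```typescript
// Sketch only: every name below is hypothetical, not taken from the repository.

/** Derive a signal that aborts when `parent` aborts or after `ms`, whichever comes first. */
function withDeadline(
  parent: AbortSignal | undefined,
  ms: number,
): { signal: AbortSignal; dispose: () => void } {
  const controller = new AbortController();
  if (parent?.aborted) {
    controller.abort();
    return { signal: controller.signal, dispose: () => {} };
  }
  const onParentAbort = (): void => controller.abort();
  const timer = setTimeout(() => controller.abort(), ms);
  const dispose = (): void => {
    clearTimeout(timer);
    parent?.removeEventListener("abort", onParentAbort);
  };
  parent?.addEventListener("abort", onParentAbort, { once: true });
  // Release the timer and the parent listener as soon as the derived signal fires.
  controller.signal.addEventListener("abort", dispose, { once: true });
  return { signal: controller.signal, dispose };
}

/**
 * Run one request/response call under the derived signal. The race covers
 * callees that accept a signal but do not honor it while stalled.
 */
async function callBounded<T>(
  call: (signal: AbortSignal) => Promise<T>,
  runSignal: AbortSignal | undefined,
  ms: number,
): Promise<T> {
  const { signal, dispose } = withDeadline(runSignal, ms);
  const aborted = new Promise<never>((_, reject) => {
    const fail = (): void => reject(new Error(`aborted or timed out after ${ms}ms`));
    if (signal.aborted) fail();
    else signal.addEventListener("abort", fail, { once: true });
  });
  try {
    return await Promise.race([call(signal), aborted]);
  } finally {
    dispose();
  }
}

/**
 * Proxy side: return a total timeout for ordinary calls and `undefined`
 * for SSE, so long-lived event streams are never cut off by it.
 */
function upstreamTimeoutMs(
  acceptHeader: string | undefined,
  defaultMs: number,
): number | undefined {
  const wantsEventStream = (acceptHeader ?? "").toLowerCase().includes("text/event-stream");
  return wantsEventStream ? undefined : defaultMs;
}
```

A call site might then read `await callBounded((signal) => createSession(signal), runSignal, 10_000)`, where createSession is a stand-in wrapper around the SDK call and 10s is an arbitrary bound. Note that losing the race does not cancel the underlying request unless the callee honors the signal, so passing the signal through is still the primary fix.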

Readiness depth

  • /readyz reflects only the loopback OpenCode status, not the path the gateway actually attaches through. Readiness currently mirrors the supervisor's opencodeStatus (loopback :54321), but the gateway attaches via the bearer proxy on :9200. Consider gating readiness on the proxy listening/healthy as well, so /readyz means "the attach path is usable" rather than "OpenCode booted." (The post-ready-exit case, where OpenCode dies after reaching ready, is handled by the supervisor respawn work tracked under #749.)
  • The overall readiness timeout is not a hard cap. A single probe can overshoot the configured WORKSPACE_OPENCODE_READY_TIMEOUT_MS by up to one per-probe timeout (~3s). Immaterial at the 60s default; only matters if an operator sets a very low overall timeout. Cap each probe to the remaining deadline if a hard bound is wanted.
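
The hard-cap variant comes down to one line of arithmetic: each probe gets the smaller of the per-probe timeout and whatever remains of the overall deadline. A rough sketch follows, with hypothetical names (probeTimeoutMs, waitUntilReady) and a hypothetical probe callback that is assumed to enforce the timeout it is handed. This is not the supervisor's actual loop.

```typescript
// Sketch only: names and loop shape are hypothetical, not the supervisor's code.

/** Per-probe timeout capped to what is left of the overall deadline (never negative). */
function probeTimeoutMs(deadlineAt: number, now: number, perProbeMs: number): number {
  return Math.max(0, Math.min(perProbeMs, deadlineAt - now));
}

/**
 * Poll `probe` until it reports ready or the overall deadline passes.
 * `probe` is assumed to give up by itself after the timeout it receives.
 */
async function waitUntilReady(
  probe: (timeoutMs: number) => Promise<boolean>,
  opts: { overallMs: number; perProbeMs: number; intervalMs: number },
): Promise<boolean> {
  const deadlineAt = Date.now() + opts.overallMs;
  for (;;) {
    const budget = probeTimeoutMs(deadlineAt, Date.now(), opts.perProbeMs);
    if (budget === 0) return false; // deadline reached: do not start another probe
    if (await probe(budget)) return true;
    // Never sleep past the deadline either.
    const pause = Math.min(opts.intervalMs, deadlineAt - Date.now());
    if (pause <= 0) return false;
    await new Promise((resolve) => setTimeout(resolve, pause));
  }
}
```

With a 60s overall timeout and a 3s per-probe timeout, a probe that starts at 58.5s gets a 1.5s budget instead of 3s, so the wait ends at 60s rather than 61.5s.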

Test gaps

  • No cross-package type-mirror test. apps/workspace-agent/src/types.ts ReadyzResponse (flat) and packages/gateway/src/workspace-api/types.ts (discriminated union) are kept wire-compatible by hand. A compile-time equivalence/assignability test would catch future drift before runtime.
  • No entrypoint-wiring test for the readiness timeout. apps/workspace-agent/src/main.ts is side-effectful (binds ports at import), so there's no clean seam to assert the resolved WORKSPACE_OPENCODE_READY_TIMEOUT_MS reaches startOpencodeServer. Extracting a small startWorkspaceAgent(deps) would make the wiring testable.
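
For the type-mirror gap, a compile-time check could look roughly like the sketch below. The two shapes are invented stand-ins (AgentReadyz for the flat type, GatewayReadyz for the discriminated union); the real test would import both definitions instead of redeclaring them. A flat type with a plain boolean is not assignable to a discriminated union, so the sketch checks the direction that can hold (union into flat) plus key-set equality.

```typescript
// Sketch only: both shapes are invented stand-ins for the real definitions.
type AgentReadyz = { ready: boolean; status: string; error?: string };
type GatewayReadyz =
  | { ready: true; status: "ready" }
  | { ready: false; status: "starting" | "failed"; error?: string };

/** True when every value of `From` is accepted where `To` is expected. */
type Assignable<From, To> = [From] extends [To] ? true : false;

/** Strict type identity, using the usual generic-function comparison trick. */
type Equal<A, B> =
  (<T>() => T extends A ? 1 : 2) extends (<T>() => T extends B ? 1 : 2) ? true : false;

/** All keys that appear on any member of a union. */
type KeysOfUnion<T> = T extends unknown ? keyof T : never;

/** Fails to compile unless the type argument is exactly `true`. */
function expectTrue<T extends true>(_witness?: T): void {}

// Every payload the gateway models must be a valid agent payload.
expectTrue<Assignable<GatewayReadyz, AgentReadyz>>();
// Neither side may gain or lose a field without the other following.
expectTrue<Equal<KeysOfUnion<GatewayReadyz>, keyof AgentReadyz>>();
```

Because the assertions live purely in the type system, the check runs as part of type-checking and needs no test runner. Adding a field to only one of the two types makes the second assertion fail to compile.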

Priority is low — these are resilience and coverage improvements on top of a working fix.
