Gateway operator runs can time out after emitting valid output #1055

Description

@marcusrbrown

Problem

A web operator run can emit valid output and then end as FAILED when the Gateway hard run timeout fires before OpenCode emits session.idle.

Observed run:

  • runId: 64c64589-7da9-48e1-893e-6c2585668fbd
  • entityRef: marcusrbrown/systematic
  • surface: web
  • stream advertised contractVersion: "1.5.0"
  • stream emitted normal output frames through final output seq:52
  • stream then emitted terminal status:
    • phase: "FAILED"
    • status: "failed"

Gateway logs for the run show this was a timeout, not dashboard contract drift:

  • run-core: stream ended due to timeout signal
  • run: execution failed
  • kind: "timeout"
  • err: "Run timed out: event stream aborted by timeout signal"

The dashboard now accepts and renders the 1.5.0 stream and production serves the corrected runtime, so this remaining failure belongs to Gateway/operator execution policy.

Source findings

Relevant current behavior:

  • packages/gateway/src/config.ts parses GATEWAY_RUN_TIMEOUT_MS, defaulting to 600000 ms.
  • packages/gateway/src/execute/run.ts computes remaining budget from run start time, creates AbortSignal.timeout(remainingBudgetMs), and passes it into runOpenCodeCore(...).
  • packages/gateway/src/execute/run.ts derives approval deadlines from the same remaining budget; approvals are capped below the hard abort.
  • packages/gateway/src/execute/run-core.ts treats an abort after output but before session.idle as RunCoreError('timeout').
  • The SSE manager exposes terminal status and output frames; there is no separate operator-facing error-detail event today.
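For discussion, the behavior described above can be approximated as follows. This is a sketch under assumptions, not the actual Gateway code: parseRunTimeoutMs, remainingBudgetMs, createRunSignal, and classifyRun are hypothetical names, and the 'incomplete' result is invented for illustration. Only AbortSignal.timeout and the 600000 ms default come from the platform and from this issue.

```typescript
// Hypothetical helpers illustrating the current policy; not the real Gateway code.
const DEFAULT_RUN_TIMEOUT_MS = 600_000; // default stated in this issue

// Parse GATEWAY_RUN_TIMEOUT_MS-style input, falling back to the default
// when the value is missing, non-numeric, or not positive.
function parseRunTimeoutMs(raw: string | undefined): number {
  const parsed = Number(raw);
  return raw !== undefined && Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_RUN_TIMEOUT_MS;
}

// Budget is measured from run start, so time already spent counts against it.
function remainingBudgetMs(startedAtMs: number, timeoutMs: number, nowMs: number): number {
  return Math.max(0, timeoutMs - (nowMs - startedAtMs));
}

// The hard abort is a plain wall-clock signal over the remaining budget.
function createRunSignal(startedAtMs: number, timeoutMs: number, nowMs: number): AbortSignal {
  return AbortSignal.timeout(remainingBudgetMs(startedAtMs, timeoutMs, nowMs));
}

type RunResult = 'success' | 'timeout' | 'incomplete';

// Output alone never counts as success; only session.idle does.
function classifyRun(sawSessionIdle: boolean, aborted: boolean): RunResult {
  if (sawSessionIdle) return 'success';
  return aborted ? 'timeout' : 'incomplete';
}
```

Under this model, a run that has emitted output through seq:52 but is aborted before session.idle classifies as 'timeout', which matches the observed FAILED terminal status.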

Infra finding:

  • marcusrbrown/infra currently exposes WORKSPACE_OPENCODE_READY_TIMEOUT_MS, but does not expose or document GATEWAY_RUN_TIMEOUT_MS in the Gateway deploy flow.
  • Changing the Gateway run timeout policy likely belongs in this repo first; infra can then expose the knob if needed.

Why this matters

The operator sees useful output, but the run still lands in a terminal failure state because the current hard timeout is based on elapsed wall-clock time and success requires session.idle. Longer systematic/onboarding-style runs can therefore look successful from their output, then fail at the Gateway lifecycle layer.

Candidate directions

A dedicated implementation session should evaluate one of these rather than patching the dashboard:

  1. Raise or make configurable the Gateway run timeout for operator/web runs.
  2. Keep a hard cap, but make timeout activity-aware so active output/progress is not aborted solely for exceeding 10 minutes.
  3. Add an explicit resumable/continue path before aborting long-running operator work.
  4. Split human approval wait time from compute/output time so approval latency does not consume the same execution budget.
  5. Improve operator-facing failure detail if timeout remains the correct outcome, without leaking sensitive run internals.
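As a starting point for direction 2, one possible shape is an inactivity limit that resets on each output frame, bounded by a hard cap measured from run start. This is only a sketch: DeadlineState, DeadlinePolicy, recordActivity, and expiryReason are hypothetical names, and the limits used in the usage note are illustrative, not proposed values.

```typescript
// Hypothetical activity-aware deadline; names and limits are assumptions.
interface DeadlineState {
  startedAtMs: number;
  lastActivityMs: number;
}

interface DeadlinePolicy {
  inactivityLimitMs: number; // abort if no output/progress for this long
  hardCapMs: number;         // abort regardless of activity after this long
}

type ExpiryReason = 'hard-cap' | 'inactive';

function startDeadline(startedAtMs: number): DeadlineState {
  return { startedAtMs, lastActivityMs: startedAtMs };
}

// Call whenever an output or progress frame is observed.
function recordActivity(state: DeadlineState, nowMs: number): DeadlineState {
  return { ...state, lastActivityMs: Math.max(state.lastActivityMs, nowMs) };
}

// Returns why the run should be aborted, or null if it may continue.
// The hard cap is checked first so hung-but-chatty streams still fail closed.
function expiryReason(
  state: DeadlineState,
  policy: DeadlinePolicy,
  nowMs: number,
): ExpiryReason | null {
  if (nowMs - state.startedAtMs >= policy.hardCapMs) return 'hard-cap';
  if (nowMs - state.lastActivityMs >= policy.inactivityLimitMs) return 'inactive';
  return null;
}
```

With an inactivity limit of 600000 ms and a hard cap of 1800000 ms, a run that emitted a frame at 500000 ms is still alive at 700000 ms, while a silent run would already have expired as 'inactive'. The hard cap keeps the constraint that hung streams must release slots.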

Constraints

  • Do not remove the hard timeout entirely; hung streams must still release slots and fail closed.
  • Preserve session.idle as the success signal unless a deliberate replacement is designed.
  • Preserve approval deadline safety, including Discord interaction-token limits.
  • Keep operator-visible error detail sanitized.
  • If a new config knob is added, document the infra follow-up required to expose it in deployment.

Acceptance criteria

  • Long-running web operator runs that continue producing output do not fail only because the existing 10-minute wall-clock budget elapsed, unless that remains an explicit product decision.
  • Timeout behavior remains bounded and releases all Gateway resources.
  • Timeout/failure copy makes the operator state understandable without exposing secrets or internal paths.
  • Tests cover output-before-timeout, stream-ended-before-idle, approval-deadline interaction, and configured timeout behavior.
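If timeout remains the correct outcome, the failure copy criterion could be met by mapping the internal error kind to fixed operator-facing text rather than forwarding the raw err string. A minimal sketch, assuming a hypothetical toOperatorDetail helper and illustrative copy; no such operator-facing detail event exists today:

```typescript
// Hypothetical sanitizer: only a known kind and fixed copy reach the operator.
type OperatorFailureKind = 'timeout' | 'unknown';

interface OperatorFailureDetail {
  kind: OperatorFailureKind;
  message: string;
}

// Illustrative copy only; final wording is a product decision.
const OPERATOR_COPY: Record<OperatorFailureKind, string> = {
  timeout: 'The run exceeded its time limit before it finished. Output shown so far may be incomplete.',
  unknown: 'The run failed. No further detail is available.',
};

// The raw internal error text is deliberately not an input, so it cannot leak.
function toOperatorDetail(internalKind: string): OperatorFailureDetail {
  const kind: OperatorFailureKind = internalKind === 'timeout' ? 'timeout' : 'unknown';
  return { kind, message: OPERATOR_COPY[kind] };
}
```

Any unrecognized internal kind collapses to 'unknown', so new internal error kinds fail closed until copy is written for them.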
