Problem
A web operator run can emit valid output and then end as FAILED when the Gateway hard run timeout fires before OpenCode emits session.idle.
Observed run:
- runId: 64c64589-7da9-48e1-893e-6c2585668fbd
- entityRef: marcusrbrown/systematic
- surface: web
- stream advertised contractVersion: "1.5.0"
- stream emitted normal output frames through final output seq:52
- stream then emitted terminal status: phase: "FAILED", status: "failed"
Gateway logs for the run show this was a timeout, not dashboard contract drift:
run-core: stream ended due to timeout signal
run: execution failed
kind: "timeout"
err: "Run timed out: event stream aborted by timeout signal"
The dashboard now accepts and renders the 1.5.0 stream and production serves the corrected runtime, so this remaining failure belongs to Gateway/operator execution policy.
Source findings
Relevant current behavior:
- packages/gateway/src/config.ts parses GATEWAY_RUN_TIMEOUT_MS, defaulting to 600000 ms.
- packages/gateway/src/execute/run.ts computes remaining budget from run start time, creates AbortSignal.timeout(remainingBudgetMs), and passes it into runOpenCodeCore(...).
- packages/gateway/src/execute/run.ts derives approval deadlines from the same remaining budget; approvals are capped below the hard abort.
- packages/gateway/src/execute/run-core.ts treats an abort after output but before session.idle as RunCoreError('timeout').
- The SSE manager exposes terminal status and output frames; there is no separate operator-facing error-detail event today.
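The budget arithmetic described above can be sketched roughly as follows. This is a minimal illustration, not the code in run.ts: the function names (parseRunTimeoutMs, remainingBudgetMs, approvalDeadlineMs, createRunSignal) and the safety-margin parameter are hypothetical. Only GATEWAY_RUN_TIMEOUT_MS, the 600000 ms default, and the use of AbortSignal.timeout come from the findings.

```typescript
// Illustrative sketch only; function names and the safety margin are assumptions,
// not the actual packages/gateway implementation.

const DEFAULT_RUN_TIMEOUT_MS = 600_000; // default stated in the findings

// Parse the env value, falling back to the default for missing or invalid input.
function parseRunTimeoutMs(raw: string | undefined): number {
  if (raw === undefined) return DEFAULT_RUN_TIMEOUT_MS;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_RUN_TIMEOUT_MS;
}

// Wall-clock budget left for the run, measured from its start time.
function remainingBudgetMs(startedAtMs: number, timeoutMs: number, nowMs: number): number {
  return Math.max(0, timeoutMs - (nowMs - startedAtMs));
}

// Approval wait is capped below the hard abort by a (hypothetical) safety margin.
function approvalDeadlineMs(remainingMs: number, safetyMarginMs: number): number {
  return Math.max(0, remainingMs - safetyMarginMs);
}

// The hard abort: a single signal covers everything left in the budget.
function createRunSignal(startedAtMs: number, timeoutMs: number, nowMs: number): AbortSignal {
  return AbortSignal.timeout(remainingBudgetMs(startedAtMs, timeoutMs, nowMs));
}
```

Because the budget is measured from run start, any time spent waiting (approvals included) reduces what is left for output, regardless of whether the stream is still making progress.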
Infra finding:
- marcusrbrown/infra currently exposes WORKSPACE_OPENCODE_READY_TIMEOUT_MS, but does not expose or document GATEWAY_RUN_TIMEOUT_MS in the Gateway deploy flow.
- Changing the Gateway run timeout policy likely belongs in this repo first; infra can then expose the knob if needed.
Why this matters
The operator sees useful output, but the run still lands in a terminal failure state because the current hard timeout is based on elapsed wall-clock time and success requires session.idle. Longer systematic/onboarding-style runs can therefore look successful from their output, then fail at the Gateway lifecycle layer.
Candidate directions
A dedicated implementation session should evaluate one of these rather than patching the dashboard:
- Raise or make configurable the Gateway run timeout for operator/web runs.
- Keep a hard cap, but make timeout activity-aware so active output/progress is not aborted solely for exceeding 10 minutes.
- Add an explicit resumable/continue path before aborting long-running operator work.
- Split human approval wait time from compute/output time so approval latency does not consume the same execution budget.
- Improve operator-facing failure detail if timeout remains the correct outcome, without leaking sensitive run internals.
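As one possible shape for the activity-aware direction, the sketch below keeps a hard cap but adds a separate idle limit that resets whenever the stream makes progress. Everything here is hypothetical: the class name, the decision labels, and the limits are placeholders for discussion, and the clock is passed in explicitly so the behavior is easy to test.

```typescript
// Illustrative sketch only; names and limits are assumptions, not Gateway code.

type TimeoutDecision = "continue" | "idle-timeout" | "hard-cap";

class ActivityAwareBudget {
  private readonly startedAtMs: number;
  private readonly idleLimitMs: number;
  private readonly hardCapMs: number;
  private lastActivityMs: number;

  constructor(startedAtMs: number, idleLimitMs: number, hardCapMs: number) {
    this.startedAtMs = startedAtMs;
    this.idleLimitMs = idleLimitMs;
    this.hardCapMs = hardCapMs;
    this.lastActivityMs = startedAtMs;
  }

  // Call whenever an output or progress frame arrives.
  recordActivity(nowMs: number): void {
    this.lastActivityMs = Math.max(this.lastActivityMs, nowMs);
  }

  // The hard cap always wins, so hung or runaway streams still fail closed.
  check(nowMs: number): TimeoutDecision {
    if (nowMs - this.startedAtMs >= this.hardCapMs) return "hard-cap";
    if (nowMs - this.lastActivityMs >= this.idleLimitMs) return "idle-timeout";
    return "continue";
  }
}
```

With this shape, a run that keeps emitting frames is only stopped by the hard cap, while a silent stream is stopped much sooner by the idle limit. Whether the hard cap should stay at 10 minutes for web runs remains the product decision.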
Constraints
- Do not remove the hard timeout entirely; hung streams must still release slots and fail closed.
- Preserve session.idle as the success signal unless a deliberate replacement is designed.
- Preserve approval deadline safety, including Discord interaction-token limits.
- Keep operator-visible error detail sanitized.
- If a new config knob is added, document the infra follow-up required to expose it in deployment.
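To make the sanitization constraint concrete, here is a minimal sketch of mapping an internal failure kind to fixed operator-facing copy. The interface, function name, and message wording are all assumptions for illustration; no such event exists in the SSE manager today.

```typescript
// Illustrative sketch only; the type, function name, and copy are assumptions.

interface OperatorFailureDetail {
  kind: "timeout" | "unknown";
  message: string;
  outputWasEmitted: boolean;
}

// Map an internal failure to fixed, pre-approved copy. The raw error text is
// deliberately not accepted as input, so it can never be forwarded to the operator.
function toOperatorFailureDetail(
  internalKind: string,
  outputWasEmitted: boolean,
): OperatorFailureDetail {
  if (internalKind === "timeout") {
    return {
      kind: "timeout",
      message: outputWasEmitted
        ? "The run produced output but exceeded its time budget before finishing."
        : "The run exceeded its time budget before producing output.",
      outputWasEmitted,
    };
  }
  // Anything unrecognized collapses to a generic message.
  return {
    kind: "unknown",
    message: "The run failed. See Gateway logs for details.",
    outputWasEmitted,
  };
}
```

Taking only the failure kind and a boolean, rather than the error object, keeps internal paths and secrets out of the operator-visible payload by construction.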
Acceptance criteria
- Long-running web operator runs that continue producing output do not fail only because the existing 10-minute wall-clock budget elapsed, unless that remains an explicit product decision.
- Timeout behavior remains bounded and releases all Gateway resources.
- Timeout/failure copy makes the operator state understandable without exposing secrets or internal paths.
- Tests cover output-before-timeout, stream-ended-before-idle, approval-deadline interaction, and configured timeout behavior.