Gateway operator runs can time out after emitting valid output #1055

## Problem

A web operator run can emit valid output and then end as `FAILED` when the Gateway hard run timeout fires before OpenCode emits `session.idle`.

Observed run:

- `runId`: `64c64589-7da9-48e1-893e-6c2585668fbd`
- `entityRef`: `marcusrbrown/systematic`
- `surface`: `web`
- stream advertised `contractVersion: "1.5.0"`
- stream emitted normal output frames through final output `seq:52`
- stream then emitted terminal status:
  - `phase: "FAILED"`
  - `status: "failed"`

Gateway logs for the run show this was a timeout, not dashboard contract drift:

- `run-core: stream ended due to timeout signal`
- `run: execution failed`
- `kind: "timeout"`
- `err: "Run timed out: event stream aborted by timeout signal"`

The dashboard now accepts and renders the `1.5.0` stream, and production serves the corrected runtime, so this remaining failure belongs to Gateway/operator execution policy.

## Source findings

Relevant current behavior:

- `packages/gateway/src/config.ts` parses `GATEWAY_RUN_TIMEOUT_MS`, defaulting to `600000` ms.
- `packages/gateway/src/execute/run.ts` computes remaining budget from run start time, creates `AbortSignal.timeout(remainingBudgetMs)`, and passes it into `runOpenCodeCore(...)`.
- `packages/gateway/src/execute/run.ts` derives approval deadlines from the same remaining budget; approvals are capped below the hard abort.
- `packages/gateway/src/execute/run-core.ts` treats an abort after output but before `session.idle` as `RunCoreError('timeout')`.
- The SSE manager exposes terminal `status` and `output` frames; there is no separate operator-facing error-detail event today.
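
The budget arithmetic described in these findings can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `run.ts`: the function names and the `APPROVAL_SAFETY_MARGIN_MS` value are hypothetical, and only `GATEWAY_RUN_TIMEOUT_MS` and its `600000` ms default come from the findings above.

```typescript
// Sketch only: helper names and the safety margin are assumptions, not Gateway code.
const DEFAULT_RUN_TIMEOUT_MS = 600_000; // default for GATEWAY_RUN_TIMEOUT_MS per the findings
const APPROVAL_SAFETY_MARGIN_MS = 5_000; // hypothetical gap between approval deadline and hard abort

// Parse the env value, falling back to the default for missing or invalid input.
function parseRunTimeoutMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === '') return DEFAULT_RUN_TIMEOUT_MS;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : DEFAULT_RUN_TIMEOUT_MS;
}

// Budget left for the hard abort, measured from run start (wall clock), never negative.
function remainingBudgetMs(startedAtMs: number, nowMs: number, timeoutMs: number): number {
  return Math.max(0, timeoutMs - (nowMs - startedAtMs));
}

// Approval deadline derived from the same budget, capped below the hard abort.
function approvalDeadlineMs(nowMs: number, remainingMs: number): number {
  return nowMs + Math.max(0, remainingMs - APPROVAL_SAFETY_MARGIN_MS);
}
```

In this model the value from `remainingBudgetMs(...)` is what would be handed to `AbortSignal.timeout(...)`. Every wall-clock millisecond since run start reduces it, whether or not the run is still producing output.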

Infra finding:

- `marcusrbrown/infra` currently exposes `WORKSPACE_OPENCODE_READY_TIMEOUT_MS`, but does not expose or document `GATEWAY_RUN_TIMEOUT_MS` in the Gateway deploy flow.
- Changing the Gateway run timeout policy likely belongs in this repo first; infra can then expose the knob if needed.

## Why this matters

The operator sees useful output, but the run still lands in a terminal failure state because the current hard timeout is based on elapsed wall-clock time and success requires `session.idle`. Longer systematic/onboarding-style runs can therefore look successful from their output, then fail at the Gateway lifecycle layer.

## Candidate directions

A dedicated implementation session should evaluate one of these rather than patching the dashboard:

1. Raise the Gateway run timeout for operator/web runs, or make it configurable.
2. Keep a hard cap, but make the timeout activity-aware so that a run with active output/progress is not aborted solely for exceeding 10 minutes.
3. Add an explicit resumable/continue path before aborting long-running operator work.
4. Split human approval wait time from compute/output time so approval latency does not consume the same execution budget.
5. Improve operator-facing failure detail if timeout remains the correct outcome, without leaking sensitive run internals.
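
As one way to picture direction 2, the sketch below combines an inactivity timeout with an absolute hard cap. All names and the example durations are hypothetical; nothing here reflects an existing Gateway API, and the clock is passed in explicitly so the logic stays testable.

```typescript
// Sketch only: types, names, and durations are illustrative assumptions.
interface ActivityTimeoutPolicy {
  idleTimeoutMs: number; // abort if no output/progress for this long
  hardCapMs: number; // absolute ceiling, so hung or runaway runs still fail closed
}

interface ActivityState {
  startedAtMs: number;
  lastActivityMs: number;
}

function startRun(startedAtMs: number): ActivityState {
  return { startedAtMs, lastActivityMs: startedAtMs };
}

// Call whenever the stream emits an output or progress frame.
function recordActivity(state: ActivityState, nowMs: number): ActivityState {
  return { ...state, lastActivityMs: Math.max(state.lastActivityMs, nowMs) };
}

// Returns why the run should be aborted at nowMs, or null if it may continue.
function expiryReason(
  state: ActivityState,
  policy: ActivityTimeoutPolicy,
  nowMs: number,
): 'hard-cap' | 'idle' | null {
  if (nowMs - state.startedAtMs >= policy.hardCapMs) return 'hard-cap';
  if (nowMs - state.lastActivityMs >= policy.idleTimeoutMs) return 'idle';
  return null;
}
```

The hard cap is checked first so that a stream emitting output forever is still bounded, which keeps the fail-closed property the existing timeout provides.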

## Constraints

- Do not remove the hard timeout entirely; hung streams must still release slots and fail closed.
- Preserve `session.idle` as the success signal unless a deliberate replacement is designed.
- Preserve approval deadline safety, including Discord interaction-token limits.
- Keep operator-visible error detail sanitized.
- If a new config knob is added, document the infra follow-up required to expose it in deployment.
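
The sanitization constraint could be met by mapping an internal failure kind to fixed operator copy, so raw error strings never reach the stream. The sketch below is an assumption-laden illustration: the function name and the copy text are hypothetical, and only the `timeout` kind comes from the Gateway logs quoted earlier.

```typescript
// Sketch only: the copy strings and function name are illustrative assumptions.
const OPERATOR_COPY = new Map<string, string>([
  ['timeout', 'The run exceeded its time limit before finishing. Output shown so far may be incomplete.'],
]);

const FALLBACK_COPY = 'The run failed before completing.';

// The internal error text is accepted but deliberately never echoed, so secrets
// and internal paths cannot leak into operator-visible detail.
function operatorFailureCopy(kind: string, _internalErr?: string): string {
  return OPERATOR_COPY.get(kind) ?? FALLBACK_COPY;
}
```

Because the output is drawn only from a fixed table, adding a new failure kind requires an explicit copy decision instead of passing through whatever the error message happens to contain.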

## Acceptance criteria

- Long-running web operator runs that continue producing output do not fail only because the existing 10-minute wall-clock budget elapsed, unless that remains an explicit product decision.
- Timeout behavior remains bounded and releases all Gateway resources.
- Timeout/failure copy makes the operator state understandable without exposing secrets or internal paths.
- Tests cover output-before-timeout, stream-ended-before-idle, approval-deadline interaction, and configured timeout behavior.
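
To make the test cases above concrete, the current terminal classification can be modeled as a small pure function. This is a sketch of the behavior described under Source findings, not the actual `run-core.ts` logic: the type names, the `succeeded` status, and the `stream-ended` kind are hypothetical.

```typescript
// Sketch only: models the described behavior; names other than 'failed' and 'timeout' are assumptions.
interface RunEndObservation {
  sawOutput: boolean; // at least one output frame was emitted
  sawSessionIdle: boolean; // OpenCode emitted session.idle
  abortedByTimeout: boolean; // the hard timeout signal fired
}

type RunEnd =
  | { status: 'succeeded' }
  | { status: 'failed'; kind: 'timeout' | 'stream-ended' };

function classifyRunEnd(o: RunEndObservation): RunEnd {
  // session.idle is the success signal; output alone is not enough.
  if (o.sawSessionIdle) return { status: 'succeeded' };
  if (o.abortedByTimeout) return { status: 'failed', kind: 'timeout' };
  return { status: 'failed', kind: 'stream-ended' };
}
```

Note that `sawOutput` does not influence the result, which is exactly the behavior this issue reports: a run with valid output and no `session.idle` before the abort is classified as a timeout failure.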

