You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Stream-idle timeouts are the dominant failure class for long-running stages (#549, today's session, recurring `wait-copilot` bails). The watchdog exists because live streaming requires a healthy idle window; the upstream service occasionally goes quiet, the watchdog fires, the stage bails. Increasing timeouts and adding backoff (#549) is a partial mitigation, not a cure.
Anthropic's Message Batches API is the structural answer for any agent invocation whose latency doesn't matter: half the cost, 24-hour SLA, no idle window concept. The watchdog class of failure does not exist there.
Most agent invocations in a gremlin pipeline don't need real-time turnaround:
`plan`, `review_code`, `handoff` — downstream work doesn't start until the agent finishes anyway. Latency is irrelevant.
A `mode: Literal["live", "batch"]` parameter on the agent call lets each stage pick.
Design questions to settle before coding
Where does the mode parameter live?
Candidates: `Client.run(...)` kwarg; a per-call `mode` field on whatever struct the stage hands the client; a stage-level constant in `gremlins/stages/.py`. Pros/cons of each.
What does the runtime do while a batch is outstanding?
The gremlin process can't `sleep(24h)` — that ties up a worker for nothing. Options:
Checkpoint state and exit; a separate cron/poller relaunches the gremlin when the batch lands.
Park the process and poll the batch API every N minutes.
Use webhooks if Anthropic offers one (check API capabilities).
The choice affects `run_pipeline.py`, `launcher.resume()`, the fleet manager's liveness model, and `gremlins` (status) output.
Liveness model in the fleet manager.
Today: `running`, `finished`, `dead:exit N`, `bailed`. Batch needs a new state, e.g. `waiting-on-batch` or `parked`. `gremlins` (status) needs to display it; `gremlins rescue` needs to know not to re-spawn a parked gremlin; `gremlins stop` needs to cancel an outstanding batch.
Per-provider scope.
First-cut implementation in `gremlins/clients/claude.py` only; `CopilotClient` falls back to live (it doesn't stream anyway). Out-of-tree providers (OpenAI agents in `gremlins/clients/providers/`) get `NotImplementedError` on batch mode until they're wired.
Cost accounting.
`total_cost_usd` in state.json needs to reflect batch pricing. Anthropic returns cost in the batch result; check the field name and convert.
Failure modes.
Batch can: succeed, partially succeed (some messages in the batch failed), expire (24h SLA exceeded), be cancelled. Each needs handling. For a single-message batch (which is what gremlins would submit per agent call), most of these collapse — but the design should be deliberate.
Scope of the first implementation
After the design questions are answered, the first PR should be the minimum that produces a working batch invocation end-to-end:
`Client.run(..., mode="batch")` on `ClaudeClient` submits via the Batches API and either blocks (option B above) or checkpoints (option A).
One stage opted in (suggested: `plan`, since it's the simplest and never iterates).
Fleet status surfaces whatever new liveness state batch needs.
Cost accounting works.
Subsequent PRs opt additional stages in (`review_code`, `handoff`, `address_code`) once the seam is proven.
Motivation
Stream-idle timeouts are the dominant failure class for long-running stages (#549, today's session, recurring `wait-copilot` bails). The watchdog exists because live streaming requires a healthy idle window; the upstream service occasionally goes quiet, the watchdog fires, the stage bails. Increasing timeouts and adding backoff (#549) is a partial mitigation, not a cure.
Anthropic's Message Batches API is the structural answer for any agent invocation whose latency doesn't matter: half the cost, 24-hour SLA, no idle window concept. The watchdog class of failure does not exist there.
Most agent invocations in a gremlin pipeline don't need real-time turnaround:
A `mode: Literal["live", "batch"]` parameter on the agent call lets each stage pick.
Design questions to settle before coding
Where does the mode parameter live?
Candidates: `Client.run(...)` kwarg; a per-call `mode` field on whatever struct the stage hands the client; a stage-level constant in `gremlins/stages/.py`. Pros/cons of each.
What does the runtime do while a batch is outstanding?
The gremlin process can't `sleep(24h)` — that ties up a worker for nothing. Options:
The choice affects `run_pipeline.py`, `launcher.resume()`, the fleet manager's liveness model, and `gremlins` (status) output.
Liveness model in the fleet manager.
Today: `running`, `finished`, `dead:exit N`, `bailed`. Batch needs a new state, e.g. `waiting-on-batch` or `parked`. `gremlins` (status) needs to display it; `gremlins rescue` needs to know not to re-spawn a parked gremlin; `gremlins stop` needs to cancel an outstanding batch.
Per-provider scope.
First-cut implementation in `gremlins/clients/claude.py` only; `CopilotClient` falls back to live (it doesn't stream anyway). Out-of-tree providers (OpenAI agents in `gremlins/clients/providers/`) get `NotImplementedError` on batch mode until they're wired.
Cost accounting.
`total_cost_usd` in state.json needs to reflect batch pricing. Anthropic returns cost in the batch result; check the field name and convert.
Failure modes.
Batch can: succeed, partially succeed (some messages in the batch failed), expire (24h SLA exceeded), be cancelled. Each needs handling. For a single-message batch (which is what gremlins would submit per agent call), most of these collapse — but the design should be deliberate.
Scope of the first implementation
After the design questions are answered, the first PR should be the minimum that produces a working batch invocation end-to-end:
Subsequent PRs opt additional stages in (`review_code`, `handoff`, `address_code`) once the seam is proven.
Out of scope