Design: agent batch mode (Anthropic Message Batches API) #559

@xbrianh

Motivation

Stream-idle timeouts are the dominant failure class for long-running stages (#549, today's session, recurring `wait-copilot` bails). The watchdog exists because live streaming requires a healthy idle window; the upstream service occasionally goes quiet, the watchdog fires, the stage bails. Increasing timeouts and adding backoff (#549) is a partial mitigation, not a cure.

Anthropic's Message Batches API is the structural answer for any agent invocation whose latency doesn't matter: half the cost, results within 24 hours (a batch that has not finished by then expires), and no idle-window concept. The watchdog class of failure does not exist there.

Most agent invocations in a gremlin pipeline don't need real-time turnaround:

  • `plan`, `review_code`, `handoff` — downstream work doesn't start until the agent finishes anyway. Latency is irrelevant.
  • `address_code` — same; the next stage waits.
  • `implement`, `verify-fix` — latency-sensitive (the fix loop wants quick iteration). Live.

A `mode: Literal["live", "batch"]` parameter on the agent call lets each stage pick.
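As a rough sketch of what "each stage picks" could mean, the split listed above can be written down as a lookup table. `STAGE_MODES` and `mode_for` are hypothetical names, not existing gremlins code, and where the table actually lives is exactly design question 1 below.

```python
from typing import Dict, Literal

Mode = Literal["live", "batch"]

# Hypothetical per-stage defaults, copied from the stage list above.
STAGE_MODES: Dict[str, Mode] = {
    "plan": "batch",
    "review_code": "batch",
    "handoff": "batch",
    "address_code": "batch",
    "implement": "live",
    "verify-fix": "live",
}


def mode_for(stage: str) -> Mode:
    """Return the mode for a stage; unlisted stages stay live, as today."""
    return STAGE_MODES.get(stage, "live")
```

Defaulting unknown stages to `"live"` keeps current behaviour for anything that has not explicitly opted in.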

Design questions to settle before coding

  1. Where does the mode parameter live?
    Candidates: `Client.run(...)` kwarg; a per-call `mode` field on whatever struct the stage hands the client; a stage-level constant in `gremlins/stages/.py`. Pros/cons of each.

  2. What does the runtime do while a batch is outstanding?
    The gremlin process can't `sleep(24h)` — that ties up a worker for nothing. Options:

    • Checkpoint state and exit; a separate cron/poller relaunches the gremlin when the batch lands.
    • Park the process and poll the batch API every N minutes.
    • Use a webhook if Anthropic offers one (check API capabilities).

    The choice affects `run_pipeline.py`, `launcher.resume()`, the fleet manager's liveness model, and `gremlins` (status) output.

  3. Liveness model in the fleet manager.
    Today: `running`, `finished`, `dead:exit N`, `bailed`. Batch needs a new state, e.g. `waiting-on-batch` or `parked`. `gremlins` (status) needs to display it; `gremlins rescue` needs to know not to re-spawn a parked gremlin; `gremlins stop` needs to cancel an outstanding batch.

  4. Per-provider scope.
    First-cut implementation in `gremlins/clients/claude.py` only; `CopilotClient` falls back to live (it doesn't stream anyway). Out-of-tree providers (OpenAI agents in `gremlins/clients/providers/`) get `NotImplementedError` on batch mode until they're wired.

  5. Cost accounting.
    `total_cost_usd` in state.json needs to reflect batch pricing. The batch result likely reports token usage per message rather than a dollar figure; check the field names and convert at batch rates.

  6. Failure modes.
    A batch can: succeed, partially succeed (some requests in the batch failed), expire (not finished within 24 hours), or be cancelled. Each needs handling. For a single-message batch (which is what gremlins would submit per agent call), most of these collapse, but the design should be deliberate.
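To make questions 3 and 6 concrete, here is a minimal sketch of how a single-message batch could collapse onto fleet liveness states. The `Liveness` enum, the `waiting-on-batch` value, and `liveness_for` are hypothetical. The status strings (`in_progress`, `canceling`, `ended`) and per-request result types (`succeeded`, `errored`, `canceled`, `expired`) are what I believe the Batches API uses, and should be confirmed against the API reference before anything depends on them.

```python
from enum import Enum
from typing import Optional


class Liveness(str, Enum):
    # States that exist today (subset of those named in question 3).
    RUNNING = "running"
    BAILED = "bailed"
    # Hypothetical new state for a gremlin with an outstanding batch.
    WAITING_ON_BATCH = "waiting-on-batch"


def liveness_for(processing_status: str, result_type: Optional[str]) -> Liveness:
    """Map one single-message batch onto a fleet liveness state.

    processing_status: batch-level status (assumed values:
        "in_progress", "canceling", "ended").
    result_type: per-request outcome once the batch has ended (assumed
        values: "succeeded", "errored", "canceled", "expired"), else None.
    """
    if processing_status != "ended":
        # Still outstanding: rescue must not re-spawn this gremlin.
        return Liveness.WAITING_ON_BATCH
    if result_type == "succeeded":
        # The agent call finished; the pipeline resumes with the next stage.
        return Liveness.RUNNING
    # errored / canceled / expired / missing all end the stage.
    return Liveness.BAILED
```

With one request per batch, "partially succeeded" cannot occur, so the only decision left is whether errored, cancelled, and expired should all bail or whether some should retry. The sketch treats them all as a bail.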

Scope of the first implementation

After the design questions are answered, the first PR should be the minimum that produces a working batch invocation end-to-end:

  • `Client.run(..., mode="batch")` on `ClaudeClient` submits via the Batches API and either parks and polls (the second option under design question 2) or checkpoints and exits (the first option).
  • One stage opted in (suggested: `plan`, since it's the simplest and never iterates).
  • Fleet status surfaces whatever new liveness state batch needs.
  • Cost accounting works.
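For the cost-accounting bullet, a minimal sketch assuming the batch result exposes only token counts rather than a dollar figure. `Usage`, `cost_usd`, and the per-million-token rate parameters are hypothetical. Callers would pass in real per-model prices; none are hard-coded here. The 0.5 factor is the "half the cost" figure from the motivation section.

```python
from dataclasses import dataclass

# "Half the cost" from the motivation section; verify against current pricing.
BATCH_DISCOUNT = 0.5


@dataclass(frozen=True)
class Usage:
    """Token counts for one agent call (hypothetical container)."""
    input_tokens: int
    output_tokens: int


def cost_usd(
    usage: Usage,
    input_usd_per_mtok: float,
    output_usd_per_mtok: float,
    mode: str = "live",
) -> float:
    """Dollar cost of one call, given live per-million-token rates."""
    live = (
        usage.input_tokens * input_usd_per_mtok
        + usage.output_tokens * output_usd_per_mtok
    ) / 1_000_000
    # Batch calls are billed at the discounted rate.
    return live * BATCH_DISCOUNT if mode == "batch" else live
```

The result would be added to `total_cost_usd` in state.json the same way a live call's cost is today.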

Subsequent PRs opt additional stages in (`review_code`, `handoff`, `address_code`) once the seam is proven.
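If the first PR goes with parking the process, the polling loop itself is small. The sketch below is an assumption-laden illustration: `park_and_poll` is a hypothetical helper, and `get_status` is an injected callable so the loop can be tested without network access. With the official Python SDK it would presumably wrap something like `client.messages.batches.retrieve(batch_id).processing_status`, which should be checked against the SDK documentation.

```python
import time
from typing import Callable


def park_and_poll(
    get_status: Callable[[str], str],
    batch_id: str,
    interval_s: float = 300.0,
    max_wait_s: float = 24 * 3600 + 600,  # 24h window plus some slack
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the batch reports "ended"; return False on local timeout."""
    deadline = clock() + max_wait_s
    while True:
        if get_status(batch_id) == "ended":
            return True
        if clock() >= deadline:
            # Give up locally; the caller decides whether to cancel the batch.
            return False
        sleep(interval_s)
```

A possible usage with a stubbed status source:

```python
statuses = iter(["in_progress", "in_progress", "ended"])
done = park_and_poll(lambda _id: next(statuses), "batch_123",
                     interval_s=1.0, sleep=lambda s: None)
```

The checkpoint-and-exit option would replace the loop with a single status check on relaunch, so the same `get_status` seam serves both designs.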

Out of scope

  • The primitives/recipes restructure (Design: primitives + recipes + user pipelines #558, closed).
  • Batch support for the copilot client (it doesn't stream).
  • Cross-gremlin batch coalescing (submit one batch for many gremlins' calls). Worth exploring later but explicitly not in v1.
