Make orchestrated agent runs resilient to hot reload and process restarts #54

@germanescobar

Description

Problem

When coding-orchestrator is used to run or manage an agent while the project is being edited, a hot reload or process restart can interrupt the run after a tool call completes but before the orchestrator has fully persisted or reconciled the agent state.

This surfaced while investigating a run targeting /Users/germanescobar/Projects/pocs/coding-agent: the target agent's event log had a completed tool_result, but the persisted session snapshot lagged behind. From the orchestrator user's perspective, this looks like the orchestrator lost track of the agent call after reload.

Why this matters

coding-orchestrator should be able to orchestrate work on itself and on sibling projects. If editing/reloading the orchestrator can strand active calls or leave persisted state inconsistent, self-improvement workflows become fragile.

Failure mode to account for

  • Orchestrator starts or resumes an agent run.
  • The agent/model emits a tool call.
  • The tool call completes and at least part of the resulting event or state is written (for example, the tool_result reaches the event log).
  • Hot reload or process restart occurs before the orchestrator records the next durable checkpoint.
  • The UI/runtime resumes from stale state and no longer knows whether the call is pending, completed, or needs reconciliation.
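The ambiguity in the last step can be made concrete with a small sketch. It assumes, hypothetically, that every logged event carries a monotonically increasing `seq` and a `call_id`, and that the session snapshot records the last `seq` it has absorbed. None of these names come from coding-orchestrator itself; only the `tool_result` event kind appears in the observed log.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CallState(Enum):
    PENDING = "pending"                  # dispatched, no result logged yet
    COMPLETED = "completed"              # result logged and snapshot caught up
    NEEDS_RECONCILE = "needs_reconcile"  # result logged, snapshot is stale


@dataclass(frozen=True)
class Event:
    seq: int       # position in the event log (assumed monotonic)
    kind: str      # "tool_call" or "tool_result" (hypothetical schema)
    call_id: str


def classify_call(events: list[Event], snapshot_seq: int, call_id: str) -> Optional[CallState]:
    """Return the state of call_id, or None if the log never recorded the call."""
    mine = [e for e in events if e.call_id == call_id]
    if not any(e.kind == "tool_call" for e in mine):
        return None
    result = next((e for e in mine if e.kind == "tool_result"), None)
    if result is None:
        return CallState.PENDING
    # The snapshot is only trustworthy for this call if it has absorbed the result.
    if snapshot_seq >= result.seq:
        return CallState.COMPLETED
    return CallState.NEEDS_RECONCILE
```

Under these assumptions, the run described in the Problem section would classify as `NEEDS_RECONCILE`: the log holds a completed tool_result whose `seq` is ahead of the snapshot.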

Possible fixes

  • Persist an orchestrator-side durable checkpoint before dispatching an agent/tool call.
  • Persist each tool result or agent event as soon as it is observed, not only after a full run step finishes.
  • On server startup or session resume, reconcile active runs from the latest stored event stream and session snapshot.
  • Mark interrupted in-flight runs explicitly when possible, then allow recovery/retry from the last known durable step.
  • Add tests for restart/reload between tool result persistence and the next model/agent step.
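A minimal sketch of how the persistence and reconciliation ideas above could fit together, using only the Python standard library. The file layout (a JSON-lines event log plus a JSON snapshot), the field names (`seq`, `kind`, `call_id`, `last_seq`, `completed`, `in_flight`), and the function names are all assumptions made for illustration, not the orchestrator's actual storage format.

```python
import json
import os
import tempfile
from pathlib import Path


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a JSON line and fsync before returning."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())


def write_snapshot(snapshot_path: Path, snapshot: dict) -> None:
    """Replace the snapshot atomically so a restart never reads a partial file."""
    fd, tmp = tempfile.mkstemp(dir=snapshot_path.parent)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(snapshot, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, snapshot_path)


def read_events(log_path: Path) -> list[dict]:
    """Read all complete events; stop at a torn final line left by a crash."""
    if not log_path.exists():
        return []
    events = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                break
    return events


def reconcile(log_path: Path, snapshot_path: Path) -> dict:
    """Rebuild run state on startup from the snapshot plus any newer events."""
    state = {"last_seq": 0, "completed": [], "in_flight": []}
    if snapshot_path.exists():
        state = json.loads(snapshot_path.read_text(encoding="utf-8"))
    for ev in read_events(log_path):
        if ev["seq"] <= state["last_seq"]:
            continue  # already reflected in the snapshot
        if ev["kind"] == "tool_call":
            state["in_flight"].append(ev["call_id"])
        elif ev["kind"] == "tool_result":
            if ev["call_id"] in state["in_flight"]:
                state["in_flight"].remove(ev["call_id"])
            state["completed"].append(ev["call_id"])
        state["last_seq"] = ev["seq"]
    return state
```

In this sketch, the `tool_call` event is appended before dispatch and the `tool_result` event as soon as the result is observed, so the log is always at least as fresh as the snapshot. Any call still listed under `in_flight` after `reconcile` returns had no logged result when the process stopped, which makes it a candidate for being marked interrupted and retried.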

Acceptance criteria

  • A hot reload or server restart does not cause the orchestrator to lose track of an active agent call.
  • Resuming an orchestrated session shows the last completed tool call/result accurately.
  • Stale or divergent target-agent state is detected and surfaced instead of silently continuing from the wrong point.
  • Regression coverage exists for the interrupted-after-tool-result scenario.
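The regression criterion could be exercised with a test shaped roughly like the one below. It is a self-contained sketch: `Store`, `Orchestrator`, and `SimulatedRestart` are hypothetical in-memory stand-ins, not coding-orchestrator classes, and the restart is simulated by raising an exception between logging the tool result and updating the snapshot.

```python
import unittest


class SimulatedRestart(Exception):
    """Stands in for a hot reload or process kill."""


class Store:
    """Hypothetical durable store; the instance outlives any one orchestrator."""

    def __init__(self):
        self.events = []       # append-only list of (kind, call_id)
        self.snapshot_seq = 0  # number of events the snapshot has absorbed


class Orchestrator:
    def __init__(self, store, crash_after_result=False):
        self.store = store
        self.crash_after_result = crash_after_result

    def run_tool(self, call_id):
        self.store.events.append(("tool_call", call_id))
        self.store.events.append(("tool_result", call_id))
        if self.crash_after_result:
            raise SimulatedRestart()  # interrupted before the checkpoint
        self.store.snapshot_seq = len(self.store.events)

    def resume(self):
        """Return call ids whose results are logged but missing from the snapshot."""
        tail = self.store.events[self.store.snapshot_seq:]
        recovered = [cid for kind, cid in tail if kind == "tool_result"]
        self.store.snapshot_seq = len(self.store.events)
        return recovered


class InterruptedAfterToolResult(unittest.TestCase):
    def test_result_survives_restart(self):
        store = Store()
        with self.assertRaises(SimulatedRestart):
            Orchestrator(store, crash_after_result=True).run_tool("c1")
        self.assertEqual(store.snapshot_seq, 0)  # snapshot lags the event log

        restarted = Orchestrator(store)  # fresh process, same durable store
        self.assertEqual(restarted.resume(), ["c1"])
        self.assertEqual(store.snapshot_seq, 2)
        self.assertEqual(restarted.resume(), [])  # resuming twice is a no-op
```

A real test would replace the in-memory `Store` with the orchestrator's actual persistence layer and trigger the restart at the same point, between tool result persistence and the next model/agent step.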
