Problem
When coding-orchestrator is used to run or manage an agent while the project is being edited, a hot reload or process restart can interrupt the run after a tool call completes but before the orchestrator has fully persisted or reconciled the agent state.
This surfaced while investigating a run targeting /Users/germanescobar/Projects/pocs/coding-agent: the target agent's event log had a completed tool_result, but the persisted session snapshot lagged behind. From the orchestrator user's perspective, this looks like the orchestrator lost track of the agent call after reload.
Why this matters
coding-orchestrator should be able to orchestrate work on itself and on sibling projects. If editing/reloading the orchestrator can strand active calls or leave persisted state inconsistent, self-improvement workflows become fragile.
Failure mode to account for
- Orchestrator starts or resumes an agent run.
- The agent/model emits a tool call.
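- The tool call completes and at least part of the resulting event/state is written (in the observed run, the tool_result reached the target agent's event log).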
- Hot reload or process restart occurs before the orchestrator records the next durable checkpoint.
- The UI/runtime resumes from stale state and no longer knows whether the call is pending, completed, or needs reconciliation.
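The ambiguity in the last step can be made concrete with a small sketch. It assumes, hypothetically, that every event carries a monotonically increasing sequence number and that the session snapshot records the last sequence number it has absorbed. None of these names (Event, CallState, classify_call, snapshot_seq) come from coding-orchestrator; they are only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CallState(Enum):
    PENDING = "pending"  # tool_call seen, no tool_result yet
    COMPLETED = "completed"  # tool_result seen and covered by the snapshot
    NEEDS_RECONCILIATION = "needs_reconciliation"  # tool_result seen, snapshot behind


@dataclass(frozen=True)
class Event:
    seq: int  # hypothetical monotonically increasing sequence number
    kind: str  # "tool_call" or "tool_result"
    call_id: str


def classify_call(events: list[Event], snapshot_seq: int, call_id: str) -> CallState:
    """Classify one tool call from the event log and the snapshot's last applied seq."""
    call_seen = False
    result_seq = None
    for event in events:
        if event.call_id != call_id:
            continue
        if event.kind == "tool_call":
            call_seen = True
        elif event.kind == "tool_result":
            result_seq = event.seq
    if not call_seen:
        raise KeyError(f"unknown call_id: {call_id}")
    if result_seq is None:
        return CallState.PENDING
    if result_seq <= snapshot_seq:
        return CallState.COMPLETED
    return CallState.NEEDS_RECONCILIATION


events = [
    Event(seq=1, kind="tool_call", call_id="call-1"),
    Event(seq=2, kind="tool_result", call_id="call-1"),
]
# The snapshot was last written after seq 1, so it lags the completed result.
state = classify_call(events, snapshot_seq=1, call_id="call-1")
```

With the snapshot at sequence 1 and the tool_result at sequence 2, the call lands in the "needs reconciliation" bucket, which is the situation described in the Problem section.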
Possible fixes
- Persist an orchestrator-side durable checkpoint before dispatching an agent/tool call.
- Persist each tool result or agent event as soon as it is observed, not only after a full run step finishes.
- On server startup or session resume, reconcile active runs from the latest stored event stream and session snapshot.
- Mark interrupted in-flight runs explicitly when possible, then allow recovery/retry from the last known durable step.
- Add tests for restart/reload between tool result persistence and the next model/agent step.
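A minimal sketch of how the second, third, and fourth ideas might fit together, assuming a JSON-lines event log and a JSON snapshot file. The file layout, field names (seq, kind, call_id, last_seq, open_calls, status), and function names are assumptions made for this example, not the orchestrator's actual storage format.

```python
import json
import os
import tempfile
from pathlib import Path


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a JSON line and flush it to disk before returning."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())


def write_snapshot(snapshot_path: Path, snapshot: dict) -> None:
    """Write the snapshot atomically: temp file first, then rename over the old one."""
    tmp = snapshot_path.with_suffix(".tmp")
    with tmp.open("w", encoding="utf-8") as f:
        json.dump(snapshot, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, snapshot_path)


def reconcile(log_path: Path, snapshot_path: Path) -> dict:
    """Rebuild session state on startup from the snapshot plus any newer events."""
    snapshot = {"last_seq": 0, "open_calls": [], "status": "running"}
    if snapshot_path.exists():
        snapshot = json.loads(snapshot_path.read_text(encoding="utf-8"))
    open_calls = set(snapshot["open_calls"])
    last_seq = snapshot["last_seq"]
    if log_path.exists():
        for line in log_path.read_text(encoding="utf-8").splitlines():
            event = json.loads(line)
            if event["seq"] <= last_seq:
                continue  # already reflected in the snapshot
            if event["kind"] == "tool_call":
                open_calls.add(event["call_id"])
            elif event["kind"] == "tool_result":
                open_calls.discard(event["call_id"])
            last_seq = event["seq"]
    # Calls with no recorded result were in flight when the process stopped.
    status = "interrupted" if open_calls else "resumable"
    recovered = {"last_seq": last_seq, "open_calls": sorted(open_calls), "status": status}
    write_snapshot(snapshot_path, recovered)
    return recovered


workdir = Path(tempfile.mkdtemp())
log_path = workdir / "events.jsonl"
snapshot_path = workdir / "snapshot.json"
append_event(log_path, {"seq": 1, "kind": "tool_call", "call_id": "call-1"})
write_snapshot(snapshot_path, {"last_seq": 1, "open_calls": ["call-1"], "status": "running"})
append_event(log_path, {"seq": 2, "kind": "tool_result", "call_id": "call-1"})
# The process restarts here, before the snapshot is updated.
recovered = reconcile(log_path, snapshot_path)
```

In this sketch the event log is treated as the source of truth and the snapshot as a cache that can always be rebuilt. A real implementation would also need to decide how to handle a torn final line in the log, which this example does not attempt.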
Acceptance criteria
- A hot reload or server restart does not cause the orchestrator to lose track of an active agent call.
- Resuming an orchestrated session shows the last completed tool call/result accurately.
- Stale or divergent target-agent state is detected and surfaced instead of silently continuing from the wrong point.
- Regression coverage exists for the interrupted-after-tool-result scenario.
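One possible shape for the regression coverage is a test that injects the restart at the exact point between recording the tool result and writing the checkpoint. The sketch below uses an in-memory stand-in rather than the real orchestrator; InMemoryStore, run_step, resume, and SimulatedRestart are hypothetical names introduced only for this example.

```python
import unittest


class SimulatedRestart(Exception):
    """Raised by the test to stand in for a hot reload or process restart."""


class InMemoryStore:
    """Stand-in for durable storage; it survives the simulated restart."""

    def __init__(self) -> None:
        self.events: list[dict] = []
        self.snapshot_seq = 0


def run_step(store: InMemoryStore, call_id: str, crash_after_result: bool = False) -> None:
    """One orchestrator step: record the call, record the result, then checkpoint."""
    store.events.append({"seq": len(store.events) + 1, "kind": "tool_call", "call_id": call_id})
    store.events.append({"seq": len(store.events) + 1, "kind": "tool_result", "call_id": call_id})
    if crash_after_result:
        raise SimulatedRestart()  # fault injected before the checkpoint is written
    store.snapshot_seq = store.events[-1]["seq"]


def resume(store: InMemoryStore) -> dict:
    """Report what a resumed session should show, flagging a lagging snapshot."""
    results = [e for e in store.events if e["kind"] == "tool_result"]
    last_result = results[-1] if results else None
    log_seq = store.events[-1]["seq"] if store.events else 0
    return {
        "last_completed_call": last_result["call_id"] if last_result else None,
        "snapshot_stale": store.snapshot_seq < log_seq,
    }


class InterruptedAfterToolResultTest(unittest.TestCase):
    def test_resume_reports_completed_call_and_stale_snapshot(self) -> None:
        store = InMemoryStore()
        with self.assertRaises(SimulatedRestart):
            run_step(store, "call-1", crash_after_result=True)
        resumed = resume(store)
        self.assertEqual(resumed["last_completed_call"], "call-1")
        self.assertTrue(resumed["snapshot_stale"])

    def test_clean_step_leaves_snapshot_current(self) -> None:
        store = InMemoryStore()
        run_step(store, "call-1")
        self.assertFalse(resume(store)["snapshot_stale"])
```

Saved as a module, the test case can be run with `python -m unittest`. The first test covers the second and third criteria (the last completed result is shown, and the stale snapshot is detected); the second guards against false positives on an uninterrupted step.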