Make orchestrated agent runs resilient to hot reload and process restarts

## Problem

When `coding-orchestrator` is used to run or manage an agent while the project is being edited, a hot reload or process restart can interrupt the run after a tool call completes but before the orchestrator has fully persisted or reconciled the agent state.

This surfaced while investigating a run targeting `/Users/germanescobar/Projects/pocs/coding-agent`: the target agent's event log had a completed `tool_result`, but the persisted session snapshot lagged behind. From the orchestrator user's perspective, this looks like the orchestrator lost track of the agent call after reload.

## Why this matters

`coding-orchestrator` should be able to orchestrate work on itself and on sibling projects. If editing/reloading the orchestrator can strand active calls or leave persisted state inconsistent, self-improvement workflows become fragile.

## Failure mode to account for

- Orchestrator starts or resumes an agent run.
- The agent/model emits a tool call.
- The tool call completes and some event/state is written.
- Hot reload or process restart occurs before the orchestrator records the next durable checkpoint.
- The UI/runtime resumes from stale state and no longer knows whether the call is pending, completed, or needs reconciliation.

## Possible fixes

- Persist an orchestrator-side durable checkpoint before dispatching an agent/tool call.
- Persist each tool result or agent event as soon as it is observed, not only after a full run step finishes.
- On server startup or session resume, reconcile active runs from the latest stored event stream and session snapshot.
- Mark interrupted in-flight runs explicitly when possible, then allow recovery/retry from the last known durable step.
- Add tests for restart/reload between tool result persistence and the next model/agent step.

## Acceptance criteria

- A hot reload or server restart does not cause the orchestrator to lose track of an active agent call.
- Resuming an orchestrated session shows the last completed tool call/result accurately.
- Stale or divergent target-agent state is detected and surfaced instead of silently continuing from the wrong point.
- Regression coverage exists for the interrupted-after-tool-result scenario.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make orchestrated agent runs resilient to hot reload and process restarts #54

Problem

Why this matters

Failure mode to account for

Possible fixes

Acceptance criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Make orchestrated agent runs resilient to hot reload and process restarts #54

Description

Problem

Why this matters

Failure mode to account for

Possible fixes

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions