Skip to content

replace_subtree: a replacement node reusing the target (or any removed) id inherits its committed state and never re-executes #190

@duncanita

Description

@duncanita

Summary

When a replace_subtree mutation supplies a replacement node that reuses an id from the removed set (the target id, or any exclusive descendant of the target), that node silently inherits the removed node's committed result projection in the appended revision and is never re-executed — even though replace_subtree has just rewired new predecessors into it. The caller asked to replace the subtree and got inheritance instead.

Where

DefinitionEditor#plan_replace_subtree (lib/dag/definition_editor.rb:54-75, current main @ v1.5.0; behaviour also present in the v1.4 contract):

removed   = definition.exclusive_descendants_of(target, include_self: true)
impacted  = definition.descendants_of(target, include_self: true) - removed
kept_nodes = definition.nodes - removed
conflicts  = kept_nodes & replacement_nodes
raise ArgumentError, "replacement nodes collide with preserved nodes: ..." unless conflicts.empty?
...
DAG::PlanResult.valid(new_definition:, invalidated_node_ids: impacted)
  • The collision guard only rejects replacement ids that collide with kept_nodes (preserved). A replacement id that equals the target (or any other id in removed) is not a conflict and is allowed.
  • invalidated_node_ids = impacted = descendants_of(target) - removed. The reused target id is in removed, so it is not in invalidated_node_ids.
  • On revision append, a node id that existed before, exists after, and is not in invalidated_node_ids carries its committed node-state / committed-result projection forward. So the "replacement" node keeps the stale result and is treated as already done.

Net effect: the replacement graph is spliced in, the new entry edges feed the reused-id node, but the kernel never schedules it because it looks already-committed.

Why this is surprising

replace_subtree is the "throw this subtree away and run this other graph in its place" primitive. A replacement node that happens to reuse the target's id is, intuitively, a fresh node (new inbound edges, possibly new config) — not a continuation of the committed one. Inheriting committed state for it is almost never what the caller wants, and it fails silently (status stays success, the workflow finalizes with the pre-replacement output).

Minimal repro (conceptual)

  1. Build a workflow A -> B, where B is a model/tool node. Run to completion; B commits with result R0.
  2. Apply a replace_subtree with target = B and a replacement graph { C (new tool node), B (same id, new config) }, edges C -> B, entry = [C], exit = [B].
  3. C runs and commits. B is not re-executed — it keeps R0 (its pre-mutation committed result) instead of running against C's output.

(Real-world hit: in Delphi a claim_verifier correction replan injected file_read(missing_file) -> synthesis reusing the synthesis id. The read ran, the synthesis never re-ran, and the run finalized with the un-corrected answer while reporting success.)

Proposals (any one resolves it; not mutually exclusive)

  1. Re-execute reused removed ids (recommended). Treat replacement node ids that intersect removed as fresh: add replacement_nodes & removed to invalidated_node_ids (and/or reset their committed projection) so they re-run. A removed-then-re-added id is a replacement, not a continuation.
  2. Reject the collision loudly. Extend the existing conflict guard to also raise when replacement_nodes & removed is non-empty, mirroring the kept_nodes guard. This would have surfaced the bug as an error at mutation time rather than a silent no-op, forcing callers to use fresh ids.
  3. At minimum, document the contract: replacement node ids must be disjoint from exclusive_descendants_of(target, include_self: true); reusing them yields committed-state inheritance.

Option 1 is the least surprising for callers; option 2 is the safest (fail-fast). Option 3 alone leaves the silent footgun in place.

Current workaround (caller side)

Use a fresh id for any replacement node that should re-run, and make it the exit_node, so the target's successors rewire to it. Delphi does this in both replan paths (native-tool-call replan uses <node>_rerun; the claim-verifier replan now uses <target>_reverify). Documenting/enforcing this in the kernel would remove a sharp edge that's easy to hit.

Version

Observed against the v1.4.0 contract (the tag Delphi pins); plan_replace_subtree/build_replaced_graph are unchanged on main @ v1.5.0.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions