
chore(prompts): drop vestigial [MUTATION COMPLETE] / [SUMMARY] protocol #38

Merged
KE7 merged 4 commits into main from chore/drop-vestigial-mutation-summary
May 30, 2026

Conversation

Owner

@KE7 KE7 commented May 20, 2026

Summary

The agent prompt templates instruct each coding agent (Claude Code / Codex / Cursor / Gemini / opencode) to emit a sentinel — [MUTATION COMPLETE], [MERGE COMPLETE], or [SEED GENERATION COMPLETE] — when it's done. HELIX never parses any of these sentinels. Subprocess exit is the actual termination signal; every backend handles its own stop logic internally.

Separately, mutator.parse_mutation_summary scans for [SUMMARY]…[END SUMMARY] key/value blocks that no prompt instructs the agent to emit and no production code path invokes.

So the apparatus is vestigial on both ends:

| Marker | Asks the agent for it? | Parses it? | Wired to anything? |
|---|---|---|---|
| [MUTATION COMPLETE] | yes | no | no |
| [MERGE COMPLETE] | yes | no | no |
| [SEED GENERATION COMPLETE] | yes | no | no |
| [SUMMARY]…[END SUMMARY] | no | yes (parse_mutation_summary) | no (dead code) |
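
For readers who never saw the removed parser, the sketch below shows roughly what scanning a transcript for a [SUMMARY]…[END SUMMARY] key/value block looks like. It is a hypothetical reconstruction, not the deleted parse_mutation_summary: the name parse_summary_block, the "key: value" line format, and the regex are all assumptions. It mainly illustrates why such a parser yields an empty dict when no prompt ever asks for the block.

```python
import re

# Assumed grammar: the real parser's exact pattern is not shown in this PR.
_SUMMARY_RE = re.compile(r"\[SUMMARY\](.*?)\[END SUMMARY\]", re.DOTALL)


def parse_summary_block(transcript: str) -> dict[str, str]:
    """Return key/value pairs from the first [SUMMARY] block, or {} if absent."""
    match = _SUMMARY_RE.search(transcript)
    if match is None:
        # No prompt requests the block, so this is the path taken in practice.
        return {}
    pairs: dict[str, str] = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs
```

A transcript without the markers, which is every transcript the current prompts can produce, maps to an empty dict.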

What changed

  • src/helix/mutator.py: drop the trailing "print [MUTATION COMPLETE]" / "print [SEED GENERATION COMPLETE]" lines from AUTONOMOUS_SYSTEM_PROMPT, MUTATION_PROMPT_TEMPLATE, and SEEDLESS_INIT_PROMPT_TEMPLATE. Delete parse_mutation_summary (zero production callers).
  • src/helix/merger.py: drop the trailing "print [MERGE COMPLETE]" line from MERGE_PROMPT_TEMPLATE.
  • tests/unit/test_semlog.py: delete — sole consumer of parse_mutation_summary.
  • tests/unit/test_mutator.py, tests/unit/test_mutator_seedless.py, tests/unit/test_merger.py: drop three prompt-substring assertions that pinned the presence of the removed sentinels.

Net: -221 lines / +1 line. Every editing instruction in every prompt is preserved verbatim — the only text removed from prompts is the "print sentinel X when done" instruction and the sentinel itself.

Why this is safe

Termination model is unchanged because nothing in HELIX ever depended on the sentinel:

  1. Each agent backend has its own internal stop logic (model finish_reason, opencode's step_finish, Codex CLI exit, etc.). HELIX waits on subprocess exit and treats whatever was written to stdout/stderr as the captured transcript.
  2. parse_mutation_summary was already returning {} in practice because the prompt never asked the agent for [SUMMARY]…[END SUMMARY] blocks. Removing the parser changes no observable behaviour.
  3. Mutator/merger output processing reads agent stdout for tool-call counting (_count_*_tool_events), session-id capture, and rate-limit detection — none of which look for the sentinel.
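
The termination model described above can be sketched as follows. This is a minimal illustration, assuming a blocking subprocess call; run_agent and FAKE_AGENT are hypothetical names, and a child Python process stands in for a real agent CLI. HELIX's actual runner is not shown in this PR.

```python
import subprocess
import sys


def run_agent(cmd: list[str]) -> tuple[int, str]:
    """Block until the child exits; return (exit code, captured transcript)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # No sentinel scan: the process exiting is what ends the run.
    return proc.returncode, proc.stdout + proc.stderr


# Stand-in for an agent CLI (assumption): prints one line and exits.
FAKE_AGENT = [sys.executable, "-c", "print('edited 2 files')"]
```

Calling run_agent(FAKE_AGENT) returns as soon as the child exits, whether or not the transcript contains any completion marker.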

Test plan

  • uv run pytest tests/unit/ -q — 851 passed (873 → 851 after removing 22 sentinel-protocol assertions across test_semlog.py and three substring checks).
  • uv run mypy --strict src/helix/ — clean (29 source files).
  • grep -rn "MUTATION COMPLETE\|MERGE COMPLETE\|SEED GENERATION COMPLETE\|parse_mutation_summary" src/ tests/ — zero matches.
  • CI runs the same pytest and mypy commands on Python 3.11 and 3.12.

Independent of PR #37

This PR branches off origin/main directly and is not stacked on the cache PR. The two can land in either order.

🤖 Generated with Claude Code

KE7 and others added 4 commits May 20, 2026 12:07
The four prompt templates (``AUTONOMOUS_SYSTEM_PROMPT``,
``MUTATION_PROMPT_TEMPLATE``, ``SEEDLESS_INIT_PROMPT_TEMPLATE``,
``MERGE_PROMPT_TEMPLATE``) instructed the agent to emit a
``[MUTATION COMPLETE]`` / ``[MERGE COMPLETE]`` / ``[SEED GENERATION
COMPLETE]`` sentinel when finished.  HELIX never parsed any of those
sentinels — subprocess exit is the actual stop signal and every backend
handles termination internally.

Separately, ``mutator.parse_mutation_summary`` scanned for
``[SUMMARY]...[END SUMMARY]`` key/value blocks that no prompt asked the
agent to emit and no production code path called.  Dead code, dead
tests, dead protocol on both ends.

Removed:
- the trailing "print this completion marker" sentence + sentinel line
  from all four prompt templates (editing instructions preserved).
- ``mutator.parse_mutation_summary`` (zero production callers).
- ``tests/unit/test_semlog.py`` (sole consumer of the parser).
- three prompt-substring assertions in ``test_mutator.py``,
  ``test_mutator_seedless.py``, and ``test_merger.py`` that pinned the
  presence of the removed sentinel strings.

851 unit tests pass (873 → 851 after dropping 22 sentinel-protocol
assertions); ``mypy --strict src/helix/`` clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirror GEPA O.A.'s ``_build_reflection_prompt_template`` accumulator
pattern (``gepa/optimize_anything.py:501-596``) in ``build_mutation_prompt``
and ``build_merge_prompt``: each section is appended only when its
content is non-empty, instead of rendering a placeholder string like
``"(no additional background provided)"`` / ``"(no scores recorded)"``
/ ``"(no diff — candidates are identical)"`` / ``"(no evaluation
data)"`` that taught the agent nothing.
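
A minimal sketch of the accumulator pattern described above, assuming a plain list of section strings joined at the end. build_prompt and its parameters are hypothetical stand-ins, not the real build_mutation_prompt signature.

```python
from typing import Optional


def build_prompt(task: str, objective: str = "", background: Optional[str] = None) -> str:
    """Append each section only when it has content; never emit a placeholder."""
    sections: list[str] = []
    if objective.strip():
        sections.append(f"## Objective\n{objective.strip()}")
    # The task block is unconditional in this sketch.
    sections.append(f"## Your Task\n{task}")
    if background and background.strip():
        sections.append(f"## Background / Context\n{background.strip()}")
    return "\n\n".join(sections)
```

With only a task supplied, the result contains the task block and nothing else, so the agent never reads a line like "(no additional background provided)".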

Sections now optional in ``build_mutation_prompt``:
- ``## Objective`` — when ``objective`` is empty.
- ``## Current Evaluation Scores`` — when ``eval_result.scores`` is empty.
- ``## Diagnostics`` — when neither ``per_example_side_info`` nor
  ``side_info`` is populated (already conditional pre-PR).
- ``## Evaluator Notes`` — when ``asi.log`` is empty (already
  conditional pre-PR).
- ``## Evaluator Output`` — when the evaluator succeeded and both
  stdout/stderr are empty.  Failed evaluator (non-zero ``_returncode``)
  still emits the section with ``(no stdout)`` / ``(no stderr)``
  placeholders, because the agent needs to know the failure produced
  no output to inspect (a meaningful diagnostic on its own).  Partial
  coverage now renders only the stream that has content instead of
  padding the empty one.
- ``### Extra Evaluator Info`` — when no free-form ASI keys (already
  conditional pre-PR).
- ``## Background / Context`` — when ``background`` is None/empty.
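
The ``## Evaluator Output`` rules above are the least obvious of the list, so here is a hedged sketch of one way to express them. render_evaluator_output is a hypothetical name, the "stdout:" / "stderr:" labels are invented for illustration, and the sketch assumes a failed evaluator always shows both streams (with placeholders for empty ones) while a successful one shows only streams that have content.

```python
def render_evaluator_output(stdout: str, stderr: str, returncode: int) -> str:
    """Return the section text, or "" when the section should be omitted."""
    out, err = stdout.strip(), stderr.strip()
    failed = returncode != 0
    if not failed and not out and not err:
        # Success with nothing printed: omit the section entirely.
        return ""
    parts = ["## Evaluator Output"]
    if out or failed:
        # On failure an empty stream is itself diagnostic, so say so.
        parts.append("stdout:\n" + (out or "(no stdout)"))
    if err or failed:
        parts.append("stderr:\n" + (err or "(no stderr)"))
    return "\n\n".join(parts)
```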

Sections now optional in ``build_merge_prompt``:
- ``## Objective`` — when ``objective`` is empty.
- ``## Candidate A Strengths`` — when ``eval_result_a`` is None.
- ``## Candidate B Strengths`` — when ``eval_result_b`` is None.
- ``## Diff (B relative to A)`` — when the diff is empty after
  stripping.
- ``## Background / Context`` — when ``background`` is None/empty.

Always emitted:
- ``AUTONOMOUS_SYSTEM_PROMPT`` (the four "Task instructions" bullets).
- ``## Your Task`` (the editing-instruction block).

Unchanged: ``## Turn Budget`` is still emitted only when ``max_turns``
is provided (already conditional pre-PR).

Removed ``MUTATION_PROMPT_TEMPLATE`` and ``MERGE_PROMPT_TEMPLATE``
constants since the prompt is now assembled dynamically.  Extracted
new helpers ``_render_scores_section``, ``_render_extra_asi``,
``_render_diagnostics`` for consistency with the existing
``_render_evaluator_notes`` / ``_render_evaluator_output_fallback``.

Tests updated:
- ``test_default_background_when_none`` → ``test_background_section_omitted_when_none``
  in both ``test_mutator.py`` and ``test_merger.py``.
- ``test_no_scores_fallback`` → ``test_scores_section_omitted_when_empty``.
- ``test_handles_none_eval_results`` → ``test_strengths_sections_omitted_when_eval_results_none``.
- ``test_empty_diff_shows_fallback`` → ``test_diff_section_omitted_when_empty``.

Each updated test now asserts the section is *absent* from the prompt
(positive verification of the new behaviour) AND that the previous
placeholder string is also absent (so a future regression that
reintroduces the placeholder can't pass).
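
The double-absence assertion described above can be sketched like this. build_merge_prompt_stub is a hypothetical stand-in for the real builder; the header and placeholder strings are the ones named in this commit message.

```python
HEADER = "## Diff (B relative to A)"
PLACEHOLDER = "(no diff — candidates are identical)"


def build_merge_prompt_stub(diff: str) -> str:
    """Tiny stand-in builder: the diff section appears only when non-empty."""
    sections = ["## Your Task\nMerge the two candidates."]
    if diff.strip():
        sections.append(f"{HEADER}\n{diff.strip()}")
    return "\n\n".join(sections)


def test_diff_section_omitted_when_empty() -> None:
    prompt = build_merge_prompt_stub("   \n")
    # New behaviour: the section header is gone.
    assert HEADER not in prompt
    # Regression guard: the old placeholder must not come back either.
    assert PLACEHOLDER not in prompt
```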

851 unit tests pass; ``mypy --strict src/helix/`` clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-backend doc

Two related changes to ``_turn_budget_section``:

1. **Article-agreement fix.**  Pre-fix the section always rendered
   ``"You have a {N}-turn limit"``, which is ungrammatical for the
   8 / 11 / 18 / 80s cases ("a 8-turn", "a 11-turn", "a 18-turn",
   "a 80-turn"...).  New ``_indefinite_article(n)`` helper picks
   ``"a"`` vs ``"an"`` based on the spoken pronunciation of the
   leading digit group within HELIX's realistic max-turns range
   (1 ≤ n ≤ ~1000).

2. **Cross-backend enforcement docs.**  ``--max-turns N`` is passed to
   the Claude Code CLI by ``_build_cli_args`` (``mutator.py:731-732``)
   and triggers hard subprocess-level enforcement via Claude's runtime
   (the ``subtype="error_max_turns"`` response handled at
   ``mutator.py:1667-1669``).  None of the other installed backends
   (``codex``, ``cursor``, ``gemini``, ``opencode``) expose an
   equivalent CLI flag — verified against their ``--help`` output, none
   has ``--max-turns`` / ``--max-iterations`` / ``--turn-limit`` /
   ``--limit``.  For those backends the in-prompt ``## Turn Budget``
   section is a soft hint only; whether the agent self-honors it is
   entirely up to its own behaviour.

   The section is still emitted for every backend (soft hints have
   some value), but the docstring now states the enforcement asymmetry
   explicitly so callers depending on hard caps know to use the
   ``claude`` backend or add subprocess-level mechanisms (wall-clock
   timeout, sandbox limits) themselves.
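
To make the enforcement asymmetry concrete, the sketch below pairs the Claude-only flag with a wall-clock cap that works for any backend. Both function names are hypothetical and this is not HELIX's ``_build_cli_args``. The only facts taken from the text above are that ``--max-turns N`` is passed to the Claude Code CLI and that the other backends expose no equivalent flag.

```python
import subprocess
import sys
from typing import Optional


def turn_limit_args(backend: str, max_turns: Optional[int]) -> list[str]:
    """Hard turn cap exists only for the claude backend (per the text above)."""
    if backend == "claude" and max_turns is not None:
        return ["--max-turns", str(max_turns)]
    return []


def run_with_wall_clock_cap(cmd: list[str], timeout_s: float) -> Optional[int]:
    """Return the child's exit code, or None if it was killed at the deadline."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return None
    return proc.returncode


# Stand-ins for agent CLIs (assumption): one finishes at once, one hangs.
QUICK = [sys.executable, "-c", "print('done')"]
SLOW = [sys.executable, "-c", "import time; time.sleep(30)"]
```

A wall-clock cap bounds elapsed time rather than turns, so it is a coarser guarantee than ``--max-turns``, but it does not depend on the agent honouring the prompt.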

Tests: new ``TestTurnBudgetArticleAgreement`` covers (a) consonant-
leading numbers using ``"a"``, (b) vowel-leading numbers (8, 11, 18,
80s, 800s) using ``"an"``, and (c) ``max_turns=None`` returning empty.
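
One way to implement the article rule, as a sketch rather than the actual ``_indefinite_article`` body (which this message does not show). It assumes n ≥ 1 and reads the leading comma-separated digit group the way it is spoken: groups starting with 8 ("eight", "eighty", "eight hundred") and the groups 11 and 18 take "an". The sentence wording in turn_budget_line is also an assumption.

```python
from typing import Optional


def indefinite_article(n: int) -> str:
    """Pick "a" or "an" from how the leading digit group is pronounced."""
    # "1,000" -> 1 ("one thousand"), "800" -> 800, "11,000" -> 11.
    lead = int(f"{n:,}".split(",")[0])
    if str(lead).startswith("8") or lead in (11, 18):
        return "an"
    return "a"


def turn_budget_line(max_turns: Optional[int]) -> str:
    """Hypothetical one-line rendering; empty when no budget is set."""
    if max_turns is None:
        return ""
    return f"You have {indefinite_article(max_turns)} {max_turns}-turn limit."
```

Note that 180 and 1000 take "a" ("one hundred eighty", "one thousand"), which is why the check looks at the spoken leading group rather than the first digit alone.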

854 unit tests pass (851 → 854); ``mypy --strict src/helix/`` clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…bution

When a common ancestor is available, the merge prompt now renders TWO
labelled diff sections — ``git diff ancestor..candidate_a`` and
``git diff ancestor..candidate_b`` — instead of the single
``git diff candidate_a..candidate_b``.  The agent can read off each
parent's contribution directly rather than inferring three-way info
from a two-way comparison.

This is the file-hunk-level analogue of GEPA's component-wise
attribution at ``gepa/proposer/merge.py:163-191``:

  if pred_anc == pred_id1 or pred_anc == pred_id2:
      # one parent didn't change this predictor → take the other one's
      ...
  elif pred_anc != pred_id1 and pred_anc != pred_id2:
      # both diverged → tiebreak by score
      ...

GEPA's algorithm has named components.  HELIX has a worktree, so we
can't pick "component X from parent Y" deterministically — but feeding
the agent the three-way diff structure GEPA's algorithm uses gives it
the same shape of attribution information for free-form file edits.
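
For intuition, the component-wise rule can be reduced to the sketch below. It is inferred only from the two comments in the excerpt above and is not GEPA's code: pick_component, the string-valued components, and ties going to parent A are all assumptions.

```python
def pick_component(
    ancestor: str, from_a: str, from_b: str, score_a: float, score_b: float
) -> str:
    """Three-way choice for one named component."""
    if ancestor == from_a:
        # A left it untouched, so any change must have come from B.
        return from_b
    if ancestor == from_b:
        # B left it untouched, so keep A's change.
        return from_a
    # Both diverged from the ancestor: tiebreak by score (A wins ties here).
    return from_a if score_a >= score_b else from_b
```

HELIX cannot apply this rule mechanically to free-form file edits, which is why the prompt hands the agent the two ancestor-relative diffs instead.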

Behavioural changes:
- ``merge()`` gains an optional ``ancestor: Candidate | None = None``.
  When provided, computes both ancestor-relative diffs and passes them
  to the prompt builder.  When ``None``, falls back to the legacy
  single A↔B diff.
- ``build_merge_prompt`` gains three optional keyword-only parameters:
  ``ancestor_id``, ``diff_a_from_ancestor``, ``diff_b_from_ancestor``.
  Two-diff form requires all three; any half-configured combination
  defensively falls back to the single A↔B path.
- A dedicated ``MERGE_TASK_INSTRUCTIONS_TWO_DIFF`` task block
  accompanies the two-diff form.  It explicitly tells the agent that
  Candidate A's contribution is already in the working tree (so it
  doesn't re-apply it) and that B's contribution is what needs to be
  brought in.  Single-diff form retains the legacy task framing
  unchanged.
- ``evolution._run_evolution_impl`` resolves the ancestor candidate
  from the frontier's append-only candidate map (using the public
  ``frontier.candidates`` view) and passes it to ``merge()``.  When
  the ancestor isn't resolvable (defensive: lineage / frontier drift),
  logs a warning that names the merge_id and falls back to single-diff.
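
The selection between the two diff forms can be sketched as below. render_diff_sections is a hypothetical helper, "configured" is assumed to mean "not None", and the two ancestor-relative headers are invented for illustration; only ``## Diff (B relative to A)`` appears in this PR's text.

```python
from typing import Optional


def render_diff_sections(
    diff_a_to_b: str,
    *,
    ancestor_id: Optional[str] = None,
    diff_a_from_ancestor: Optional[str] = None,
    diff_b_from_ancestor: Optional[str] = None,
) -> list[str]:
    """Two-diff form needs all three ancestor arguments; otherwise fall back."""
    if (
        ancestor_id is not None
        and diff_a_from_ancestor is not None
        and diff_b_from_ancestor is not None
    ):
        sections: list[str] = []
        # An empty side is omitted rather than padded.
        if diff_a_from_ancestor.strip():
            sections.append(f"## Diff (ancestor {ancestor_id} to A)\n{diff_a_from_ancestor.strip()}")
        if diff_b_from_ancestor.strip():
            sections.append(f"## Diff (ancestor {ancestor_id} to B)\n{diff_b_from_ancestor.strip()}")
        return sections
    # Half-configured or no ancestor: legacy single A-to-B diff.
    if diff_a_to_b.strip():
        return [f"## Diff (B relative to A)\n{diff_a_to_b.strip()}"]
    return []
```

Keeping the fallback inside the builder means a caller that passes only ``ancestor_id`` still gets a coherent single-diff prompt instead of a partial three-way one.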

Tests (8 new in ``test_merger.py``):
- ``test_emits_two_ancestor_relative_sections`` — happy path renders
  both ancestor-relative sections and omits the legacy A↔B header.
- ``test_two_diff_form_uses_two_diff_task_block`` /
  ``test_single_diff_form_uses_single_diff_task_block`` — regression
  pins on which task-instruction block accompanies which diff form.
- ``test_single_diff_fallback_when_ancestor_missing`` /
  ``test_single_diff_fallback_when_ancestor_id_only`` — backward
  compat plus the half-configured-caller defensive fallback.
- ``test_two_diff_form_omits_empty_side`` — one ancestor diff empty
  → only the populated side renders.
- ``test_ancestor_triggers_two_diff_form`` /
  ``test_no_ancestor_uses_single_diff_form`` — ``merge()``-level
  assertions on the exact ``get_diff`` call sequence (two
  ancestor-anchored calls vs one A↔B call) and the resulting prompt
  content.
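
A call-sequence assertion of the kind described above can be written with ``unittest.mock``. This is a sketch: collect_diffs is a hypothetical stand-in for the diff-gathering part of ``merge()``, and the argument order of ``get_diff`` is assumed.

```python
from typing import Callable, Optional
from unittest.mock import Mock, call


def collect_diffs(
    get_diff: Callable[[str, str], str], a: str, b: str, ancestor: Optional[str] = None
) -> list[str]:
    """Two ancestor-anchored diffs when an ancestor is given, else one A-to-B diff."""
    if ancestor is not None:
        return [get_diff(ancestor, a), get_diff(ancestor, b)]
    return [get_diff(a, b)]


def check_call_sequences() -> None:
    get_diff = Mock(return_value="+line")
    collect_diffs(get_diff, "a1", "b1", ancestor="root")
    # Exact sequence: ancestor-to-A first, then ancestor-to-B.
    assert get_diff.call_args_list == [call("root", "a1"), call("root", "b1")]

    get_diff.reset_mock()
    collect_diffs(get_diff, "a1", "b1")
    assert get_diff.call_args_list == [call("a1", "b1")]
```

Asserting on the full ``call_args_list`` rather than ``call_count`` also catches a swapped argument order.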

862 unit tests pass (854 → 862); ``mypy --strict src/helix/`` clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@KE7 KE7 merged commit 394a1b7 into main May 30, 2026
2 checks passed
@KE7 KE7 deleted the chore/drop-vestigial-mutation-summary branch May 30, 2026 23:54