chore(prompts): drop vestigial [MUTATION COMPLETE] / [SUMMARY] protocol #38
Merged
The four prompt templates (``AUTONOMOUS_SYSTEM_PROMPT``, ``MUTATION_PROMPT_TEMPLATE``, ``SEEDLESS_INIT_PROMPT_TEMPLATE``, ``MERGE_PROMPT_TEMPLATE``) instructed the agent to emit a ``[MUTATION COMPLETE]`` / ``[MERGE COMPLETE]`` / ``[SEED GENERATION COMPLETE]`` sentinel when finished. HELIX never parsed any of those sentinels: subprocess exit is the actual stop signal and every backend handles termination internally.

Separately, ``mutator.parse_mutation_summary`` scanned for ``[SUMMARY]...[END SUMMARY]`` key/value blocks that no prompt asked the agent to emit and no production code path called. Dead code, dead tests, dead protocol on both ends.

Removed:
- the trailing "print this completion marker" sentence + sentinel line from all four prompt templates (editing instructions preserved).
- ``mutator.parse_mutation_summary`` (zero production callers).
- ``tests/unit/test_semlog.py`` (sole consumer of the parser).
- three prompt-substring assertions in ``test_mutator.py``, ``test_mutator_seedless.py``, and ``test_merger.py`` that pinned the presence of the removed sentinel strings.

851 unit tests pass (873 → 851 after dropping 22 sentinel-protocol assertions); ``mypy --strict src/helix/`` clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirror GEPA O.A.'s ``_build_reflection_prompt_template`` accumulator pattern (``gepa/optimize_anything.py:501-596``) in ``build_mutation_prompt`` and ``build_merge_prompt``: each section is appended only when its content is non-empty, instead of rendering a placeholder string like ``"(no additional background provided)"`` / ``"(no scores recorded)"`` / ``"(no diff — candidates are identical)"`` / ``"(no evaluation data)"`` that taught the agent nothing.

Sections now optional in ``build_mutation_prompt`` (each is omitted in the case listed):
- ``## Objective``: when ``objective`` is empty.
- ``## Current Evaluation Scores``: when ``eval_result.scores`` is empty.
- ``## Diagnostics``: when neither ``per_example_side_info`` nor ``side_info`` is populated (already conditional pre-PR).
- ``## Evaluator Notes``: when ``asi.log`` is empty (already conditional pre-PR).
- ``## Evaluator Output``: when the evaluator succeeded and both stdout and stderr are empty. A failed evaluator (non-zero ``_returncode``) still emits the section with ``(no stdout)`` / ``(no stderr)`` placeholders, because the agent needs to know the failure produced no output to inspect (a meaningful diagnostic on its own). Partial coverage now renders only the stream that has content instead of padding the empty one.
- ``### Extra Evaluator Info``: when there are no free-form ASI keys (already conditional pre-PR).
- ``## Background / Context``: when ``background`` is None/empty.

Sections now optional in ``build_merge_prompt`` (same convention):
- ``## Objective``: when ``objective`` is empty.
- ``## Candidate A Strengths``: when ``eval_result_a`` is None.
- ``## Candidate B Strengths``: when ``eval_result_b`` is None.
- ``## Diff (B relative to A)``: when the diff is empty after stripping.
- ``## Background / Context``: when ``background`` is None/empty.

Always emitted:
- ``AUTONOMOUS_SYSTEM_PROMPT`` (the four "Task instructions" bullets).
- ``## Your Task`` (the editing-instruction block).

Unchanged: ``## Turn Budget`` is emitted only when ``max_turns`` is provided (already conditional pre-PR).

Removed the ``MUTATION_PROMPT_TEMPLATE`` and ``MERGE_PROMPT_TEMPLATE`` constants, since the prompt is now assembled dynamically. Extracted new helpers ``_render_scores_section``, ``_render_extra_asi``, and ``_render_diagnostics`` for consistency with the existing ``_render_evaluator_notes`` / ``_render_evaluator_output_fallback``.

Tests updated:
- ``test_default_background_when_none`` → ``test_background_section_omitted_when_none`` in both ``test_mutator.py`` and ``test_merger.py``.
- ``test_no_scores_fallback`` → ``test_scores_section_omitted_when_empty``.
- ``test_handles_none_eval_results`` → ``test_strengths_sections_omitted_when_eval_results_none``.
- ``test_empty_diff_shows_fallback`` → ``test_diff_section_omitted_when_empty``.

Each updated test now asserts that the section is *absent* from the prompt (positive verification of the new behaviour) AND that the previous placeholder string is also absent (so a future regression that reintroduces the placeholder can't pass).

851 unit tests pass; ``mypy --strict src/helix/`` clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
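The accumulator pattern described in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not HELIX's implementation: ``build_prompt_sketch``, ``render_evaluator_output_sketch``, their parameters, and the simplified section bodies are all hypothetical, and the handling of a failed run with only one empty stream is a guess.

```python
from typing import Dict, List, Optional


def render_evaluator_output_sketch(stdout: str, stderr: str, returncode: int) -> str:
    """Hypothetical renderer for the ``## Evaluator Output`` rule.

    Success: render only the streams that have content (nothing if both are empty).
    Failure: pad empty streams with explicit placeholders (assumed behaviour).
    """
    failed = returncode != 0
    parts: List[str] = []
    if stdout.strip():
        parts.append("stdout:\n" + stdout.strip())
    elif failed:
        parts.append("(no stdout)")
    if stderr.strip():
        parts.append("stderr:\n" + stderr.strip())
    elif failed:
        parts.append("(no stderr)")
    if not parts:
        return ""  # caller omits the whole section
    return "## Evaluator Output\n" + "\n".join(parts)


def build_prompt_sketch(
    task: str,
    objective: str = "",
    scores: Optional[Dict[str, float]] = None,
    background: Optional[str] = None,
    evaluator_output: str = "",
) -> str:
    """Append each section only when it has content; never emit a placeholder."""
    sections: List[str] = []
    if objective.strip():
        sections.append("## Objective\n" + objective.strip())
    if scores:
        lines = "\n".join(f"- {name}: {value}" for name, value in scores.items())
        sections.append("## Current Evaluation Scores\n" + lines)
    if evaluator_output:
        sections.append(evaluator_output)
    if background and background.strip():
        sections.append("## Background / Context\n" + background.strip())
    # The task block is the one section that is always emitted.
    sections.append("## Your Task\n" + task)
    return "\n\n".join(sections)
```

With only a task supplied, the sketch returns just the ``## Your Task`` block, which is the property the renamed ``*_omitted_when_*`` tests pin down.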
…-backend doc
Two related changes to ``_turn_budget_section``:
1. **Article-agreement fix.** Pre-fix the section always rendered
``"You have a {N}-turn limit"``, which is ungrammatical for the
8 / 11 / 18 / 80s cases ("a 8-turn", "a 11-turn", "a 18-turn",
"a 80-turn"...). New ``_indefinite_article(n)`` helper picks
``"a"`` vs ``"an"`` based on the spoken pronunciation of the
leading digit group within HELIX's realistic max-turns range
(1 ≤ n ≤ ~1000).
2. **Cross-backend enforcement docs.** ``--max-turns N`` is passed to
the Claude Code CLI by ``_build_cli_args`` (``mutator.py:731-732``)
and triggers hard subprocess-level enforcement via Claude's runtime
(the ``subtype="error_max_turns"`` response handled at
``mutator.py:1667-1669``). None of the other installed backends
(``codex``, ``cursor``, ``gemini``, ``opencode``) expose an
equivalent CLI flag — verified against their ``--help`` output, none
has ``--max-turns`` / ``--max-iterations`` / ``--turn-limit`` /
``--limit``. For those backends the in-prompt ``## Turn Budget``
section is a soft hint only; whether the agent self-honors it is
entirely up to its own behaviour.
The section is still emitted for every backend (soft hints have
some value), but the docstring now states the enforcement asymmetry
explicitly so callers depending on hard caps know to use the
``claude`` backend or add subprocess-level mechanisms (wall-clock
timeout, sandbox limits) themselves.
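
The article-agreement rule from item 1 could look roughly like the sketch below. The names ``indefinite_article_sketch`` and ``turn_budget_sentence_sketch`` are hypothetical stand-ins for ``_indefinite_article`` and ``_turn_budget_section``, whose actual bodies are not shown here. The handling of values of 1000 and above (reducing to the leading thousands group) and the trailing period are assumptions.

```python
from typing import Optional


def indefinite_article_sketch(n: int) -> str:
    """Pick "a" or "an" from how the leading digit group of n is spoken."""
    if n < 0:
        raise ValueError("turn counts are non-negative")
    leading = n
    # Assumption: for n >= 1000, only the most significant three-digit
    # group decides the article (11,000 -> "eleven thousand" -> "an").
    while leading >= 1000:
        leading //= 1000
    # "eight", "eighty...", "eight hundred...", "eleven", "eighteen"
    # start with a vowel sound; every other group starts with a consonant.
    if str(leading).startswith("8") or leading in (11, 18):
        return "an"
    return "a"


def turn_budget_sentence_sketch(max_turns: Optional[int]) -> str:
    """Return the budget sentence, or an empty string when no limit is set."""
    if max_turns is None:
        return ""
    article = indefinite_article_sketch(max_turns)
    return f"You have {article} {max_turns}-turn limit."
```

Note that 110 and 180 take "a" ("one hundred ten", "one hundred eighty"): the check applies to the whole leading group, not to its first two digits.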
Tests: new ``TestTurnBudgetArticleAgreement`` covers (a) consonant-
leading numbers using ``"a"``, (b) vowel-leading numbers (8, 11, 18,
80s, 800s) using ``"an"``, and (c) ``max_turns=None`` returning empty.
854 unit tests pass (851 → 854); ``mypy --strict src/helix/`` clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…bution
When a common ancestor is available, the merge prompt now renders TWO
labelled diff sections — ``git diff ancestor..candidate_a`` and
``git diff ancestor..candidate_b`` — instead of the single
``git diff candidate_a..candidate_b``. The agent can read off each
parent's contribution directly rather than inferring three-way info
from a two-way comparison.
This is the file-hunk-level analogue of GEPA's component-wise
attribution at ``gepa/proposer/merge.py:163-191``:
    if pred_anc == pred_id1 or pred_anc == pred_id2:
        # one parent didn't change this predictor → take the other one's
        ...
    elif pred_anc != pred_id1 and pred_anc != pred_id2:
        # both diverged → tiebreak by score
        ...
GEPA's algorithm has named components. HELIX has a worktree, so we
can't pick "component X from parent Y" deterministically — but feeding
the agent the three-way diff structure GEPA's algorithm uses gives it
the same shape of attribution information for free-form file edits.
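
For readers unfamiliar with the component-wise rule, the two branches quoted above can be filled in as a toy function. This is not GEPA's code: the quoted excerpt elides the branch bodies, so ``pick_component_sketch``, its parameters, and the choice to prefer the first parent on an exact score tie are assumptions made for illustration.

```python
from typing import TypeVar

T = TypeVar("T")


def pick_component_sketch(anc: T, id1: T, id2: T, score1: float, score2: float) -> T:
    """Choose which parent's version of one named component to keep."""
    if anc == id1 or anc == id2:
        # At most one parent changed this component relative to the
        # ancestor, so take the other parent's version.
        return id2 if anc == id1 else id1
    # Both parents diverged from the ancestor: fall back to the
    # higher-scoring parent (ties go to the first parent, by assumption).
    return id1 if score1 >= score2 else id2
```

HELIX cannot apply this rule mechanically to free-form file edits, which is why the merge prompt supplies the two ancestor-relative diffs and leaves the attribution to the agent.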
Behavioural changes:
- ``merge()`` gains an optional ``ancestor: Candidate | None = None``.
When provided, computes both ancestor-relative diffs and passes them
to the prompt builder. When ``None``, falls back to the legacy
single A↔B diff.
- ``build_merge_prompt`` gains three optional keyword-only parameters:
``ancestor_id``, ``diff_a_from_ancestor``, ``diff_b_from_ancestor``.
Two-diff form requires all three; any half-configured combination
defensively falls back to the single A↔B path.
- A dedicated ``MERGE_TASK_INSTRUCTIONS_TWO_DIFF`` task block
accompanies the two-diff form. It explicitly tells the agent that
Candidate A's contribution is already in the working tree (so it
doesn't re-apply it) and that B's contribution is what needs to be
brought in. Single-diff form retains the legacy task framing
unchanged.
- ``evolution._run_evolution_impl`` resolves the ancestor candidate
from the frontier's append-only candidate map (using the public
``frontier.candidates`` view) and passes it to ``merge()``. When
the ancestor isn't resolvable (defensive: lineage / frontier drift),
logs a warning that names the merge_id and falls back to single-diff.
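
The choice between the two-diff and single-diff forms can be sketched as below. The function name ``render_merge_diff_sections_sketch`` and the two ancestor-relative header strings are hypothetical; only the ``## Diff (B relative to A)`` header and the three keyword-only parameter names come from the commits in this PR. Treating "configured" as "not ``None``" is also an assumption.

```python
from typing import List, Optional


def render_merge_diff_sections_sketch(
    diff_a_to_b: str,
    *,
    ancestor_id: Optional[str] = None,
    diff_a_from_ancestor: Optional[str] = None,
    diff_b_from_ancestor: Optional[str] = None,
) -> List[str]:
    """Return the diff sections for a merge prompt.

    The two-diff form is used only when all three ancestor arguments are
    supplied; any half-configured call falls back to the single A-vs-B diff.
    """
    if ancestor_id is None or diff_a_from_ancestor is None or diff_b_from_ancestor is None:
        body = diff_a_to_b.strip()
        return ["## Diff (B relative to A)\n" + body] if body else []

    sections: List[str] = []
    for label, diff in (("A", diff_a_from_ancestor), ("B", diff_b_from_ancestor)):
        body = diff.strip()
        if body:  # an empty side is omitted rather than padded
            sections.append(f"## Candidate {label} vs ancestor {ancestor_id}\n{body}")
    return sections
```

Passing only ``ancestor_id`` without the two diffs yields the legacy single section, mirroring the half-configured-caller fallback described above.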
Tests (6 new in ``test_merger.py``):
- ``test_emits_two_ancestor_relative_sections`` — happy path renders
both ancestor-relative sections and omits the legacy A↔B header.
- ``test_two_diff_form_uses_two_diff_task_block`` /
``test_single_diff_form_uses_single_diff_task_block`` — regression
pins on which task-instruction block accompanies which diff form.
- ``test_single_diff_fallback_when_ancestor_missing`` /
``test_single_diff_fallback_when_ancestor_id_only`` — backward
compat plus the half-configured-caller defensive fallback.
- ``test_two_diff_form_omits_empty_side`` — one ancestor diff empty
→ only the populated side renders.
- ``test_ancestor_triggers_two_diff_form`` /
``test_no_ancestor_uses_single_diff_form`` — ``merge()``-level
assertions on the exact ``get_diff`` call sequence (two
ancestor-anchored calls vs one A↔B call) and the resulting prompt
content.
862 unit tests pass (860 → 862); ``mypy --strict src/helix/`` clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
The agent prompt templates instruct each coding agent (Claude Code / Codex / Cursor / Gemini / opencode) to emit a sentinel (``[MUTATION COMPLETE]``, ``[MERGE COMPLETE]``, or ``[SEED GENERATION COMPLETE]``) when it's done. HELIX never parses any of these sentinels. Subprocess exit is the actual termination signal; every backend handles its own stop logic internally.

Separately, ``mutator.parse_mutation_summary`` scans for ``[SUMMARY]…[END SUMMARY]`` key/value blocks that no prompt instructs the agent to emit and no production code path invokes.

So the apparatus is vestigial on both ends:

| Token | Requested by a prompt? | Parsed by HELIX? |
|---|---|---|
| ``[MUTATION COMPLETE]`` | Yes | No |
| ``[MERGE COMPLETE]`` | Yes | No |
| ``[SEED GENERATION COMPLETE]`` | Yes | No |
| ``[SUMMARY]…[END SUMMARY]`` | No | Parser exists (``parse_mutation_summary``) but has no production callers |

What changed
- ``src/helix/mutator.py``: drop the trailing "print ``[MUTATION COMPLETE]``" / "print ``[SEED GENERATION COMPLETE]``" lines from ``AUTONOMOUS_SYSTEM_PROMPT``, ``MUTATION_PROMPT_TEMPLATE``, ``SEEDLESS_INIT_PROMPT_TEMPLATE``. Delete ``parse_mutation_summary`` (zero production callers).
- ``src/helix/merger.py``: drop the trailing "print ``[MERGE COMPLETE]``" line from ``MERGE_PROMPT_TEMPLATE``.
- ``tests/unit/test_semlog.py``: delete; it was the sole consumer of ``parse_mutation_summary``.
- ``tests/unit/test_mutator.py``, ``tests/unit/test_mutator_seedless.py``, ``tests/unit/test_merger.py``: drop three prompt-substring assertions that pinned the presence of the removed sentinels.

Net: -221 lines / +1 line. Every editing instruction in every prompt is preserved verbatim; the only text removed from prompts is the "print sentinel X when done" instruction and the sentinel itself.
Why this is safe
The termination model is unchanged because nothing in HELIX ever depended on the sentinel:
- Every backend handles its own stop logic internally (… ``finish_reason``, opencode's ``step_finish``, Codex CLI exit, etc.). HELIX waits on subprocess exit and treats whatever was written to stdout/stderr as the captured transcript.
- ``parse_mutation_summary`` was already returning ``{}`` in practice because the prompt never asked the agent for ``[SUMMARY]…[END SUMMARY]`` blocks. Removing the parser changes no observable behaviour.
- … (``_count_*_tool_events``), session-id capture, and rate-limit detection: none of these look for the sentinel.

Test plan
- ``uv run pytest tests/unit/ -q``: 851 passed (873 → 851 after removing 22 sentinel-protocol assertions across ``test_semlog.py`` and three substring checks).
- ``uv run mypy --strict src/helix/``: clean (29 source files).
- ``grep -rn "MUTATION COMPLETE\|MERGE COMPLETE\|SEED GENERATION COMPLETE\|parse_mutation_summary" src/ tests/``: zero matches.

Independent of PR #37
This PR is off ``origin/main`` directly, not stacked on the cache PR. Either can land in either order.

🤖 Generated with Claude Code
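
The termination model described under "Why this is safe" (wait for the subprocess to exit, keep whatever it wrote as the transcript) can be illustrated with a small sketch. ``run_agent_sketch`` is a hypothetical name and the wall-clock ``timeout_s`` parameter is an added assumption; HELIX's real invocation code is not shown in this PR description.

```python
import subprocess
import sys
from typing import List, Tuple


def run_agent_sketch(argv: List[str], timeout_s: float = 60.0) -> Tuple[int, str]:
    """Run a backend CLI and return (exit code, captured transcript).

    Process exit is the only stop signal: the output is never scanned
    for a completion sentinel.
    """
    completed = subprocess.run(
        argv,
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        timeout=timeout_s,    # raises subprocess.TimeoutExpired if exceeded
        check=False,          # a non-zero exit is reported, not raised
    )
    transcript = completed.stdout + completed.stderr
    return completed.returncode, transcript
```

For example, ``run_agent_sketch([sys.executable, "-c", "print('edited 2 files')"])`` returns exit code 0 together with the printed line, with no sentinel needed to mark completion.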