feat(cache): preserve per-(candidate, example) side_info across cache hits by KE7 · Pull Request #37 · KE7/helix

KE7 · 2026-05-20T18:36:38Z

Summary

Extend helix.eval_cache.CachedEvaluation with a per-example side_info slot so reflection feedstock (LIBERO evaluation_diagnostics, judge_metrics, evaluator_error, video_path, ...) survives a second visit to the same (candidate, example). Mirrors GEPA's OptimizeAnythingAdapter._eval_cache pattern at gepa/adapters/optimize_anything_adapter/optimize_anything_adapter.py:200-216, which already round-trips the rich (score, output, side_info) tuple precisely so reflection prompts don't lose context on cache hits.

Before: HELIX cached (output, score, objective_scores) only. Cache hits gave the mutator's ## Diagnostics section an empty {} placeholder for every previously-seen (candidate, example) pair, silently dropping the LIBERO feedback signal the second time around. The closure side-channel (fresh_side_info_by_id) in _cached_evaluate_batch was an explicit acknowledgement of this gap.

After: CachedEvaluation.side_info carries the per-example payload alongside score/output/objective_scores; the closure side-channel is gone; cache hits return the dict that was stored at the original eval time.

Bundled with: repo-wide docs hygiene sweep — removed dangling /tmp/<audit>.md doc pointers from source and tests (these referenced private audit notes that never existed in the repo), replaced synthetic /tmp/... worktree-path strings with /fake/..., and collapsed stale internal audit-section parentheticals into substantive GEPA file:line refs. See the second commit for details.

What changed

Cache feature (8c60a71):

src/helix/eval_cache.py: add side_info: dict[str, Any] | None = None slot to CachedEvaluation; thread it through put, put_batch, evaluate_with_cache_full. The evaluator callable now returns a 4-tuple (outputs, scores, objective_scores, side_infos); the helper now returns a 5-tuple (adds side_info_by_id).
src/helix/evolution.py::_cached_evaluate_batch: drop the fresh_side_info_by_id closure side-channel; read side_info straight off the cache.
tests/unit/test_eval_cache.py: 3 new positive tests, including an "evaluator must not run on full cache hit" assertion that locks in the cost-savings invariant.
tests/unit/test_evolution_minibatch.py: updated test_partial_cache_hit_per_example_fields_merge (previously asserted {} for cache-hit ids; now asserts the stored side_info comes back) + new test_partial_cache_hit_side_info_repopulated_after_fresh_eval covering the cold-miss → cold-hit cycle on _cached_evaluate_batch.
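
The cache change above can be sketched roughly as follows. This is a minimal illustration, not the HELIX implementation: the `(candidate_id, example_id)` key shape, the method signature, and the exact layout of the returned 5-tuple (in particular the trailing miss count) are assumptions, since the PR description names only `side_info_by_id` as the added element.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CachedEvaluation:
    output: Any
    score: float
    objective_scores: dict[str, float] | None = None
    side_info: dict[str, Any] | None = None  # the new per-example slot


class EvaluationCache:
    """Toy in-memory cache keyed by (candidate_id, example_id); key shape is assumed."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], CachedEvaluation] = {}

    def evaluate_with_cache_full(
        self,
        candidate_id: str,
        example_ids: list[str],
        evaluator: Callable[[list[str]], tuple[list, list, list, list]],
    ) -> tuple[list, list, list, dict, int]:
        # Only examples that were never evaluated for this candidate reach the evaluator.
        misses = [e for e in example_ids if (candidate_id, e) not in self._entries]
        if misses:
            outputs, scores, objective_scores, side_infos = evaluator(misses)
            for eid, out, sc, obj, info in zip(
                misses, outputs, scores, objective_scores, side_infos
            ):
                self._entries[(candidate_id, eid)] = CachedEvaluation(out, sc, obj, info)
        entries = [self._entries[(candidate_id, e)] for e in example_ids]
        # side_info is read straight off the cache for hits and fresh evals alike.
        side_info_by_id = {e: entry.side_info for e, entry in zip(example_ids, entries)}
        return (
            [entry.output for entry in entries],
            [entry.score for entry in entries],
            [entry.objective_scores for entry in entries],
            side_info_by_id,
            len(misses),  # hypothetical fifth element, for illustration only
        )


calls: list[list[str]] = []


def fake_evaluator(example_ids: list[str]) -> tuple[list, list, list, list]:
    """Stand-in evaluator that records each invocation and returns the 4-tuple."""
    calls.append(list(example_ids))
    return (
        [f"out-{e}" for e in example_ids],
        [1.0 for _ in example_ids],
        [None for _ in example_ids],
        [{"evaluator_error": None, "video_path": f"/fake/{e}.mp4"} for e in example_ids],
    )


cache = EvaluationCache()
first = cache.evaluate_with_cache_full("cand-0", ["ex-1", "ex-2"], fake_evaluator)
second = cache.evaluate_with_cache_full("cand-0", ["ex-1", "ex-2"], fake_evaluator)
```

On the second call every pair is a cache hit, so `fake_evaluator` is not invoked again and `second[3]` still carries the `side_info` dicts stored during the first call.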

Docs hygiene (defb748):

23 files, comment-only / docstring-only / test-fixture-string-only changes.
Removed every /tmp/audit_*.md, /tmp/gepa_*.md, /tmp/gepa-official/... reference from source and tests.
Replaced synthetic /tmp/helix/{cid}, /tmp/wt, /tmp/fake-worktree, Path("/tmp"), etc. with /fake/... equivalents.
Collapsed stale merge-pairing audit X, merge-gate audit X, rng-state-persist X, audit-mutation §X, audit-init-engine.md X parenthetical tokens into either substantive GEPA file:line refs (where the surrounding text named one) or plain GEPA parity:.
One test repair: test_gemini_and_opencode_cli_args switched to tmp_path because the opencode branch actually mkdirs under the worktree path (mutator.py:1543) for per-candidate SQLite isolation, and /fake/wt correctly refuses to be created.

Backwards compatibility

Pre-extension eval_cache.pkl files unpickle with side_info=None on every entry (dataclass default), which falls through to the existing "no reflection feedstock available" path inside _cached_evaluate_batch and surfaces as the same {} placeholder the old code produced. No data migration is needed.

The evaluator callable signature passed into EvaluationCache.evaluate_with_cache_full is a 4-tuple now (was 3-tuple) — internal-only, no external callers exist outside this PR.
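
The `side_info=None` behaviour for old pickles follows from how Python restores plain dataclass instances: default unpickling does not call `__init__`, it creates a bare object and fills its `__dict__` from the saved state, so a field missing from that state resolves to the class-level default. The sketch below imitates that restore step by hand rather than loading a real `eval_cache.pkl`; `restore_like_pickle` is a hypothetical helper, and the sketch assumes `CachedEvaluation` is a regular (non-`slots`) dataclass.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class CachedEvaluation:
    output: Any
    score: float
    objective_scores: dict[str, float] | None = None
    side_info: dict[str, Any] | None = None


def restore_like_pickle(state: dict[str, Any]) -> CachedEvaluation:
    """Hypothetical helper: rebuild an instance the way default unpickling does,
    without calling __init__, by filling __dict__ from the saved state."""
    obj = object.__new__(CachedEvaluation)
    obj.__dict__.update(state)
    return obj


# State as an older class version would have saved it: no "side_info" key.
old_state = {"output": "rollout-17", "score": 0.5, "objective_scores": None}
legacy = restore_like_pickle(old_state)

# The instance has no side_info of its own; attribute lookup falls back
# to the class-level dataclass default.
has_own_side_info = "side_info" in vars(legacy)
diagnostics = legacy.side_info or {}
```

Here `diagnostics` ends up as the empty `{}` placeholder, which is the same value a caller would have seen before the `side_info` slot existed.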

Test plan

uv run pytest tests/unit/ -q — 873 passed locally on the branch tip.
uv run mypy --strict src/helix/ — clean (29 source files).
grep -rn '/tmp\|merge-pairing audit\|merge-gate audit\|rng-state-persist\|audit-mutation\|audit-init-engine' src/ tests/ — zero matches.
CI runs the same two commands on Python 3.11 and 3.12.

🤖 Generated with Claude Code

… hits

Extend `helix.eval_cache.CachedEvaluation` with a per-example `side_info` slot mirroring GEPA's `OptimizeAnythingAdapter._eval_cache` precedent (gepa/adapters/optimize_anything_adapter/optimize_anything_adapter.py:92, 200-216). Thread the slot through `put`, `put_batch`, and `evaluate_with_cache_full` (evaluator callable now returns a 4-tuple, helper now returns a 5-tuple), and drop the `fresh_side_info_by_id` closure side-channel in `evolution._cached_evaluate_batch` so reflection feedstock comes straight from the cache.

Concrete benefit: LIBERO `evaluation_diagnostics`, `judge_metrics`, `evaluator_error`, and `video_path` now survive a second visit to the same (candidate, example), and the mutator's `## Diagnostics` section sees them on cache hits instead of an empty `{}` placeholder.

Backwards-compatible: pre-extension `eval_cache.pkl` files unpickle with `side_info=None` (dataclass default), which gracefully degrades to the pre-extension behaviour.

Tests:
- `test_eval_cache.py`: new put/get round-trip, put_batch round-trip, and `evaluate_with_cache_full` round-trip including an evaluator that asserts it is never re-invoked on a full cache hit.
- `test_evolution_minibatch.py::test_partial_cache_hit_per_example_fields_merge`: updated to pre-populate side_info and assert it round-trips.
- `test_evolution_minibatch.py::test_partial_cache_hit_side_info_repopulated_after_fresh_eval`: new cold-miss → cold-hit cycle on `_cached_evaluate_batch`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… paths

Repo-wide cleanup of references to private audit notes that never existed in the repository:

- Source comments / docstrings: removed `/tmp/audit_<topic>.md:lines` pointers and `/tmp/gepa-official/...` paths. Replaced with public GEPA repo paths (`src/gepa/...`) or `github.com/gepa-ai/gepa` where the citation was substantive. Collapsed stale internal audit-section tokens (`audit C1`, `audit M3`, `audit-rng-state-persist D1`, `audit-mutation §C4`, `MODERATE D — audit-mutation.md C3`, ...) into plain `GEPA parity:` text or substantive GEPA file:line refs.
- Tests: removed the same dangling doc pointers. Replaced synthetic test path strings (`/tmp/helix/{cid}`, `/tmp/wt`, `/tmp/fake-worktree`, `/tmp/train.jsonl`, `Path("/tmp")`, etc.) with `/fake/...` so the repo stops claiming to use the system temp directory for things that never touch the filesystem.
- `tests/unit/test_mutator.py::test_gemini_and_opencode_cli_args` switched to `tmp_path`: the opencode branch genuinely materialises `<worktree>/.helix_opencode_state` for per-candidate SQLite isolation (mutator.py:1543), so it needs a real writable directory rather than a synthetic path string.

873 unit tests pass; `mypy --strict src/helix/` clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous commit added ``CachedEvaluation.side_info`` but kept old ``eval_cache.pkl`` files loadable, claiming "backwards compatibility". That claim was wrong: pre-extension entries unpickle with ``side_info=None``, which falls through to the same ``{}`` placeholder the pre-extension code produced, silently reproducing the LIBERO reflection-feedstock regression the ``side_info`` slot exists to fix.

Add an explicit ``EVAL_CACHE_SCHEMA_VERSION = 1`` and wrap the persisted cache as ``{"schema_version": int, "entries": dict}``. ``load_eval_cache`` quarantines any payload whose ``schema_version`` does not match (including the bare-dict shape previous HELIX wrote), so the next eval pass repopulates the cache with the new shape rather than silently keeping the broken-on-cache-hit behaviour. The old payload is preserved on disk under a timestamped ``.corrupt-schema-*`` suffix for diagnostics.

Tests:
- ``test_eval_cache_load_rejects_pre_extension_payload``: bare dict without envelope is quarantined.
- ``test_eval_cache_load_rejects_wrong_schema_version``: future ``schema_version`` is quarantined (forward-compat).
- ``test_eval_cache_load_rejects_malformed_envelope``: versioned payload missing the ``entries`` dict is quarantined.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
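
A versioned envelope with quarantine-on-mismatch, as this commit message describes, might look roughly like the sketch below. Only ``EVAL_CACHE_SCHEMA_VERSION``, ``load_eval_cache``, the envelope keys, and the ``.corrupt-schema-*`` suffix come from the commit message; the function signatures, the ``save_eval_cache`` helper, and the timestamp format are assumptions made for illustration.

```python
from __future__ import annotations

import pickle
import tempfile
import time
from pathlib import Path
from typing import Any

EVAL_CACHE_SCHEMA_VERSION = 1


def save_eval_cache(path: Path, entries: dict[Any, Any]) -> None:
    """Hypothetical writer: persist entries inside a versioned envelope."""
    payload = {"schema_version": EVAL_CACHE_SCHEMA_VERSION, "entries": entries}
    path.write_bytes(pickle.dumps(payload))


def load_eval_cache(path: Path) -> dict[Any, Any]:
    """Return cached entries, or {} after quarantining an incompatible payload."""
    if not path.exists():
        return {}
    payload = pickle.loads(path.read_bytes())
    compatible = (
        isinstance(payload, dict)
        and payload.get("schema_version") == EVAL_CACHE_SCHEMA_VERSION
        and isinstance(payload.get("entries"), dict)
    )
    if compatible:
        return payload["entries"]
    # Keep the old bytes on disk for diagnostics, but move them out of the way
    # so the next eval pass rebuilds the cache in the current shape.
    stamp = time.strftime("%Y%m%dT%H%M%S")  # assumed timestamp format
    path.rename(path.with_name(f"{path.name}.corrupt-schema-{stamp}"))
    return {}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "eval_cache.pkl"

    # Bare-dict payload, i.e. the shape written before the envelope existed.
    cache_path.write_bytes(pickle.dumps({("cand-0", "ex-1"): 0.5}))
    legacy_entries = load_eval_cache(cache_path)
    files_after_quarantine = sorted(p.name for p in Path(tmp).iterdir())

    # A payload written with the envelope round-trips unchanged.
    save_eval_cache(cache_path, {("cand-0", "ex-1"): 0.5})
    fresh_entries = load_eval_cache(cache_path)
```

The bare-dict payload has no ``schema_version`` key, so it is renamed rather than loaded, and the caller starts from an empty cache. A payload with a different ``schema_version`` or a non-dict ``entries`` value takes the same path.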

KE7 and others added 3 commits May 20, 2026 11:35

KE7 mentioned this pull request May 20, 2026

chore(prompts): drop vestigial [MUTATION COMPLETE] / [SUMMARY] protocol #38

Merged

4 tasks

KE7 wants to merge 3 commits into main from feat/cache-side-info-parity
