
feat(cache): preserve per-(candidate, example) side_info across cache hits#37

Open
KE7 wants to merge 3 commits into main from feat/cache-side-info-parity

KE7 commented May 20, 2026

Summary

Extend helix.eval_cache.CachedEvaluation with a per-example side_info slot so reflection feedstock (LIBERO evaluation_diagnostics, judge_metrics, evaluator_error, video_path, ...) survives a second visit to the same (candidate, example). Mirrors GEPA's OptimizeAnythingAdapter._eval_cache pattern at gepa/adapters/optimize_anything_adapter/optimize_anything_adapter.py:200-216, which already round-trips the rich (score, output, side_info) tuple precisely so reflection prompts don't lose context on cache hits.

Before: HELIX cached (output, score, objective_scores) only. Cache hits gave the mutator's ## Diagnostics section an empty {} placeholder for every previously-seen (candidate, example) pair, silently dropping the LIBERO feedback signal the second time around. The closure side-channel (fresh_side_info_by_id) in _cached_evaluate_batch was an explicit acknowledgement of this gap.

After: CachedEvaluation.side_info carries the per-example payload alongside score/output/objective_scores; the closure side-channel is gone; cache hits return the dict that was stored at the original eval time.
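
The before/after difference can be sketched as follows. This is a minimal illustration, not the code in src/helix/eval_cache.py: the field names come from the description above, while the in-memory `EvaluationCache` shape, the `put`/`get` signatures, and the sample payload are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CachedEvaluation:
    """One cached result for a (candidate, example) pair (illustrative shape)."""

    output: Any
    score: float
    objective_scores: Optional[dict[str, float]] = None
    # New slot: per-example reflection feedstock; None means nothing was stored.
    side_info: Optional[dict[str, Any]] = None


class EvaluationCache:
    """Toy in-memory cache keyed by (candidate_id, example_id); signatures are hypothetical."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], CachedEvaluation] = {}

    def put(
        self,
        candidate_id: str,
        example_id: str,
        output: Any,
        score: float,
        objective_scores: Optional[dict[str, float]] = None,
        side_info: Optional[dict[str, Any]] = None,
    ) -> None:
        self._entries[(candidate_id, example_id)] = CachedEvaluation(
            output, score, objective_scores, side_info
        )

    def get(self, candidate_id: str, example_id: str) -> Optional[CachedEvaluation]:
        return self._entries.get((candidate_id, example_id))


cache = EvaluationCache()
cache.put(
    "cand-1",
    "ex-7",
    output="done",
    score=0.0,
    side_info={"evaluator_error": "timeout", "video_path": "/fake/ex-7.mp4"},
)
hit = cache.get("cand-1", "ex-7")  # a second visit sees the stored side_info
```

Storing the payload on the entry itself means a cache hit needs no extra channel to recover diagnostics.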

Bundled with: repo-wide docs hygiene sweep — removed dangling /tmp/<audit>.md doc pointers from source and tests (these referenced private audit notes that never existed in the repo), replaced synthetic /tmp/... worktree-path strings with /fake/..., and collapsed stale internal audit-section parentheticals into substantive GEPA file:line refs. See the second commit for details.

What changed

Cache feature (8c60a71):

  • src/helix/eval_cache.py: add a side_info: dict[str, Any] | None = None slot to CachedEvaluation and thread it through put, put_batch, and evaluate_with_cache_full. The evaluator callable now returns a 4-tuple (outputs, scores, objective_scores, side_infos), and the helper now returns a 5-tuple (side_info_by_id is the added element).
  • src/helix/evolution.py::_cached_evaluate_batch: drop the fresh_side_info_by_id closure side-channel; read side_info straight off the cache.
  • tests/unit/test_eval_cache.py: 3 new positive tests, including an "evaluator must not run on full cache hit" assertion that locks in the cost-savings invariant.
  • tests/unit/test_evolution_minibatch.py: updated test_partial_cache_hit_per_example_fields_merge (previously asserted {} for cache-hit ids; now asserts the stored side_info comes back) and added test_partial_cache_hit_side_info_repopulated_after_fresh_eval, which covers the cold-miss → cold-hit cycle on _cached_evaluate_batch.
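
The tuple contract and the "evaluator must not run on full cache hit" invariant listed above can be sketched like this. It is a simplified stand-in, assuming a plain dict as the cache and a free function instead of the EvaluationCache method; the order and contents of the returned 5-tuple (in particular the trailing fresh-evaluation count) are guesses, since the description only says side_info_by_id was added.

```python
from typing import Any, Callable, Optional

# Assumed evaluator contract: four parallel lists, one item per example id.
EvaluatorResult = tuple[
    list[Any],                         # outputs
    list[float],                       # scores
    list[Optional[dict[str, float]]],  # objective_scores
    list[Optional[dict[str, Any]]],    # side_infos (the new fourth element)
]
CacheKey = tuple[str, str]  # (candidate_id, example_id)


def evaluate_with_cache_full(
    cache: dict[CacheKey, tuple],
    candidate_id: str,
    example_ids: list[str],
    evaluator: Callable[[list[str]], EvaluatorResult],
) -> tuple[dict, dict, dict, dict, int]:
    """Evaluate only the cache misses, then answer every id from the cache."""
    missing = [eid for eid in example_ids if (candidate_id, eid) not in cache]
    if missing:  # on a full cache hit the evaluator is never called
        outputs, scores, objectives, side_infos = evaluator(missing)
        for eid, entry in zip(missing, zip(outputs, scores, objectives, side_infos)):
            cache[(candidate_id, eid)] = entry

    outputs_by_id, scores_by_id, objectives_by_id, side_info_by_id = {}, {}, {}, {}
    for eid in example_ids:
        output, score, objective, side_info = cache[(candidate_id, eid)]
        outputs_by_id[eid] = output
        scores_by_id[eid] = score
        objectives_by_id[eid] = objective
        side_info_by_id[eid] = side_info
    return outputs_by_id, scores_by_id, objectives_by_id, side_info_by_id, len(missing)


calls: list[list[str]] = []


def fake_evaluator(ids: list[str]) -> EvaluatorResult:
    """Record each invocation so the test can prove a full hit skips evaluation."""
    calls.append(list(ids))
    return (
        [f"out-{i}" for i in ids],
        [1.0 for _ in ids],
        [None for _ in ids],
        [{"evaluator_error": None, "video_path": f"/fake/{i}.mp4"} for i in ids],
    )


cache: dict[CacheKey, tuple] = {}
first = evaluate_with_cache_full(cache, "cand-1", ["ex-1", "ex-2"], fake_evaluator)
second = evaluate_with_cache_full(cache, "cand-1", ["ex-1", "ex-2"], fake_evaluator)
```

The second call is a full cache hit: `calls` still holds a single entry, and the side_info returned matches what the first call stored.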

Docs hygiene (defb748):

  • 23 files, comment-only / docstring-only / test-fixture-string-only changes.
  • Removed every /tmp/audit_*.md, /tmp/gepa_*.md, /tmp/gepa-official/... reference from source and tests.
  • Replaced synthetic /tmp/helix/{cid}, /tmp/wt, /tmp/fake-worktree, Path("/tmp"), etc. with /fake/... equivalents.
  • Collapsed stale merge-pairing audit X, merge-gate audit X, rng-state-persist X, audit-mutation §X, audit-init-engine.md X parenthetical tokens into either substantive GEPA file:line refs (where the surrounding text named one) or plain GEPA parity:.
  • One test repair: test_gemini_and_opencode_cli_args switched to tmp_path because the opencode branch actually mkdirs under the worktree path (mutator.py:1543) for per-candidate SQLite isolation, and /fake/wt correctly refuses to be created.
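
The test repair in the last bullet comes down to code that really creates a directory under the worktree. A rough sketch, assuming a hypothetical `prepare_opencode_state` helper (the real logic lives at mutator.py:1543 and is not reproduced here); the `.helix_opencode_state` name is taken from the commit message:

```python
import tempfile
from pathlib import Path


def prepare_opencode_state(worktree: Path) -> Path:
    """Create the per-candidate state directory under the worktree (hypothetical helper)."""
    state_dir = worktree / ".helix_opencode_state"
    state_dir.mkdir(parents=True, exist_ok=True)  # touches the real filesystem
    return state_dir


# In pytest the writable directory would come from the built-in tmp_path fixture;
# a TemporaryDirectory plays the same role in this standalone sketch.
with tempfile.TemporaryDirectory() as tmp:
    state_dir = prepare_opencode_state(Path(tmp) / "wt")
    created = state_dir.is_dir()
```

A synthetic string such as /fake/wt is fine for code that only formats paths, but not for code that creates them.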

Backwards compatibility

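As first written, pre-extension eval_cache.pkl files unpickle with side_info=None on every entry (dataclass default), which falls through to the existing "no reflection feedstock available" path inside _cached_evaluate_batch and surfaces as the same {} placeholder the old code produced. The third commit reverses this, because that fallback silently reproduces the regression this PR fixes: load_eval_cache now quarantines any payload without a matching schema_version (including the pre-extension bare dict), and the next eval pass repopulates the cache. No manual data migration is needed.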
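
The dataclass-default fallback described above can be illustrated without a real pickle file. The sketch assumes CachedEvaluation is a plain dataclass (no slots=True, no default_factory on side_info) and mimics default unpickling by hand, which creates the object without calling __init__ and then restores the saved attribute dict.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CachedEvaluation:
    output: Any
    score: float
    objective_scores: Optional[dict[str, float]] = None
    side_info: Optional[dict[str, Any]] = None  # field added after old pickles were written


# Attribute state as an old pickle would carry it: there is no "side_info" key.
old_state = {"output": "done", "score": 1.0, "objective_scores": None}

# Mimic default unpickling: allocate without __init__, then restore the saved dict.
restored = CachedEvaluation.__new__(CachedEvaluation)
restored.__dict__.update(old_state)

# The missing instance attribute resolves to the class-level default (None),
# which the caller then turns into the empty placeholder.
diagnostics = restored.side_info if restored.side_info is not None else {}
```

So an old entry loads without error, but it carries no diagnostics, which is exactly the empty `{}` placeholder behaviour.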

The evaluator callable passed into EvaluationCache.evaluate_with_cache_full now returns a 4-tuple (was a 3-tuple). This is internal-only: no callers exist outside this PR.

Test plan

  • uv run pytest tests/unit/ -q — 873 passed locally on the branch tip.
  • uv run mypy --strict src/helix/ — clean (29 source files).
  • grep -rn '/tmp\|merge-pairing audit\|merge-gate audit\|rng-state-persist\|audit-mutation\|audit-init-engine' src/ tests/ — zero matches.
  • CI runs the same pytest and mypy commands on Python 3.11 and 3.12.

🤖 Generated with Claude Code

KE7 and others added 3 commits May 20, 2026 11:35
… hits

Extend `helix.eval_cache.CachedEvaluation` with a per-example
`side_info` slot mirroring GEPA's `OptimizeAnythingAdapter._eval_cache`
precedent (gepa/adapters/optimize_anything_adapter/optimize_anything_adapter.py:92,
200-216). Thread the slot through `put`, `put_batch`, and
`evaluate_with_cache_full` (evaluator callable now returns a 4-tuple,
helper now returns a 5-tuple), and drop the
`fresh_side_info_by_id` closure side-channel in
`evolution._cached_evaluate_batch` so reflection feedstock comes
straight from the cache.

Concrete benefit: LIBERO `evaluation_diagnostics`, `judge_metrics`,
`evaluator_error`, and `video_path` now survive a second visit to the
same (candidate, example), and the mutator's `## Diagnostics` section
sees them on cache hits instead of an empty `{}` placeholder.

Backwards-compatible: pre-extension `eval_cache.pkl` files unpickle
with `side_info=None` (dataclass default), which gracefully degrades to
the pre-extension behaviour.

Tests:
- `test_eval_cache.py` — new put/get round-trip, put_batch round-trip,
  and `evaluate_with_cache_full` round-trip including an evaluator that
  asserts it is never re-invoked on a full cache hit.
- `test_evolution_minibatch.py::test_partial_cache_hit_per_example_fields_merge`
  — updated to pre-populate side_info and assert it round-trips.
- `test_evolution_minibatch.py::test_partial_cache_hit_side_info_repopulated_after_fresh_eval`
  — new cold-miss → cold-hit cycle on `_cached_evaluate_batch`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… paths

Repo-wide cleanup of references to private audit notes that never
existed in the repository:

- Source comments / docstrings: removed `/tmp/audit_<topic>.md:lines`
  pointers and `/tmp/gepa-official/...` paths.  Replaced with public
  GEPA repo paths (`src/gepa/...`) or `github.com/gepa-ai/gepa` where
  the citation was substantive.  Collapsed stale internal audit-section
  tokens (`audit C1`, `audit M3`, `audit-rng-state-persist D1`,
  `audit-mutation §C4`, `MODERATE D — audit-mutation.md C3`, ...) into
  plain `GEPA parity:` text or substantive GEPA file:line refs.
- Tests: removed the same dangling doc pointers.  Replaced synthetic
  test path strings (`/tmp/helix/{cid}`, `/tmp/wt`, `/tmp/fake-worktree`,
  `/tmp/train.jsonl`, `Path("/tmp")`, etc.) with `/fake/...` so the repo
  stops claiming to use the system temp directory for things that never
  touch the filesystem.
- `tests/unit/test_mutator.py::test_gemini_and_opencode_cli_args`
  switched to `tmp_path`: the opencode branch genuinely materialises
  `<worktree>/.helix_opencode_state` for per-candidate SQLite isolation
  (mutator.py:1543), so it needs a real writable directory rather than
  a synthetic path string.

873 unit tests pass; `mypy --strict src/helix/` clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit added ``CachedEvaluation.side_info`` but kept old
``eval_cache.pkl`` files loadable, claiming "backwards compatibility".
That claim was wrong: pre-extension entries unpickle with
``side_info=None``, which falls through to the same ``{}`` placeholder
the pre-extension code produced — silently reproducing the LIBERO
reflection-feedstock regression the ``side_info`` slot exists to fix.

Add an explicit ``EVAL_CACHE_SCHEMA_VERSION = 1`` and wrap the persisted
cache as ``{"schema_version": int, "entries": dict}``.  ``load_eval_cache``
quarantines any payload whose ``schema_version`` does not match (including
the bare-dict shape previous HELIX wrote), so the next eval pass
repopulates the cache with the new shape rather than silently keeping
the broken-on-cache-hit behaviour.  The old payload is preserved on
disk under a timestamped ``.corrupt-schema-*`` suffix for diagnostics.

Tests:
- ``test_eval_cache_load_rejects_pre_extension_payload`` — bare dict
  without envelope is quarantined.
- ``test_eval_cache_load_rejects_wrong_schema_version`` — future
  ``schema_version`` is quarantined (forward-compat).
- ``test_eval_cache_load_rejects_malformed_envelope`` — versioned
  payload missing the ``entries`` dict is quarantined.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
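
The versioned envelope and quarantine step from this last commit can be sketched as below. The constant name and envelope keys come from the commit message; the `save_eval_cache` helper, both function signatures, and the exact timestamp format of the `.corrupt-schema-*` suffix are assumptions made for illustration.

```python
import pickle
import tempfile
import time
from pathlib import Path

EVAL_CACHE_SCHEMA_VERSION = 1


def save_eval_cache(path: Path, entries: dict) -> None:
    """Persist entries wrapped in the versioned envelope (hypothetical helper)."""
    payload = {"schema_version": EVAL_CACHE_SCHEMA_VERSION, "entries": entries}
    path.write_bytes(pickle.dumps(payload))


def load_eval_cache(path: Path) -> dict:
    """Return cached entries, or {} after quarantining an unusable payload."""
    if not path.exists():
        return {}
    payload = pickle.loads(path.read_bytes())
    is_current = (
        isinstance(payload, dict)
        and payload.get("schema_version") == EVAL_CACHE_SCHEMA_VERSION
        and isinstance(payload.get("entries"), dict)
    )
    if is_current:
        return payload["entries"]
    # Keep the old file on disk for diagnostics under a timestamped suffix.
    quarantine = path.with_name(f"{path.name}.corrupt-schema-{int(time.time())}")
    path.rename(quarantine)
    return {}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "eval_cache.pkl"
    # Pre-extension shape: a bare dict with no envelope.
    cache_path.write_bytes(pickle.dumps({("cand-1", "ex-1"): ("out", 1.0)}))
    legacy_entries = load_eval_cache(cache_path)
    quarantined = [p.name for p in Path(tmp).glob("eval_cache.pkl.corrupt-schema-*")]
    # Current shape: the envelope round-trips.
    save_eval_cache(cache_path, {("cand-1", "ex-1"): ("out", 1.0, None, {"judge_metrics": {}})})
    current_entries = load_eval_cache(cache_path)
```

The bare dict has no `schema_version` key, so it fails the check and is renamed rather than reused. The sketch does not handle a file that fails to unpickle at all.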