feat(cache): preserve per-(candidate, example) side_info across cache hits#37
Open
KE7 wants to merge 3 commits into
Open
feat(cache): preserve per-(candidate, example) side_info across cache hits#37KE7 wants to merge 3 commits into
KE7 wants to merge 3 commits into
Conversation
… hits
Extend `helix.eval_cache.CachedEvaluation` with a per-example
`side_info` slot mirroring GEPA's `OptimizeAnythingAdapter._eval_cache`
precedent (gepa/adapters/optimize_anything_adapter/optimize_anything_adapter.py:92,
200-216). Thread the slot through `put`, `put_batch`, and
`evaluate_with_cache_full` (evaluator callable now returns a 4-tuple,
helper now returns a 5-tuple), and drop the
`fresh_side_info_by_id` closure side-channel in
`evolution._cached_evaluate_batch` so reflection feedstock comes
straight from the cache.
Concrete benefit: LIBERO `evaluation_diagnostics`, `judge_metrics`,
`evaluator_error`, and `video_path` now survive a second visit to the
same (candidate, example), and the mutator's `## Diagnostics` section
sees them on cache hits instead of an empty `{}` placeholder.
Backwards-compatible: pre-extension `eval_cache.pkl` files unpickle
with `side_info=None` (dataclass default), which gracefully degrades to
the pre-extension behaviour.
Tests:
- `test_eval_cache.py` — new put/get round-trip, put_batch round-trip,
and `evaluate_with_cache_full` round-trip including an evaluator that
asserts it is never re-invoked on a full cache hit.
- `test_evolution_minibatch.py::test_partial_cache_hit_per_example_fields_merge`
— updated to pre-populate side_info and assert it round-trips.
- `test_evolution_minibatch.py::test_partial_cache_hit_side_info_repopulated_after_fresh_eval`
— new cold-miss → cold-hit cycle on `_cached_evaluate_batch`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… paths
Repo-wide cleanup of references to private audit notes that never
existed in the repository:
- Source comments / docstrings: removed `/tmp/audit_<topic>.md:lines`
pointers and `/tmp/gepa-official/...` paths. Replaced with public
GEPA repo paths (`src/gepa/...`) or `github.com/gepa-ai/gepa` where
the citation was substantive. Collapsed stale internal audit-section
tokens (`audit C1`, `audit M3`, `audit-rng-state-persist D1`,
`audit-mutation §C4`, `MODERATE D — audit-mutation.md C3`, ...) into
plain `GEPA parity:` text or substantive GEPA file:line refs.
- Tests: removed the same dangling doc pointers. Replaced synthetic
test path strings (`/tmp/helix/{cid}`, `/tmp/wt`, `/tmp/fake-worktree`,
`/tmp/train.jsonl`, `Path("/tmp")`, etc.) with `/fake/...` so the repo
stops claiming to use the system temp directory for things that never
touch the filesystem.
- `tests/unit/test_mutator.py::test_gemini_and_opencode_cli_args`
switched to `tmp_path`: the opencode branch genuinely materialises
`<worktree>/.helix_opencode_state` for per-candidate SQLite isolation
(mutator.py:1543), so it needs a real writable directory rather than
a synthetic path string.
873 unit tests pass; `mypy --strict src/helix/` clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit added ``CachedEvaluation.side_info`` but kept old
``eval_cache.pkl`` files loadable, claiming "backwards compatibility".
That claim was wrong: pre-extension entries unpickle with
``side_info=None``, which falls through to the same ``{}`` placeholder
the pre-extension code produced — silently reproducing the LIBERO
reflection-feedstock regression the ``side_info`` slot exists to fix.
Add an explicit ``EVAL_CACHE_SCHEMA_VERSION = 1`` and wrap the persisted
cache as ``{"schema_version": int, "entries": dict}``. ``load_eval_cache``
quarantines any payload whose ``schema_version`` does not match (including
the bare-dict shape previous HELIX wrote), so the next eval pass
repopulates the cache with the new shape rather than silently keeping
the broken-on-cache-hit behaviour. The old payload is preserved on
disk under a timestamped ``.corrupt-schema-*`` suffix for diagnostics.
Tests:
- ``test_eval_cache_load_rejects_pre_extension_payload`` — bare dict
without envelope is quarantined.
- ``test_eval_cache_load_rejects_wrong_schema_version`` — future
``schema_version`` is quarantined (forward-compat).
- ``test_eval_cache_load_rejects_malformed_envelope`` — versioned
payload missing the ``entries`` dict is quarantined.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Extend
helix.eval_cache.CachedEvaluationwith a per-exampleside_infoslot so reflection feedstock (LIBEROevaluation_diagnostics,judge_metrics,evaluator_error,video_path, ...) survives a second visit to the same (candidate, example). Mirrors GEPA'sOptimizeAnythingAdapter._eval_cachepattern atgepa/adapters/optimize_anything_adapter/optimize_anything_adapter.py:200-216, which already round-trips the rich(score, output, side_info)tuple precisely so reflection prompts don't lose context on cache hits.Before: HELIX cached
(output, score, objective_scores)only. Cache hits gave the mutator's## Diagnosticssection an empty{}placeholder for every previously-seen (candidate, example) pair, silently dropping the LIBERO feedback signal the second time around. The closure side-channel (fresh_side_info_by_id) in_cached_evaluate_batchwas an explicit acknowledgement of this gap.After:
CachedEvaluation.side_infocarries the per-example payload alongside score/output/objective_scores; the closure side-channel is gone; cache hits return the dict that was stored at the original eval time.Bundled with: repo-wide docs hygiene sweep — removed dangling
/tmp/<audit>.mddoc pointers from source and tests (these referenced private audit notes that never existed in the repo), replaced synthetic/tmp/...worktree-path strings with/fake/..., and collapsed stale internal audit-section parentheticals into substantive GEPA file:line refs. See the second commit for details.What changed
Cache feature (
8c60a71):src/helix/eval_cache.py: addside_info: dict[str, Any] | None = Noneslot toCachedEvaluation; thread it throughput,put_batch,evaluate_with_cache_full. Evaluator callable returns a 4-tuple (outputs, scores, objective_scores, side_infos), helper returns a 5-tuple (addedside_info_by_id).src/helix/evolution.py::_cached_evaluate_batch: drop thefresh_side_info_by_idclosure side-channel; read side_info straight off the cache.tests/unit/test_eval_cache.py: 3 new positive tests including a "evaluator must not run on full cache hit" assertion that locks the cost-savings invariant.tests/unit/test_evolution_minibatch.py: updatedtest_partial_cache_hit_per_example_fields_merge(previously asserted{}for cache-hit ids; now asserts the stored side_info comes back) + newtest_partial_cache_hit_side_info_repopulated_after_fresh_evalcovering the cold-miss → cold-hit cycle on_cached_evaluate_batch.Docs hygiene (
defb748):/tmp/audit_*.md,/tmp/gepa_*.md,/tmp/gepa-official/...reference from source and tests./tmp/helix/{cid},/tmp/wt,/tmp/fake-worktree,Path("/tmp"), etc. with/fake/...equivalents.merge-pairing audit X,merge-gate audit X,rng-state-persist X,audit-mutation §X,audit-init-engine.md Xparenthetical tokens into either substantive GEPA file:line refs (where the surrounding text named one) or plainGEPA parity:.test_gemini_and_opencode_cli_argsswitched totmp_pathbecause the opencode branch actuallymkdirs under the worktree path (mutator.py:1543) for per-candidate SQLite isolation, and/fake/wtcorrectly refuses to be created.Backwards compatibility
Pre-extension
eval_cache.pklfiles unpickle withside_info=Noneon every entry (dataclass default), which falls through to the existing "no reflection feedstock available" path inside_cached_evaluate_batchand surfaces as the same{}placeholder the old code produced. No data migration is needed.The evaluator callable signature passed into
EvaluationCache.evaluate_with_cache_fullis a 4-tuple now (was 3-tuple) — internal-only, no external callers exist outside this PR.Test plan
uv run pytest tests/unit/ -q— 873 passed locally on the branch tip.uv run mypy --strict src/helix/— clean (29 source files).grep -rn '/tmp\|merge-pairing audit\|merge-gate audit\|rng-state-persist\|audit-mutation\|audit-init-engine' src/ tests/— zero matches.🤖 Generated with Claude Code