feat(peek): add optional PEEK orientation-cache integration#33
Conversation
Add PEEK (arXiv:2605.19932) orientation-cache support as an optional extra. `PeekSession` wraps peek-ai's `CachePolicy` and bridges our `ModelAdapter` to peek's `LMClient` protocol. The map is injected via the existing `system_prompt_supplement` slot on `RLM`; no changes to the core API. - src/pyrlm_runtime/peek_integration.py: adapter, `PeekSession`, and `trace_to_peek_trajectory` for serializing a `Trace` into the plain-string trajectory format peek's Distiller expects. - tests/test_peek_integration.py: 21 tests using FakeAdapter and a stub LMClient; skips PEEK-dependent tests gracefully when peek-ai is not installed. - examples/peek_bench/: benchmark harness comparing baseline RLM vs RLM+PEEK on oolongbench/oolong-synth. Supports explicit context selection (`--context-ids`) and fully-online evolution (`--evolve-steps -1`). - docs/peek-bench/PEEK-EXPERIMENTS.md: pre-committed hypothesis, decision rule, and results from three runs. Aggregate verdict is REJECT across all phases on oolong-synth. Phase 3 (N=5 largest contexts, fully-online) shows the effect is sharply dataset-dependent: agnews +7.6pp, negation +4.0pp, imdb -4.2pp, multinli -5.7pp, metaphors -10.7pp. agnews replication is consistent with the paper's Table 1. peek-ai is not on PyPI; install with `pip install git+https://github.com/zhuohangu/peek.git`.
📝 WalkthroughWalkthroughThis PR introduces a complete PEEK integration for pyrlm-runtime: a new optional library module enabling persistent context caching across RLM queries, a benchmark runner comparing baseline versus PEEK-augmented performance on the oolong-synth dataset, comprehensive tests, and documentation of project conventions and experimental results. ChangesPEEK Integration and Benchmarking
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
CLAUDE.md (2)
113-114: ⚡ Quick winConsider clarifying the OBLIQ reference in a PEEK PR.
These lines reference
docs/obliq-bench/OBLIQ-OBJETIVO.mdandOBLIQ-EXPERIMENTS.mdas examples, but this PR introduces PEEK experiments. Either:
- Generalize these references (e.g., "Reference docs for active experiments:
docs/{experiment-name}/directories")- Add a note that OBLIQ is just one example
- Update to reference peek-bench as well
This would reduce confusion for readers who see PEEK files in the PR but OBLIQ references in the methodology section.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CLAUDE.md` around lines 113 - 114, Update the reference lines that currently point only to OBLIQ docs (OBLIQ-OBJETIVO.md and OBLIQ-EXPERIMENTS.md) to avoid confusion with the new PEEK experiments: either generalize to a pattern like "Reference docs for active experiments: docs/{experiment-name}/" or add an explicit note that OBLIQ is one example and include/mention peek-bench (PEEK) as another example; make the edit in CLAUDE.md near the methodology/reference section so readers see the generalized wording or the added PEEK mention.
19-42: ⚡ Quick winConsider documenting peek_integration.py in the structure.
The project structure lists core modules but doesn't mention
peek_integration.py, which is added in this PR. Since it's an optional integration module, you might want to either:
- Add it to the structure with a note that it's optional
- Add a note explaining that optional modules are not listed here
This helps future contributors understand where to find the PEEK integration.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CLAUDE.md` around lines 19 - 42, Update the project structure in CLAUDE.md to include the new optional integration module by either adding a line for peek_integration.py under src/pyrlm_runtime/ (e.g., "peek_integration.py # Optional PEEK integration") or by adding a short note above the tree explaining that optional integration modules (like peek_integration.py) are not listed by default; reference the filename peek_integration.py so readers can locate the PEEK integration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/peek_bench/run_oolong_rlm_vs_peek.py`:
- Around line 249-262: The except path can retry session.update_from_run and
re-raise if the first update (after rlm.run) was the source of the exception;
fix by tracking whether update_from_run was already attempted and avoid retrying
(and also guard update calls with their own try/except). Concretely, introduce a
local flag (e.g., updated_from_run = False) before calling rlm.run; after
rlm.run set updated_from_run = True only if session.update_from_run(trace,
query=question) completes successfully (wrap that call in its own try/except and
log/suppress any errors); in the outer except, only attempt
session.update_from_run(trace, query=question) if trace is not None and
updated_from_run is False, and again wrap it in try/except to avoid cascading
failures. Ensure references to rlm.run and session.update_from_run and the trace
variable are used so the changes are applied to the correct code paths.
In `@tests/test_peek_integration.py`:
- Around line 261-267: The test
test_system_prompt_supplement_nonempty_after_update is too permissive because it
allows an empty supplement; after calling
session.update_from_run(_make_simple_trace(), query="Q") assert that
session.system_prompt_supplement (supp) is non-empty and contains the expected
map marker instead of permitting "". Replace the current assertion with a direct
check such as asserting "Context Map" is in supp (and/or supp.strip() != "") so
the test fails if the Cartographer stopped populating the map.
- Around line 322-327: Rename and rework the test
'test_load_raises_without_peek' so it actually simulates peek being unavailable
and asserts the ImportError instead of duplicating the happy-path create check:
monkeypatch the module helper '_require_peek' (or whichever internal function
enforces peek import) to raise ImportError, then call the public API under test
(either PeekSession.load() if present or PeekSession.create(...) which should
call the loader) with a FakeAdapter and assert that an ImportError is raised;
remove the current happy-path assertions that use FakeAdapter without triggering
the import failure.
---
Nitpick comments:
In `@CLAUDE.md`:
- Around line 113-114: Update the reference lines that currently point only to
OBLIQ docs (OBLIQ-OBJETIVO.md and OBLIQ-EXPERIMENTS.md) to avoid confusion with
the new PEEK experiments: either generalize to a pattern like "Reference docs
for active experiments: docs/{experiment-name}/" or add an explicit note that
OBLIQ is one example and include/mention peek-bench (PEEK) as another example;
make the edit in CLAUDE.md near the methodology/reference section so readers see
the generalized wording or the added PEEK mention.
- Around line 19-42: Update the project structure in CLAUDE.md to include the
new optional integration module by either adding a line for peek_integration.py
under src/pyrlm_runtime/ (e.g., "peek_integration.py # Optional PEEK
integration") or by adding a short note above the tree explaining that optional
integration modules (like peek_integration.py) are not listed by default;
reference the filename peek_integration.py so readers can locate the PEEK
integration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f5a4e211-0b4e-415e-8f5c-d3580dcb4752
📒 Files selected for processing (8)
.gitignoreCLAUDE.mddocs/peek-bench/PEEK-EXPERIMENTS.mdexamples/peek_bench/__init__.pyexamples/peek_bench/run_oolong_rlm_vs_peek.pypyproject.tomlsrc/pyrlm_runtime/peek_integration.pytests/test_peek_integration.py
| try: | ||
| output, trace = rlm.run(question, context) | ||
| session.update_from_run(trace, query=question) | ||
| return ( | ||
| output or "", | ||
| _trace_step_count(trace), | ||
| _trace_total_tokens(trace), | ||
| time.time() - start, | ||
| None, | ||
| ) | ||
| except Exception as exc: | ||
| if trace is not None: | ||
| session.update_from_run(trace, query=question) | ||
| return ( |
There was a problem hiding this comment.
update_from_run can be retried after failing, causing cascading failure.
If the initial exception comes from session.update_from_run(...) (not rlm.run), the except path calls update_from_run again and can raise again, breaking the benchmark loop.
Suggested fix
- try:
- output, trace = rlm.run(question, context)
- session.update_from_run(trace, query=question)
+ try:
+ output, trace = rlm.run(question, context)
+ try:
+ session.update_from_run(trace, query=question)
+ except Exception:
+ # Keep benchmark query result; treat map-update failures as non-fatal here.
+ pass
return (
output or "",
_trace_step_count(trace),
_trace_total_tokens(trace),
time.time() - start,
None,
)
except Exception as exc:
- if trace is not None:
- session.update_from_run(trace, query=question)
return (
"",
_trace_step_count(trace),
_trace_total_tokens(trace),
time.time() - start,
str(exc),
)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/peek_bench/run_oolong_rlm_vs_peek.py` around lines 249 - 262, The
except path can retry session.update_from_run and re-raise if the first update
(after rlm.run) was the source of the exception; fix by tracking whether
update_from_run was already attempted and avoid retrying (and also guard update
calls with their own try/except). Concretely, introduce a local flag (e.g.,
updated_from_run = False) before calling rlm.run; after rlm.run set
updated_from_run = True only if session.update_from_run(trace, query=question)
completes successfully (wrap that call in its own try/except and log/suppress
any errors); in the outer except, only attempt session.update_from_run(trace,
query=question) if trace is not None and updated_from_run is False, and again
wrap it in try/except to avoid cascading failures. Ensure references to rlm.run
and session.update_from_run and the trace variable are used so the changes are
applied to the correct code paths.
| def test_system_prompt_supplement_nonempty_after_update(self) -> None: | ||
| session = self._session_with_stub(token_budget=1024) | ||
| session.update_from_run(_make_simple_trace(), query="Q") | ||
| supp = session.system_prompt_supplement | ||
| # After update, the Cartographer added a roadmap entry — map is non-empty | ||
| assert supp == "" or "Context Map" in supp | ||
|
|
There was a problem hiding this comment.
This assertion is too permissive and can hide regressions.
After update_from_run() with the stub responses, allowing supp == "" makes the test pass even if map population breaks. Assert non-empty map text directly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_peek_integration.py` around lines 261 - 267, The test
test_system_prompt_supplement_nonempty_after_update is too permissive because it
allows an empty supplement; after calling
session.update_from_run(_make_simple_trace(), query="Q") assert that
session.system_prompt_supplement (supp) is non-empty and contains the expected
map marker instead of permitting "". Replace the current assertion with a direct
check such as asserting "Context Map" is in supp (and/or supp.strip() != "") so
the test fails if the Cartographer stopped populating the map.
| def test_load_raises_without_peek(self, monkeypatch) -> None: | ||
| # When peek is importable but we simulate the ImportError path via _require_peek | ||
| # This test just verifies the happy path of create() doesn't crash. | ||
| fake = FakeAdapter(script=[_DISTILLER_RESPONSE, _CARTOGRAPHER_RESPONSE]) | ||
| session = PeekSession.create(fake, token_budget=512) | ||
| assert session is not None |
There was a problem hiding this comment.
Test name/intent does not match behavior being tested.
test_load_raises_without_peek never calls load() and never verifies a raise path; it duplicates a happy-path create check. Please rewrite it to actually simulate missing peek and assert the expected ImportError from load()/create().
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_peek_integration.py` around lines 322 - 327, Rename and rework the
test 'test_load_raises_without_peek' so it actually simulates peek being
unavailable and asserts the ImportError instead of duplicating the happy-path
create check: monkeypatch the module helper '_require_peek' (or whichever
internal function enforces peek import) to raise ImportError, then call the
public API under test (either PeekSession.load() if present or
PeekSession.create(...) which should call the loader) with a FakeAdapter and
assert that an ImportError is raised; remove the current happy-path assertions
that use FakeAdapter without triggering the import failure.
|
Closing in favor of this: #34 |
Summary
Adds optional integration with PEEK
(Gu et al., MIT CSAIL) — a context map injected into the system prompt
that accumulates reusable structural knowledge about a recurring
external context across multiple RLM runs.
PeekSessionwraps peek-ai'sCachePolicyand bridges ourModelAdapterto peek'sLMClientprotocol.system_prompt_supplementslot onRLM— no changes to the coreRLMAPI.peek-aiis shipped as an optional extra (not currently on PyPI;install with
pip install git+https://github.com/zhuohangu/peek.git).What's included
src/pyrlm_runtime/peek_integration.py— adapter,PeekSession,trace_to_peek_trajectoryserializertests/test_peek_integration.py— 21 tests, FakeAdapter-based,skip gracefully without peek-ai
examples/peek_bench/run_oolong_rlm_vs_peek.py— benchmark harnesson
oolongbench/oolong-synth(RLM vs RLM+PEEK)docs/peek-bench/PEEK-EXPERIMENTS.md— pre-committed hypothesis,decision rule, and results from three runs
Benchmark result: REJECT (do not merge as default behavior)
Aggregate verdict across all phases on
oolongbench/oolong-synthisREJECT against the pre-committed threshold (Δ score ≥ +5pp). PEEK is
shipped as optional only, not wired into any default code path.
Phase 3 (N=5 largest counting contexts, fully online) shows the effect
is sharply dataset-dependent:
agnewsreplicates the paper's Table 1 finding. Other sub-datasetsdo not, with two distinct failure modes (premature termination on
imdb, misleading map content on metaphors). Full diagnostics, scorer
validation against
abertsch72/oolong, and upstream integrationaudit are in
docs/peek-bench/PEEK-EXPERIMENTS.md.Test plan
uv sync— verify no regression on existing depsuv pip install "git+https://github.com/zhuohangu/peek.git"— install peek-aiuv run pytest tests/test_peek_integration.py -v— 21 tests pass with peek-ai installeduv run pytest— full test suite still greenuv run ruff check src/ tests/ examples/— lint cleandocs/peek-bench/PEEK-EXPERIMENTS.mdfor the experimental writeupSummary by CodeRabbit
New Features
Documentation
Chores