feat(peek): add optional PEEK orientation-cache integration by apenab · Pull Request #33 · apenab/pyrlm-runtime

apenab · 2026-05-21T19:13:04Z

Summary

Adds optional integration with PEEK
(Gu et al., MIT CSAIL) — a context map injected into the system prompt
that accumulates reusable structural knowledge about a recurring
external context across multiple RLM runs.

PeekSession wraps peek-ai's CachePolicy and bridges our
ModelAdapter to peek's LMClient protocol.
Map is injected via the existing system_prompt_supplement slot on
RLM — no changes to the core RLM API.
peek-ai is shipped as an optional extra (not currently on PyPI;
install with pip install git+https://github.com/zhuohangu/peek.git).

What's included

src/pyrlm_runtime/peek_integration.py — adapter, PeekSession,
trace_to_peek_trajectory serializer
tests/test_peek_integration.py — 21 tests, FakeAdapter-based,
skip gracefully without peek-ai
examples/peek_bench/run_oolong_rlm_vs_peek.py — benchmark harness
on oolongbench/oolong-synth (RLM vs RLM+PEEK)
docs/peek-bench/PEEK-EXPERIMENTS.md — pre-committed hypothesis,
decision rule, and results from three runs

Benchmark result: REJECT (do not merge as default behavior)

Aggregate verdict across all phases on oolongbench/oolong-synth is
REJECT against the pre-committed threshold (Δ score ≥ +5pp). PEEK is
shipped as optional only, not wired into any default code path.

Phase 3 (N=5 largest counting contexts, fully online) shows the effect
is sharply dataset-dependent:

Sub-dataset	Δ score	Δ steps/q	Δ tokens/q
agnews	+7.6pp	-1.3	-1,822
negation	+4.0pp	-0.7	-21
imdb	-4.2pp	-18.1	-9,609
multinli	-5.7pp	-1.8	-2,454
metaphors	-10.7pp	-0.3	-887

agnews replicates the paper's Table 1 finding. Other sub-datasets
do not, with two distinct failure modes (premature termination on
imdb, misleading map content on metaphors). Full diagnostics, scorer
validation against abertsch72/oolong, and upstream integration
audit are in docs/peek-bench/PEEK-EXPERIMENTS.md.

Test plan

uv sync — verify no regression on existing deps
uv pip install "git+https://github.com/zhuohangu/peek.git" — install peek-ai
uv run pytest tests/test_peek_integration.py -v — 21 tests pass with peek-ai installed
uv run pytest — full test suite still green
uv run ruff check src/ tests/ examples/ — lint clean
Review docs/peek-bench/PEEK-EXPERIMENTS.md for the experimental writeup

Summary by CodeRabbit

New Features
- Added PEEK Session integration enabling persistent context caching across queries.
- Added benchmark script to compare baseline performance against PEEK-enhanced approaches.
Documentation
- Added project development conventions documentation.
- Added comprehensive experiment methodology and results documentation.
Chores
- Updated project configuration for optional dependency support.

Add PEEK (arXiv:2605.19932) orientation-cache support as an optional extra. `PeekSession` wraps peek-ai's `CachePolicy` and bridges our `ModelAdapter` to peek's `LMClient` protocol. The map is injected via the existing `system_prompt_supplement` slot on `RLM`; no changes to the core API. - src/pyrlm_runtime/peek_integration.py: adapter, `PeekSession`, and `trace_to_peek_trajectory` for serializing a `Trace` into the plain-string trajectory format peek's Distiller expects. - tests/test_peek_integration.py: 21 tests using FakeAdapter and a stub LMClient; skips PEEK-dependent tests gracefully when peek-ai is not installed. - examples/peek_bench/: benchmark harness comparing baseline RLM vs RLM+PEEK on oolongbench/oolong-synth. Supports explicit context selection (`--context-ids`) and fully-online evolution (`--evolve-steps -1`). - docs/peek-bench/PEEK-EXPERIMENTS.md: pre-committed hypothesis, decision rule, and results from three runs. Aggregate verdict is REJECT across all phases on oolong-synth. Phase 3 (N=5 largest contexts, fully-online) shows the effect is sharply dataset-dependent: agnews +7.6pp, negation +4.0pp, imdb -4.2pp, multinli -5.7pp, metaphors -10.7pp. agnews replication is consistent with the paper's Table 1. peek-ai is not on PyPI; install with `pip install git+https://github.com/zhuohangu/peek.git`.

coderabbitai · 2026-05-21T19:13:47Z

📝 Walkthrough

Walkthrough

This PR introduces a complete PEEK integration for pyrlm-runtime: a new optional library module enabling persistent context caching across RLM queries, a benchmark runner comparing baseline versus PEEK-augmented performance on the oolong-synth dataset, comprehensive tests, and documentation of project conventions and experimental results.

Changes

PEEK Integration and Benchmarking

Layer / File(s)	Summary
PEEK Library Core `src/pyrlm_runtime/peek_integration.py`, `pyproject.toml`	`PeekSession` manages context map state across RLM runs using a wrapped `peek.CachePolicy`; `_PeekLMClientAdapter` adapts a `ModelAdapter` to PEEK's LM client interface with token tracking; `trace_to_peek_trajectory()` converts `Trace` objects to PEEK's trajectory format; persistence via `save()` and `load()`; optional dependency declared in config.
PEEK Integration Tests `tests/test_peek_integration.py`	Tests validate trajectory formatting for empty/populated traces with query prepending and depth markers; `_PeekLMClientAdapter` token mapping and passthrough; `PeekSession` state updates, step counting, evolution freezing semantics, empty-trace handling, and `save()`/`load()` round-tripping; regression test confirms trajectory data flows to policy.
Benchmark Comparison Script `examples/peek_bench/run_oolong_rlm_vs_peek.py`	Standalone benchmark comparing baseline RLM (fresh model per query) to PEEK-augmented RLM; loads Azure config from environment; reads oolong-synth dataset, groups by context, filters to ≥5-query groups, selects contexts by seed or explicit IDs; runs both modes per query, computes type-aware example scores (exact match, numeric decay, date parsing), tracks steps/tokens from traces, saves peek maps, aggregates results with delta computation, applies pre-committed decision rule (PROMOTE/HOLD/REJECT based on score/steps/token thresholds).
Project Conventions and Experiment Documentation `.gitignore`, `CLAUDE.md`, `docs/peek-bench/PEEK-EXPERIMENTS.md`	`.gitignore` ignores benchmark run artifacts; `CLAUDE.md` documents dependency/test/build commands, repository structure, code style (Ruff), testing practices (`FakeAdapter`, MontyREPL skips), library design constraints (keep domain logic in consumers), and empirical research procedure with pre-committed decision rule and conservative editing guidance; `PEEK-EXPERIMENTS.md` documents experimental objective, hypothesis, decision-rule thresholds, scorer validation, three-phase run commands/results, PROMOTE/HOLD/REJECT outcomes, cross-run diagnosis (ceiling effects, evolve_steps starvation, model mismatch), follow-ups, and integration audit.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through traces bright,
Wraps PEEK in layers, shining light,
From sessions born, benchmarks run—
Context caching's just begun! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.41% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: adding optional PEEK orientation-cache integration to the codebase.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/peek-integration

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

CLAUDE.md (2)
113-114: ⚡ Quick win

Consider clarifying the OBLIQ reference in a PEEK PR.

These lines reference docs/obliq-bench/OBLIQ-OBJETIVO.md and OBLIQ-EXPERIMENTS.md as examples, but this PR introduces PEEK experiments. Either:

Generalize these references (e.g., "Reference docs for active experiments: docs/{experiment-name}/ directories")

Add a note that OBLIQ is just one example

Update to reference peek-bench as well

This would reduce confusion for readers who see PEEK files in the PR but OBLIQ references in the methodology section.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CLAUDE.md` around lines 113 - 114, Update the reference lines that currently
point only to OBLIQ docs (OBLIQ-OBJETIVO.md and OBLIQ-EXPERIMENTS.md) to avoid
confusion with the new PEEK experiments: either generalize to a pattern like
"Reference docs for active experiments: docs/{experiment-name}/" or add an
explicit note that OBLIQ is one example and include/mention peek-bench (PEEK) as
another example; make the edit in CLAUDE.md near the methodology/reference
section so readers see the generalized wording or the added PEEK mention.
19-42: ⚡ Quick win

Consider documenting peek_integration.py in the structure.

The project structure lists core modules but doesn't mention peek_integration.py, which is added in this PR. Since it's an optional integration module, you might want to either:

Add it to the structure with a note that it's optional

Add a note explaining that optional modules are not listed here

This helps future contributors understand where to find the PEEK integration.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CLAUDE.md` around lines 19 - 42, Update the project structure in CLAUDE.md to
include the new optional integration module by either adding a line for
peek_integration.py under src/pyrlm_runtime/ (e.g., "peek_integration.py  #
Optional PEEK integration") or by adding a short note above the tree explaining
that optional integration modules (like peek_integration.py) are not listed by
default; reference the filename peek_integration.py so readers can locate the
PEEK integration.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/peek_bench/run_oolong_rlm_vs_peek.py`:
- Around line 249-262: The except path can retry session.update_from_run and
re-raise if the first update (after rlm.run) was the source of the exception;
fix by tracking whether update_from_run was already attempted and avoid retrying
(and also guard update calls with their own try/except). Concretely, introduce a
local flag (e.g., updated_from_run = False) before calling rlm.run; after
rlm.run set updated_from_run = True only if session.update_from_run(trace,
query=question) completes successfully (wrap that call in its own try/except and
log/suppress any errors); in the outer except, only attempt
session.update_from_run(trace, query=question) if trace is not None and
updated_from_run is False, and again wrap it in try/except to avoid cascading
failures. Ensure references to rlm.run and session.update_from_run and the trace
variable are used so the changes are applied to the correct code paths.

In `@tests/test_peek_integration.py`:
- Around line 261-267: The test
test_system_prompt_supplement_nonempty_after_update is too permissive because it
allows an empty supplement; after calling
session.update_from_run(_make_simple_trace(), query="Q") assert that
session.system_prompt_supplement (supp) is non-empty and contains the expected
map marker instead of permitting "". Replace the current assertion with a direct
check such as asserting "Context Map" is in supp (and/or supp.strip() != "") so
the test fails if the Cartographer stopped populating the map.
- Around line 322-327: Rename and rework the test
'test_load_raises_without_peek' so it actually simulates peek being unavailable
and asserts the ImportError instead of duplicating the happy-path create check:
monkeypatch the module helper '_require_peek' (or whichever internal function
enforces peek import) to raise ImportError, then call the public API under test
(either PeekSession.load() if present or PeekSession.create(...) which should
call the loader) with a FakeAdapter and assert that an ImportError is raised;
remove the current happy-path assertions that use FakeAdapter without triggering
the import failure.

---

Nitpick comments:
In `@CLAUDE.md`:
- Around line 113-114: Update the reference lines that currently point only to
OBLIQ docs (OBLIQ-OBJETIVO.md and OBLIQ-EXPERIMENTS.md) to avoid confusion with
the new PEEK experiments: either generalize to a pattern like "Reference docs
for active experiments: docs/{experiment-name}/" or add an explicit note that
OBLIQ is one example and include/mention peek-bench (PEEK) as another example;
make the edit in CLAUDE.md near the methodology/reference section so readers see
the generalized wording or the added PEEK mention.
- Around line 19-42: Update the project structure in CLAUDE.md to include the
new optional integration module by either adding a line for peek_integration.py
under src/pyrlm_runtime/ (e.g., "peek_integration.py  # Optional PEEK
integration") or by adding a short note above the tree explaining that optional
integration modules (like peek_integration.py) are not listed by default;
reference the filename peek_integration.py so readers can locate the PEEK
integration.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5a4e211-0b4e-415e-8f5c-d3580dcb4752

📥 Commits

Reviewing files that changed from the base of the PR and between d4d34d3 and 1edf16c.

📒 Files selected for processing (8)

.gitignore
CLAUDE.md
docs/peek-bench/PEEK-EXPERIMENTS.md
examples/peek_bench/__init__.py
examples/peek_bench/run_oolong_rlm_vs_peek.py
pyproject.toml
src/pyrlm_runtime/peek_integration.py
tests/test_peek_integration.py

coderabbitai · 2026-05-21T19:20:29Z

+    try:
+        output, trace = rlm.run(question, context)
+        session.update_from_run(trace, query=question)
+        return (
+            output or "",
+            _trace_step_count(trace),
+            _trace_total_tokens(trace),
+            time.time() - start,
+            None,
+        )
+    except Exception as exc:
+        if trace is not None:
+            session.update_from_run(trace, query=question)
+        return (


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

update_from_run can be retried after failing, causing cascading failure.

If the initial exception comes from session.update_from_run(...) (not rlm.run), the except path calls update_from_run again and can raise again, breaking the benchmark loop.

Suggested fix

- try: - output, trace = rlm.run(question, context) - session.update_from_run(trace, query=question) + try: + output, trace = rlm.run(question, context) + try: + session.update_from_run(trace, query=question) + except Exception: + # Keep benchmark query result; treat map-update failures as non-fatal here. + pass return ( output or "", _trace_step_count(trace), _trace_total_tokens(trace), time.time() - start, None, ) except Exception as exc: - if trace is not None: - session.update_from_run(trace, query=question) return ( "", _trace_step_count(trace), _trace_total_tokens(trace), time.time() - start, str(exc), )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/peek_bench/run_oolong_rlm_vs_peek.py` around lines 249 - 262, The except path can retry session.update_from_run and re-raise if the first update (after rlm.run) was the source of the exception; fix by tracking whether update_from_run was already attempted and avoid retrying (and also guard update calls with their own try/except). Concretely, introduce a local flag (e.g., updated_from_run = False) before calling rlm.run; after rlm.run set updated_from_run = True only if session.update_from_run(trace, query=question) completes successfully (wrap that call in its own try/except and log/suppress any errors); in the outer except, only attempt session.update_from_run(trace, query=question) if trace is not None and updated_from_run is False, and again wrap it in try/except to avoid cascading failures. Ensure references to rlm.run and session.update_from_run and the trace variable are used so the changes are applied to the correct code paths.

coderabbitai · 2026-05-21T19:20:29Z

+    def test_system_prompt_supplement_nonempty_after_update(self) -> None:
+        session = self._session_with_stub(token_budget=1024)
+        session.update_from_run(_make_simple_trace(), query="Q")
+        supp = session.system_prompt_supplement
+        # After update, the Cartographer added a roadmap entry — map is non-empty
+        assert supp == "" or "Context Map" in supp
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This assertion is too permissive and can hide regressions.

After update_from_run() with the stub responses, allowing supp == "" makes the test pass even if map population breaks. Assert non-empty map text directly.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_peek_integration.py` around lines 261 - 267, The test test_system_prompt_supplement_nonempty_after_update is too permissive because it allows an empty supplement; after calling session.update_from_run(_make_simple_trace(), query="Q") assert that session.system_prompt_supplement (supp) is non-empty and contains the expected map marker instead of permitting "". Replace the current assertion with a direct check such as asserting "Context Map" is in supp (and/or supp.strip() != "") so the test fails if the Cartographer stopped populating the map.

coderabbitai · 2026-05-21T19:20:29Z

+    def test_load_raises_without_peek(self, monkeypatch) -> None:
+        # When peek is importable but we simulate the ImportError path via _require_peek
+        # This test just verifies the happy path of create() doesn't crash.
+        fake = FakeAdapter(script=[_DISTILLER_RESPONSE, _CARTOGRAPHER_RESPONSE])
+        session = PeekSession.create(fake, token_budget=512)
+        assert session is not None


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test name/intent does not match behavior being tested.

test_load_raises_without_peek never calls load() and never verifies a raise path; it duplicates a happy-path create check. Please rewrite it to actually simulate missing peek and assert the expected ImportError from load()/create().

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_peek_integration.py` around lines 322 - 327, Rename and rework the test 'test_load_raises_without_peek' so it actually simulates peek being unavailable and asserts the ImportError instead of duplicating the happy-path create check: monkeypatch the module helper '_require_peek' (or whichever internal function enforces peek import) to raise ImportError, then call the public API under test (either PeekSession.load() if present or PeekSession.create(...) which should call the loader) with a FakeAdapter and assert that an ImportError is raised; remove the current happy-path assertions that use FakeAdapter without triggering the import failure.

apenab · 2026-05-29T14:50:28Z

Closing in favor of this: #34

github-actions Bot added 2 commits May 21, 2026 13:17

docs: add project conventions and guidelines in CLAUDE.md

e28f5f7

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

This was referenced May 22, 2026

Replication results on oolongbench/oolong-synth zhuohangu/peek#1

Open

feat(peek): vendor peek-ai with score-decay patch, tracing, and bug fix #34

Open

apenab closed this May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(peek): add optional PEEK orientation-cache integration#33

feat(peek): add optional PEEK orientation-cache integration#33
apenab wants to merge 2 commits into
mainfrom
feat/peek-integration

apenab commented May 21, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 21, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 21, 2026

Uh oh!

coderabbitai Bot May 21, 2026

Uh oh!

coderabbitai Bot May 21, 2026

Uh oh!

apenab commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

apenab commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's included

Benchmark result: REJECT (do not merge as default behavior)

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

apenab commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

apenab commented May 21, 2026 •

edited

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading