
feat(peek): add optional PEEK orientation-cache integration#33

Closed
apenab wants to merge 2 commits into main from feat/peek-integration

apenab (Owner) commented May 21, 2026

Summary

Adds optional integration with PEEK
(Gu et al., MIT CSAIL) — a context map injected into the system prompt
that accumulates reusable structural knowledge about a recurring
external context across multiple RLM runs.

  • PeekSession wraps peek-ai's CachePolicy and bridges our
    ModelAdapter to peek's LMClient protocol.
  • Map is injected via the existing system_prompt_supplement slot on
    RLM — no changes to the core RLM API.
  • peek-ai is shipped as an optional extra (not currently on PyPI;
    install with pip install git+https://github.com/zhuohangu/peek.git).
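
The bridging idea in the bullets above can be illustrated with a minimal sketch. Both protocols below are hypothetical stand-ins: the method names `complete` and `generate`, the `(text, tokens)` return shape, and the `LMClientBridge` and `EchoAdapter` classes are assumptions made for illustration, not the actual `ModelAdapter` or peek `LMClient` signatures.

```python
from dataclasses import dataclass
from typing import Protocol


class ModelAdapter(Protocol):
    """Hypothetical stand-in for the runtime's adapter interface."""

    def complete(self, prompt: str) -> tuple[str, int]:
        """Return (text, tokens_used)."""
        ...


class LMClient(Protocol):
    """Hypothetical stand-in for the client shape peek expects."""

    def generate(self, prompt: str) -> str: ...


@dataclass
class LMClientBridge:
    """Expose a ModelAdapter through the LMClient-shaped interface."""

    adapter: ModelAdapter
    tokens_used: int = 0

    def generate(self, prompt: str) -> str:
        text, tokens = self.adapter.complete(prompt)
        self.tokens_used += tokens  # track spend caused by map maintenance
        return text


class EchoAdapter:
    """Toy adapter used only to exercise the bridge."""

    def complete(self, prompt: str) -> tuple[str, int]:
        return f"echo: {prompt}", len(prompt.split())


bridge = LMClientBridge(EchoAdapter())
reply = bridge.generate("map the context")
```

Keeping the token counter on the bridge rather than on the adapter is one way to separate the cost of maintaining the map from the cost of answering queries; whether the real adapter does this is not stated here.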

What's included

  • src/pyrlm_runtime/peek_integration.py — adapter, PeekSession,
    trace_to_peek_trajectory serializer
  • tests/test_peek_integration.py — 21 tests, FakeAdapter-based,
    skip gracefully without peek-ai
  • examples/peek_bench/run_oolong_rlm_vs_peek.py — benchmark harness
    on oolongbench/oolong-synth (RLM vs RLM+PEEK)
  • docs/peek-bench/PEEK-EXPERIMENTS.md — pre-committed hypothesis,
    decision rule, and results from three runs
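
As a rough illustration of "skip gracefully without peek-ai", the sketch below gates a test class on whether the package can be found. It uses only the standard library (`unittest`), so it is not the mechanism the real test file uses; with pytest, `pytest.importorskip` would be the usual equivalent. The test class names and the `run_suite` helper are hypothetical, and the module name `peek` is assumed from the `peek.CachePolicy` reference in this PR.

```python
import importlib.util
import unittest

# Assumption: the optional package is importable as `peek`.
HAS_PEEK = importlib.util.find_spec("peek") is not None


@unittest.skipUnless(HAS_PEEK, "peek-ai is not installed")
class PeekDependentTests(unittest.TestCase):
    def test_cache_policy_is_exposed(self) -> None:
        import peek  # imported lazily so test collection never fails

        self.assertTrue(hasattr(peek, "CachePolicy"))


class AlwaysRunTests(unittest.TestCase):
    def test_flag_is_boolean(self) -> None:
        self.assertIsInstance(HAS_PEEK, bool)


def run_suite() -> unittest.TestResult:
    """Run both classes; the gated one is reported as skipped, not failed."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(PeekDependentTests))
    suite.addTests(loader.loadTestsFromTestCase(AlwaysRunTests))
    return unittest.TextTestRunner(verbosity=0).run(suite)
```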

Benchmark result: REJECT (do not merge as default behavior)

Aggregate verdict across all phases on oolongbench/oolong-synth is
REJECT against the pre-committed threshold (Δ score ≥ +5pp). PEEK is
shipped as optional only, not wired into any default code path.

Phase 3 (N=5 largest counting contexts, fully online) shows the effect
is sharply dataset-dependent:

| Sub-dataset | Δ score | Δ steps/q | Δ tokens/q |
| --- | --- | --- | --- |
| agnews | +7.6pp | -1.3 | -1,822 |
| negation | +4.0pp | -0.7 | -21 |
| imdb | -4.2pp | -18.1 | -9,609 |
| multinli | -5.7pp | -1.8 | -2,454 |
| metaphors | -10.7pp | -0.3 | -887 |

agnews replicates the paper's Table 1 finding. Other sub-datasets
do not, with two distinct failure modes (premature termination on
imdb, misleading map content on metaphors). Full diagnostics, scorer
validation against abertsch72/oolong, and upstream integration
audit are in docs/peek-bench/PEEK-EXPERIMENTS.md.
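
To make the threshold comparison concrete, here is a small arithmetic sketch over the Phase 3 score deltas listed above. The unweighted mean is an assumption chosen for illustration; the aggregate verdict in the writeup may weight sub-datasets or queries differently, and the steps and token thresholds of the decision rule are not reproduced here. The helper name `clears_threshold` is hypothetical.

```python
from statistics import mean

# Phase 3 score deltas in percentage points, copied from the results above.
PHASE3_DELTA_SCORE_PP = {
    "agnews": 7.6,
    "negation": 4.0,
    "imdb": -4.2,
    "multinli": -5.7,
    "metaphors": -10.7,
}

PROMOTE_THRESHOLD_PP = 5.0  # pre-committed threshold: Δ score ≥ +5pp


def clears_threshold(delta_pp: float, threshold_pp: float = PROMOTE_THRESHOLD_PP) -> bool:
    """True when a score delta meets the pre-committed threshold."""
    return delta_pp >= threshold_pp


# Assumption: a plain unweighted mean across the five sub-datasets.
unweighted_mean_pp = mean(PHASE3_DELTA_SCORE_PP.values())
passing = sorted(name for name, d in PHASE3_DELTA_SCORE_PP.items() if clears_threshold(d))
```

Under that assumption the mean is about -1.8pp, and agnews is the only sub-dataset that clears +5pp on its own.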

Test plan

  • uv sync — verify no regression on existing deps
  • uv pip install "git+https://github.com/zhuohangu/peek.git" — install peek-ai
  • uv run pytest tests/test_peek_integration.py -v — 21 tests pass with peek-ai installed
  • uv run pytest — full test suite still green
  • uv run ruff check src/ tests/ examples/ — lint clean
  • Review docs/peek-bench/PEEK-EXPERIMENTS.md for the experimental writeup

Summary by CodeRabbit

  • New Features

    • Added PEEK Session integration enabling persistent context caching across queries.
    • Added benchmark script to compare baseline performance against PEEK-enhanced approaches.
  • Documentation

    • Added project development conventions documentation.
    • Added comprehensive experiment methodology and results documentation.
  • Chores

    • Updated project configuration for optional dependency support.

github-actions Bot added 2 commits May 21, 2026 13:17
Add PEEK (arXiv:2605.19932) orientation-cache support as an optional
extra. `PeekSession` wraps peek-ai's `CachePolicy` and bridges our
`ModelAdapter` to peek's `LMClient` protocol. The map is injected via
the existing `system_prompt_supplement` slot on `RLM`; no changes to
the core API.

- src/pyrlm_runtime/peek_integration.py: adapter, `PeekSession`, and
  `trace_to_peek_trajectory` for serializing a `Trace` into the
  plain-string trajectory format peek's Distiller expects.
- tests/test_peek_integration.py: 21 tests using FakeAdapter and a
  stub LMClient; skips PEEK-dependent tests gracefully when peek-ai
  is not installed.
- examples/peek_bench/: benchmark harness comparing baseline RLM vs
  RLM+PEEK on oolongbench/oolong-synth. Supports explicit context
  selection (`--context-ids`) and fully-online evolution
  (`--evolve-steps -1`).
- docs/peek-bench/PEEK-EXPERIMENTS.md: pre-committed hypothesis,
  decision rule, and results from three runs. Aggregate verdict
  is REJECT across all phases on oolong-synth. Phase 3 (N=5 largest
  contexts, fully-online) shows the effect is sharply
  dataset-dependent: agnews +7.6pp, negation +4.0pp, imdb -4.2pp,
  multinli -5.7pp, metaphors -10.7pp. agnews replication is
  consistent with the paper's Table 1.

peek-ai is not on PyPI; install with
`pip install git+https://github.com/zhuohangu/peek.git`.
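
The serializer described in the commit message turns a trace into a plain string. A minimal sketch of that idea follows, assuming a hypothetical `Step`/`Trace` schema and line format; the real `Trace` fields and the exact text layout that peek's Distiller expects are not shown in this PR page, so everything below is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """Hypothetical trace step; the real Trace schema may differ."""

    kind: str
    content: str
    depth: int = 0


@dataclass
class Trace:
    steps: list[Step] = field(default_factory=list)


def trace_to_trajectory(trace: Trace, query: str = "") -> str:
    """Flatten a trace into one plain string: query first, one line per step."""
    lines = [f"QUERY: {query}"] if query else []
    for index, step in enumerate(trace.steps, start=1):
        indent = "  " * step.depth  # depth marker for nested sub-calls
        lines.append(f"{indent}[{index}] {step.kind}: {step.content}")
    return "\n".join(lines)


example = Trace([Step("code", "len(context)"), Step("subcall", "count labels", depth=1)])
text = trace_to_trajectory(example, query="How many items?")
```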
coderabbitai Bot commented May 21, 2026

Walkthrough

This PR introduces a complete PEEK integration for pyrlm-runtime: a new optional library module enabling persistent context caching across RLM queries, a benchmark runner comparing baseline versus PEEK-augmented performance on the oolong-synth dataset, comprehensive tests, and documentation of project conventions and experimental results.

Changes

PEEK Integration and Benchmarking

| Layer / File(s) | Summary |
| --- | --- |
| PEEK Library Core: src/pyrlm_runtime/peek_integration.py, pyproject.toml | PeekSession manages context map state across RLM runs using a wrapped peek.CachePolicy; _PeekLMClientAdapter adapts a ModelAdapter to PEEK's LM client interface with token tracking; trace_to_peek_trajectory() converts Trace objects to PEEK's trajectory format; persistence via save() and load(); optional dependency declared in config. |
| PEEK Integration Tests: tests/test_peek_integration.py | Tests validate trajectory formatting for empty/populated traces with query prepending and depth markers; _PeekLMClientAdapter token mapping and passthrough; PeekSession state updates, step counting, evolution freezing semantics, empty-trace handling, and save()/load() round-tripping; regression test confirms trajectory data flows to policy. |
| Benchmark Comparison Script: examples/peek_bench/run_oolong_rlm_vs_peek.py | Standalone benchmark comparing baseline RLM (fresh model per query) to PEEK-augmented RLM; loads Azure config from environment; reads oolong-synth dataset, groups by context, filters to ≥5-query groups, selects contexts by seed or explicit IDs; runs both modes per query, computes type-aware example scores (exact match, numeric decay, date parsing), tracks steps/tokens from traces, saves peek maps, aggregates results with delta computation, applies pre-committed decision rule (PROMOTE/HOLD/REJECT based on score/steps/token thresholds). |
| Project Conventions and Experiment Documentation: .gitignore, CLAUDE.md, docs/peek-bench/PEEK-EXPERIMENTS.md | .gitignore ignores benchmark run artifacts; CLAUDE.md documents dependency/test/build commands, repository structure, code style (Ruff), testing practices (FakeAdapter, MontyREPL skips), library design constraints (keep domain logic in consumers), and empirical research procedure with pre-committed decision rule and conservative editing guidance; PEEK-EXPERIMENTS.md documents experimental objective, hypothesis, decision-rule thresholds, scorer validation, three-phase run commands/results, PROMOTE/HOLD/REJECT outcomes, cross-run diagnosis (ceiling effects, evolve_steps starvation, model mismatch), follow-ups, and integration audit. |
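
The context-selection step summarized for the benchmark script (group by context, keep groups with at least five queries, pick by seed or by explicit IDs) could be sketched as follows. The row field name `context_id`, the function name `select_contexts`, and its parameters are hypothetical; the actual script's data layout and CLI handling are not shown on this page.

```python
import random
from collections import defaultdict


def select_contexts(rows, n, seed=0, context_ids=None, min_queries=5):
    """Group rows by context, keep large-enough groups, then pick n of them."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["context_id"]].append(row)
    eligible = {cid: qs for cid, qs in groups.items() if len(qs) >= min_queries}
    if context_ids is not None:
        # Explicit selection wins; unknown or too-small contexts are dropped.
        return {cid: eligible[cid] for cid in context_ids if cid in eligible}
    # Sorting before sampling keeps the choice reproducible for a given seed.
    chosen = random.Random(seed).sample(sorted(eligible), k=min(n, len(eligible)))
    return {cid: eligible[cid] for cid in chosen}


rows = [{"context_id": "a", "q": i} for i in range(6)]
rows += [{"context_id": "b", "q": i} for i in range(2)]
rows += [{"context_id": "c", "q": i} for i in range(5)]
picked = select_contexts(rows, n=2, seed=7)
```

Here context `b` has only two queries, so it is filtered out before any sampling happens.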

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through traces bright,
Wraps PEEK in layers, shining light,
From sessions born, benchmarks run—
Context caching's just begun! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.41% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding optional PEEK orientation-cache integration to the codebase. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
CLAUDE.md (2)

113-114: ⚡ Quick win

Consider clarifying the OBLIQ reference in a PEEK PR.

These lines reference docs/obliq-bench/OBLIQ-OBJETIVO.md and OBLIQ-EXPERIMENTS.md as examples, but this PR introduces PEEK experiments. Either:

  • Generalize these references (e.g., "Reference docs for active experiments: docs/{experiment-name}/ directories")
  • Add a note that OBLIQ is just one example
  • Update to reference peek-bench as well

This would reduce confusion for readers who see PEEK files in the PR but OBLIQ references in the methodology section.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CLAUDE.md` around lines 113 - 114, Update the reference lines that currently
point only to OBLIQ docs (OBLIQ-OBJETIVO.md and OBLIQ-EXPERIMENTS.md) to avoid
confusion with the new PEEK experiments: either generalize to a pattern like
"Reference docs for active experiments: docs/{experiment-name}/" or add an
explicit note that OBLIQ is one example and include/mention peek-bench (PEEK) as
another example; make the edit in CLAUDE.md near the methodology/reference
section so readers see the generalized wording or the added PEEK mention.

19-42: ⚡ Quick win

Consider documenting peek_integration.py in the structure.

The project structure lists core modules but doesn't mention peek_integration.py, which is added in this PR. Since it's an optional integration module, you might want to either:

  • Add it to the structure with a note that it's optional
  • Add a note explaining that optional modules are not listed here

This helps future contributors understand where to find the PEEK integration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CLAUDE.md` around lines 19 - 42, Update the project structure in CLAUDE.md to
include the new optional integration module by either adding a line for
peek_integration.py under src/pyrlm_runtime/ (e.g., "peek_integration.py  #
Optional PEEK integration") or by adding a short note above the tree explaining
that optional integration modules (like peek_integration.py) are not listed by
default; reference the filename peek_integration.py so readers can locate the
PEEK integration.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5a4e211-0b4e-415e-8f5c-d3580dcb4752

📥 Commits

Reviewing files that changed from the base of the PR and between d4d34d3 and 1edf16c.

📒 Files selected for processing (8)
  • .gitignore
  • CLAUDE.md
  • docs/peek-bench/PEEK-EXPERIMENTS.md
  • examples/peek_bench/__init__.py
  • examples/peek_bench/run_oolong_rlm_vs_peek.py
  • pyproject.toml
  • src/pyrlm_runtime/peek_integration.py
  • tests/test_peek_integration.py

Comment on lines +249 to +262
    try:
        output, trace = rlm.run(question, context)
        session.update_from_run(trace, query=question)
        return (
            output or "",
            _trace_step_count(trace),
            _trace_total_tokens(trace),
            time.time() - start,
            None,
        )
    except Exception as exc:
        if trace is not None:
            session.update_from_run(trace, query=question)
        return (

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

update_from_run can be retried after failing, causing cascading failure.

If the initial exception comes from session.update_from_run(...) (not rlm.run), the except path calls update_from_run again and can raise again, breaking the benchmark loop.

Suggested fix
-    try:
-        output, trace = rlm.run(question, context)
-        session.update_from_run(trace, query=question)
+    try:
+        output, trace = rlm.run(question, context)
+        try:
+            session.update_from_run(trace, query=question)
+        except Exception:
+            # Keep benchmark query result; treat map-update failures as non-fatal here.
+            pass
         return (
             output or "",
             _trace_step_count(trace),
             _trace_total_tokens(trace),
             time.time() - start,
             None,
         )
     except Exception as exc:
-        if trace is not None:
-            session.update_from_run(trace, query=question)
         return (
             "",
             _trace_step_count(trace),
             _trace_total_tokens(trace),
             time.time() - start,
             str(exc),
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/peek_bench/run_oolong_rlm_vs_peek.py` around lines 249 - 262, The
except path can retry session.update_from_run and re-raise if the first update
(after rlm.run) was the source of the exception; fix by tracking whether
update_from_run was already attempted and avoid retrying (and also guard update
calls with their own try/except). Concretely, introduce a local flag (e.g.,
updated_from_run = False) before calling rlm.run; after rlm.run set
updated_from_run = True only if session.update_from_run(trace, query=question)
completes successfully (wrap that call in its own try/except and log/suppress
any errors); in the outer except, only attempt session.update_from_run(trace,
query=question) if trace is not None and updated_from_run is False, and again
wrap it in try/except to avoid cascading failures. Ensure references to rlm.run
and session.update_from_run and the trace variable are used so the changes are
applied to the correct code paths.
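
One way to read the guidance above is to separate the two failure sources so that the map update is attempted at most once and can never abort the query. The sketch below is an assumption-laden illustration, not the benchmark's code: `run_query`, its dictionary return shape, and the `OkRLM`, `BrokenRLM`, and `FlakySession` fakes are all hypothetical. Because the update runs in its own guarded block after the run, no separate "already attempted" flag is needed in this shape.

```python
import time


def run_query(rlm, session, question, context):
    """Run one query; attempt the map update at most once and never let it raise."""
    start = time.time()
    output, trace, error, update_error = "", None, None, None
    try:
        output, trace = rlm.run(question, context)
    except Exception as exc:  # the run itself failed; there is no trace to learn from
        error = str(exc)
    if trace is not None:
        try:
            session.update_from_run(trace, query=question)
        except Exception as exc:  # treated as non-fatal in this sketch
            update_error = str(exc)
    return {
        "output": output or "",
        "error": error,
        "update_error": update_error,
        "elapsed": time.time() - start,
    }


class OkRLM:
    def run(self, question, context):
        return "42", ["step"]


class BrokenRLM:
    def run(self, question, context):
        raise ValueError("run failed")


class FlakySession:
    """Fake session whose update always fails, to exercise the guard."""

    def __init__(self):
        self.calls = 0

    def update_from_run(self, trace, query):
        self.calls += 1
        raise RuntimeError("map update failed")
```

Recording `update_error` instead of silently passing keeps a failed map update visible in the results, which may matter when diagnosing why a map stopped evolving.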

Comment on lines +261 to +267
    def test_system_prompt_supplement_nonempty_after_update(self) -> None:
        session = self._session_with_stub(token_budget=1024)
        session.update_from_run(_make_simple_trace(), query="Q")
        supp = session.system_prompt_supplement
        # After update, the Cartographer added a roadmap entry — map is non-empty
        assert supp == "" or "Context Map" in supp

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This assertion is too permissive and can hide regressions.

After update_from_run() with the stub responses, allowing supp == "" makes the test pass even if map population breaks. Assert non-empty map text directly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_peek_integration.py` around lines 261 - 267, The test
test_system_prompt_supplement_nonempty_after_update is too permissive because it
allows an empty supplement; after calling
session.update_from_run(_make_simple_trace(), query="Q") assert that
session.system_prompt_supplement (supp) is non-empty and contains the expected
map marker instead of permitting "". Replace the current assertion with a direct
check such as asserting "Context Map" is in supp (and/or supp.strip() != "") so
the test fails if the Cartographer stopped populating the map.
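
The difference between the permissive and the tightened check can be shown in isolation. In this sketch `permissive_check` mirrors the original assertion's logic, while `strict_check` is a hypothetical helper implementing the suggested replacement; the sample supplement text is invented for illustration, and only the "Context Map" marker comes from the test under review.

```python
def permissive_check(supp: str) -> bool:
    """Mirrors the original assertion: an empty supplement still passes."""
    return supp == "" or "Context Map" in supp


def strict_check(supp: str) -> bool:
    """Suggested shape: require non-empty text that carries the map marker."""
    return bool(supp.strip()) and "Context Map" in supp


empty = ""
populated = "Context Map\n- entry: labels appear one per line"  # invented sample text
```

An empty supplement passes the permissive check but fails the strict one, which is the regression the review comment is concerned about.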

Comment on lines +322 to +327
    def test_load_raises_without_peek(self, monkeypatch) -> None:
        # When peek is importable but we simulate the ImportError path via _require_peek
        # This test just verifies the happy path of create() doesn't crash.
        fake = FakeAdapter(script=[_DISTILLER_RESPONSE, _CARTOGRAPHER_RESPONSE])
        session = PeekSession.create(fake, token_budget=512)
        assert session is not None

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test name/intent does not match behavior being tested.

test_load_raises_without_peek never calls load() and never verifies a raise path; it duplicates a happy-path create check. Please rewrite it to actually simulate missing peek and assert the expected ImportError from load()/create().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_peek_integration.py` around lines 322 - 327, Rename and rework the
test 'test_load_raises_without_peek' so it actually simulates peek being
unavailable and asserts the ImportError instead of duplicating the happy-path
create check: monkeypatch the module helper '_require_peek' (or whichever
internal function enforces peek import) to raise ImportError, then call the
public API under test (either PeekSession.load() if present or
PeekSession.create(...) which should call the loader) with a FakeAdapter and
assert that an ImportError is raised; remove the current happy-path assertions
that use FakeAdapter without triggering the import failure.
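
A self-contained sketch of the suggested test shape follows. It uses `unittest.mock.patch.object` from the standard library rather than pytest's `monkeypatch` fixture, and `FakeSessionModule` is a hypothetical stand-in for the integration module: whether the real `PeekSession.create()` or `load()` calls `_require_peek` is an assumption taken from the review comment, not something verified here.

```python
import unittest
from unittest import mock


class FakeSessionModule:
    """Stand-in for the integration module: one guard, one factory that calls it."""

    @staticmethod
    def _require_peek() -> None:
        return None  # pretend peek-ai is installed

    @classmethod
    def create(cls, adapter: object) -> dict:
        cls._require_peek()  # raises ImportError when peek-ai is unavailable
        return {"adapter": adapter}


class CreateWithoutPeek(unittest.TestCase):
    def test_create_raises_without_peek(self) -> None:
        guard = mock.patch.object(
            FakeSessionModule, "_require_peek", side_effect=ImportError("peek-ai missing")
        )
        # Inside the patch the guard raises, so create() must propagate ImportError.
        with guard, self.assertRaises(ImportError):
            FakeSessionModule.create(adapter=object())


result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(CreateWithoutPeek)
)
```

The patch is scoped to the `with` block, so the happy path is restored as soon as the test finishes.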

apenab (Owner, Author) commented May 29, 2026

Closing in favor of this: #34

@apenab apenab closed this May 29, 2026