From 597db6fed952fbc8e7436c6a58849f66853b2d6e Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Mon, 25 May 2026 17:22:45 +0200 Subject: [PATCH 1/2] docs: add project conventions and guidelines in CLAUDE.md Documents the pyrlm-runtime project's structure, code style, testing conventions, REPL backends, design constraints (the "library is generic" rule), and the empirical research methodology that gates benchmark-driven work. --- CLAUDE.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CLAUDE.md b/CLAUDE.md index 69c8e52..fbab49e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -100,7 +100,7 @@ approaches** (keywords: "probar", "medir", "benchmark", "experimento", "comparar contra baseline", "ver si mejora"). It does NOT apply to normal coding tasks (bug fixes, refactors, new features). -When triggered, follow this sequence without waiting to be asked: +When triggered, follow this sequence: 1. **Hypothesis first.** State what you expect to happen and why, before running anything. One sentence is enough. From 169853e28254d171f1557c32dcdeb6ac09b6bc56 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Mon, 25 May 2026 17:23:19 +0200 Subject: [PATCH 2/2] feat(peek): vendor peek-ai with score-decay patch, tracing, and bug fix MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds PEEK orientation-cache (arXiv:2605.19932) support to pyrlm-runtime, vendored at vendor/peek/ from zhuohangu/peek commit 57de91ac (Apache-2.0 preserved with VENDOR.md). PeekSession bridges our ModelAdapter to peek's LMClient protocol and injects the map through the existing system_prompt _supplement slot on RLM — no changes to the core API. What's included: - vendor/peek/: vendored upstream copy with two patches applied inline, documented with `# peek-patch` comments: * vendor/peek/core/evictor.py — score-decay patch (C3): existing scores are multiplied by SCORE_DECAY=0.85 before each tagging pass, and neutral tags contribute -NEUTRAL_PENALTY=-0.5 instead of 0. Constants 1.0/0.0 recover upstream behaviour exactly. * vendor/peek/_io.py — extract_json now scans fenced code blocks with an O(n) state machine instead of `re.findall(r"```...(.*?)\s*```", re.DOTALL | re.IGNORECASE)`. The regex catastrophically backtracks on LLM output with an unclosed fence; reproduced as a 45-minute hung benchmark run. - src/pyrlm_runtime/peek_integration.py: PeekSession, _PeekLMClientAdapter, and trace_to_peek_trajectory. Optional trace_dir writes per-update JSON snapshots (query, trajectory, map before/after with item scores, parsed Distiller and Cartographer outputs, diff) for offline diagnosis. - examples/peek_bench/run_oolong_rlm_vs_peek.py: benchmark harness comparing RLM vs RLM+PEEK on oolongbench/oolong-synth (the only OOLONG dataset on HuggingFace as of 2026-05-25). Supports explicit context selection, fully-online evolution, and per-query JSON tracing via --trace. - examples/peek_bench/analyze_peek_trace.py: diagnostic analyzer reporting per-query churn, Distiller tag distribution, Cartographer operation breakdown, per-item lifetimes, and duplicate rate. - tests/test_peek_integration.py: 24 tests (FakeAdapter / stub LMClient driven), including regression tests for the trace_dir behaviour and the catastrophic-backtracking bug fix. - docs/peek-bench/PEEK-EXPERIMENTS.md: full experiment log — pre-committed hypothesis and decision rule, Pilot, Phase 2 (N=27), Phase 3 (N=5 large contexts), Phase 4 (vendoring + C3 patch + C5' ablation + bug fix). - docs/peek-bench/PEEK-DIAGNOSIS.md: trace-based diagnosis of PEEK upstream failure modes — silent-failures audit (refuted), Distiller blindness (confirmed and addressed by C3), hallucination, etc. Phase 4.C.1 (C3 score decay) result: +5.7pp aggregate improvement vs upstream in an in-session A/B (n=5 large contexts). agnews +18.3pp, multinli +7.4pp, metaphors +4.0pp; imdb and negation within the LLM noise floor. Trace-level mechanism confirmed (metaphors neutral-tag rate dropped 29% -> 4%). Phase 4.C.2 (C5' section-weighted decay + rr-* birth bonus) result: REJECTED across 3 paired A/B trials (mean -3.2pp vs C3 alone, dominated by per-sub-dataset noise of ±5–20pp on N=1). Mechanism kept in vendored library only via the validated C3 globals; per-section dicts removed from the working tree. Full ablation documented in PEEK-EXPERIMENTS.md. peek-ai vendored — no PyPI install needed. tiktoken>=0.7.0 added as the `peek` extra. pyproject.toml updated to import vendor/peek/ as `peek`. --- docs/peek-bench/PEEK-DIAGNOSIS.md | 225 ++++++ docs/peek-bench/PEEK-EXPERIMENTS.md | 490 ++++++++++++ examples/peek_bench/__init__.py | 0 examples/peek_bench/analyze_peek_trace.py | 453 +++++++++++ examples/peek_bench/run_oolong_rlm_vs_peek.py | 703 ++++++++++++++++++ pyproject.toml | 5 +- src/pyrlm_runtime/peek_integration.py | 327 ++++++++ tests/test_peek_integration.py | 412 ++++++++++ vendor/peek/LICENSE | 201 +++++ vendor/peek/VENDOR.md | 30 + vendor/peek/__init__.py | 56 ++ vendor/peek/_io.py | 91 +++ vendor/peek/core/__init__.py | 32 + vendor/peek/core/cartographer.py | 77 ++ vendor/peek/core/context_map.py | 144 ++++ vendor/peek/core/distiller.py | 63 ++ vendor/peek/core/evictor.py | 89 +++ vendor/peek/core/policy.py | 156 ++++ vendor/peek/core/types.py | 68 ++ vendor/peek/llm/__init__.py | 20 + vendor/peek/llm/anthropic_client.py | 53 ++ vendor/peek/llm/base.py | 18 + vendor/peek/llm/gemini_client.py | 68 ++ vendor/peek/llm/openai_client.py | 55 ++ vendor/peek/prompts/cartographer.txt | 95 +++ vendor/peek/prompts/distiller.txt | 121 +++ vendor/peek/prompts/initial_context_map.txt | 14 + 27 files changed, 4064 insertions(+), 2 deletions(-) create mode 100644 docs/peek-bench/PEEK-DIAGNOSIS.md create mode 100644 docs/peek-bench/PEEK-EXPERIMENTS.md create mode 100644 examples/peek_bench/__init__.py create mode 100644 examples/peek_bench/analyze_peek_trace.py create mode 100644 examples/peek_bench/run_oolong_rlm_vs_peek.py create mode 100644 src/pyrlm_runtime/peek_integration.py create mode 100644 tests/test_peek_integration.py create mode 100644 vendor/peek/LICENSE create mode 100644 vendor/peek/VENDOR.md create mode 100644 vendor/peek/__init__.py create mode 100644 vendor/peek/_io.py create mode 100644 vendor/peek/core/__init__.py create mode 100644 vendor/peek/core/cartographer.py create mode 100644 vendor/peek/core/context_map.py create mode 100644 vendor/peek/core/distiller.py create mode 100644 vendor/peek/core/evictor.py create mode 100644 vendor/peek/core/policy.py create mode 100644 vendor/peek/core/types.py create mode 100644 vendor/peek/llm/__init__.py create mode 100644 vendor/peek/llm/anthropic_client.py create mode 100644 vendor/peek/llm/base.py create mode 100644 vendor/peek/llm/gemini_client.py create mode 100644 vendor/peek/llm/openai_client.py create mode 100644 vendor/peek/prompts/cartographer.txt create mode 100644 vendor/peek/prompts/distiller.txt create mode 100644 vendor/peek/prompts/initial_context_map.txt diff --git a/docs/peek-bench/PEEK-DIAGNOSIS.md b/docs/peek-bench/PEEK-DIAGNOSIS.md new file mode 100644 index 0000000..e68e6ef --- /dev/null +++ b/docs/peek-bench/PEEK-DIAGNOSIS.md @@ -0,0 +1,225 @@ +# PEEK Diagnosis — Phase B (2026-05-21) + +Diagnostic analysis of PEEK behaviour using the rich per-update traces +captured in Phase A. Each q{NN}.json holds the map before / after, the +parsed Distiller / Cartographer outputs, and the diff of items +added / evicted / modified. + +Source: `docs/peek-bench/runs/run_20260521_221901_compare_gpt-5.4-mini_n5/traces/`. +Reproduce with `examples/peek_bench/analyze_peek_trace.py`. + +--- + +## Setup + +- 5 largest counting contexts of `oolongbench/oolong-synth` (one per sub-dataset). +- 25 queries each, fully online (`evolve_steps=-1`). +- gpt-5.4-mini (Azure), B=1024 tokens, peek-ai vendored @ 57de91ac, no patches. + +The aggregate score result (Δ=−2.5pp) is within noise of Phase 3 (Δ=−1.8pp); +per-sub-dataset numbers swing several points run-to-run due to LLM +non-determinism. The traces below are about behaviour, not score. + +## Aggregate behaviour per sub-dataset + +| Sub-dataset | Neutral tag rate | Ops/query | Avg map size | Cart emit_repl | Silent fail rate | Max dup pairs | +|---|---|---|---|---|---|---| +| imdb (20037) | 16% | 2.20 | 4.16 | 27 | 0% | 1 | +| agnews (30036) | 24% | 2.16 | 5.52 | 20 | 0% | 3 | +| negation (40033) | 24% | 2.32 | 5.40 | 29 | 0% | 1 | +| multinli (70033) | 13% | 2.32 | 4.76 | 37 | 0% | 1 | +| **metaphors (80021)** | **33%** | **3.56** | **7.96** | **62** | 0% | 1 | + +## Audit hypotheses, tested + +### H1 — Silent operation failures: **REFUTED** + +The audit predicted that Cartographer would emit REPLACE / DELETE ops +against non-existent item IDs and they would silently no-op. Tested across +all 5 contexts × 25 queries: **0 silent failures out of 327 emitted ops.** + +The Cartographer prompt is good enough that it never references items +that have just been evicted or that never existed. Dropping patch C1 +(operation validation) from Phase C — no upside. + +### H2 — Duplicate detection: **WEAKENED** + +The original by-hand inspection of the Phase 3 metaphors map showed +duplicates (cu-00015 ≈ cu-00019). In this Phase A.5 rerun, duplicate +pairs (Jaccard ≥ 0.7) peaked at 1 for metaphors and 3 for agnews — not +the dominant signal. The duplicates we saw in Phase 3 appear to be a +particular run artifact, not systematic. + +Keeping C2 (duplicate detection at ADD time) as a low-priority candidate. + +### H3 — Distiller blindness / score-zero stickiness: **CONFIRMED** + +Metaphors has **33% neutral tags** — 2× higher than other sub-datasets. +This means the Distiller spends a third of its tag budget saying "I don't +know if this item helped" on metaphors. + +Items tagged neutral never accumulate score and live forever until age +eviction. Metaphors's map is **largest (7.96 items avg) and churns most +(3.56 ops/query, 62 REPLACEs)** because old neutral items remain pinned +while Cartographer keeps editing them. + +Direct mechanism: +- `evictor.py:19-28`: `update_scores()` treats "neutral" as a no-op + (`out.setdefault(item_id, 0)`). +- `evictor.py:30`: sort key is `(scores.get(bid, 0), _id_age(bid))` — at + score 0, the *older* item evicts first, which inverts value. +- Net effect: a freshly-added "neutral" item is more protected than an + older neutral item that has demonstrated mild utility, *unless* the + Distiller explicitly retags the older item as helpful in a subsequent + pass. + +This is the dominant failure mode the data supports. **C3 (score decay) +moves to high priority.** + +### H4 — Map content hallucination: **CONFIRMED** + +q05 of metaphors (this run) added the item: + +``` +[cu-00010] Corpus-level label balance is exact: correct = 39,060 and + incorrect = 39,060. +``` + +The actual `oolongbench/oolong-synth` metaphors split (full corpus) is +heavily skewed (~3:1 incorrect>correct in the original Phase 3 spot +check). The Cartographer added a fabricated "exact balance" claim +derived from the agent's intermediate REPL exploration, then propagated +it through subsequent queries. + +This is not addressed by any of the original C1–C5 candidates. A new +candidate emerges: **C4' (cache-candidate provenance check)** — only +accept Cartographer ADDs whose content is corroborated by a +`cache_candidate` in the Distiller output for the *current* query +(rather than the Cartographer freely inventing items). + +## Per-item lifetime patterns + +Across the 5 contexts, `used_after` (proxy: literal `[id]` substring in +later trajectories) is **0 for every item**. This is not surprising — +the agent uses the map's *content*, not its IDs, so the proxy is wrong. +A better proxy would be content-level token overlap between map items +and later root_call outputs. Adding to follow-up. + +Top-scoring items in agnews are concrete reusable facts +(`rr-00004: Full-corpus label counts: Sci/Tech 9706, Business 13361, …`). +Top-scoring in metaphors are vaguer structural notes +(`cu-00002: Each example pairs a metaphorical statement with a candidate +literal interpretation`). The qualitative difference is consistent +with the "PEEK helps when the map can hold concrete answers" +observation from Phase 3. + +## Implications for Phase C + +Priority order based on evidence: + +- **C3 (score decay) — high priority, strongest evidence.** + Specifically: multiply existing scores by `decay = 0.85` at each update, + and tag "neutral" as `-0.5` (down from 0) so undecorated items drift + toward eviction. Decision rule: improve neutral_rate distribution + symmetry across sub-datasets, and improve score on metaphors / multinli. + +- **C4' (Cartographer provenance check) — medium-high priority, + motivated by observed hallucination.** + Require every ADD operation's content to overlap (token-level) with at + least one of the Distiller's `cache_candidates` for that query. Block + ADDs that the LLM invented out of nothing. + +- **C2 (duplicate detection) — low priority.** Keep as a small + defensive patch; the trace evidence does not support it as dominant. + +- **C1 (operation validation) — dropped.** No silent failures observed. + +- **C5 (section-aware eviction) — keep as candidate.** Maps shrink to + 4–8 items, so eviction does happen; protecting at least one item per + critical section may matter. Defer until C3 is in. + +Phase C will implement C3 first, measure, then layer C4' if C3 alone +does not move the needle. + +--- + +## Post-C3 trace findings (2026-05-22) + +After C3 was applied and validated with the clean A/B (+5.7pp aggregate +improvement), inspection of the per-query map evolution under C3 surfaced +two distinct new failure modes that the original Phase B audit did not +anticipate: + +### Finding 1 — `rr-*` items are evicted prematurely under C3 + +`reusable_results` (rr-*) items are concrete facts the agent computed +(exact counts, frequency rankings, distinct keys). They take real REPL +work to derive. C3 treats every item identically: decay 0.85 / queue +toward eviction unless re-tagged helpful by the Distiller. + +The problem: the Distiller can only mark an item helpful when it is +*used* in the current trajectory. An rr-* item like "Exact label counts: +Sci/Tech 9706; Business 13361; ..." is highly valuable but only relevant +to questions about label frequency. Across a 25-query mix it may be +helpful on, say, 4 queries and neutral on the rest. Under C3 it decays +fast enough to be evicted before its next relevant query arrives. + +**Concrete instance (agnews, C3 run):** +- q05 map contained `[rr-00004] Exact label counts: Sci/Tech 9706; + Business 13361; Sports 12322; World 8879.` with score 3. +- By q12 the item had been evicted. +- Agnews score Δ in this run was still +11.6pp (best in the cohort) — + C3 helped overall — but the eviction of useful concrete results is + visible cost, not a gain. + +### Finding 2 — newly-added items are fragile under C3 + +A new ADD enters `scores` with score 0. C3's neutral penalty turns the +first unrelated query into −0.5; one more makes it −1.0. Items can be +evicted before they have a chance to prove useful, especially +expensive-to-derive `rr-*` items added at the moment of computation. + +**Concrete instance (metaphors, C3 run q24):** `[rr-00016] Corpus-wide +processing can use one regex/pass...` born at score 0. Already at +neutral-penalty risk on the next update. + +### Finding 3 — hallucinated content was absent in the C3 run (cannot attribute to patch) + +In the Phase A.5 (no-patch) metaphors run, q05 added a fabricated +balance claim "correct = incorrect = 39,060". The C3 run did not +reproduce this — the q05 map instead held a correct "binary +correct/incorrect" statement. Whether C3 *prevents* such hallucinations +or whether the Distiller happened not to feed an invented claim is not +distinguishable from a single A/B run; the architectural loophole (the +Cartographer can ADD items without grounding in any Distiller +`cache_candidate`) remains. + +--- + +## Implications for Phase C.2 + +The original Phase C plan listed C5 as "section-aware eviction" +(low-priority placeholder). The post-C3 traces motivate a different, +more specific patch: + +**C5' — Section-weighted decay and birth bonus.** Replaces the original +C5 with a mechanism-specific change in `vendor/peek/core/evictor.py`: + +- Per-section `DECAY` and `NEUTRAL_PENALTY` instead of global constants. +- Suggested defaults (subject to A/B): + - `reusable_results`: decay 0.95, neutral_penalty 0.25 (slow drift — + these are computed facts) + - `parsing_schema`, `domain_constants`: decay 0.90, neutral_penalty 0.4 + - `context_understanding`, `context_roadmap`: decay 0.80, + neutral_penalty 0.6 (faster drift — abstract claims should evolve) + - `error_patterns`: decay 0.80, neutral_penalty 0.5 +- Birth bonus: newly added `reusable_results` items start at +1.0 + instead of 0 (grace period to demonstrate utility). + +**C4' — Cartographer ADD provenance check** is still relevant +independently. It closes the hallucination loophole structurally. +Recommend running C5' first (stronger trace evidence), then layering +C4' if hallucination is observed in the C5' traces. + +The original C2 (duplicate detection) and the dropped C1 (operation +validation) remain off the table. diff --git a/docs/peek-bench/PEEK-EXPERIMENTS.md b/docs/peek-bench/PEEK-EXPERIMENTS.md new file mode 100644 index 0000000..cebdf2e --- /dev/null +++ b/docs/peek-bench/PEEK-EXPERIMENTS.md @@ -0,0 +1,490 @@ +# PEEK Experiments — pyrlm-runtime + +Reference: arXiv:2605.19932 (PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents) + +## North Star + +Replicate the PEEK result from Table 1 of the paper on our own `pyrlm_runtime.RLM` stack. +Threshold to promote to `main`: **Δ score ≥ +5pp** AND **steps ≤ baseline** AND **cost ≤ 1.5× baseline**. + +--- + +## Hypothesis (pre-committed 2026-05-21) + +> Integrating PEEK as an orientation-cache over `pyrlm_runtime.RLM` on a +> "same Context × N queries" workload (oolong-synth) will produce ≥ +5 absolute +> score points over our anchored RLM baseline, with equal or fewer iterations per +> query and total cost ≤ 1.5× the baseline cost. +> +> Rationale: the paper reports +27.8pp over RLM-base on TREC-Q-coarse; +5pp is +> ~1/5 of that signal, allowing for differences between their RLM and ours. + +## Decision Rule (pre-committed 2026-05-21) + +After Phase 2 pilot (N=30 contexts, oolong-synth, cache LLM OFF, seed=42): + +- **PROMOTE** → merge to main: Δ score ≥ +5pp AND steps/q ≤ baseline AND tokens/q ≤ 1.5× baseline +- **HOLD** → scale to N=150 across splits: +2pp ≤ Δ score < +5pp AND other KPIs OK +- **REJECT** → document below, do not merge: Δ score < +2pp OR steps/q > 1.2× baseline OR tokens/q > 2× baseline + +Apply mechanically. Do not promote "close enough". + +--- + +## Scorer validation (2026-05-21) + +Our `score_example()` in `examples/peek_bench/run_oolong_rlm_vs_peek.py` was +verified against the official scorer in +[`abertsch72/oolong/src/eval/eval_helpers.py`](https://github.com/abertsch72/oolong/blob/main/src/eval/eval_helpers.py) +(`synth_process_response` + `synth_attempt_answer_parse`). + +| Aspect | Official | Ours | Match | +|---|---|---|---| +| Parse: split on `:`, strip `*[]` | yes | yes | ✓ | +| Exact match → 1.0 | yes | yes | ✓ | +| Comparison ("more common"/etc) | substring match in gold | substring match in gold | ✓ | +| NUMERIC | `0.75 ** abs(diff)` | `0.75 ** abs(diff)` | ✓ | +| DATE | `dateutil.parser.parse` flexible match | (originally missing — now fixed) | ✓ after fix | +| MONTH_YEAR (0.8%) | not handled | not handled | ✓ tied | + +**Type distribution in oolong-synth:** COMPARISON 31.5%, NUMERIC 27.4%, LABEL 24.0%, USER 14.4%, DATE 1.9%, MONTH_YEAR 0.8% (5200 total rows). + +The first N=30 run (Phase 2 below) used the pre-fix scorer. DATE answers are +1.9% of queries; re-scoring Phase 2 offline with the fixed scorer changed +both baseline and PEEK by +2.1pp symmetrically — net Δ unchanged. + +--- + +## Runs + +### Pilot — N=5 contexts, compare mode (2026-05-21) + +**Status:** COMPLETED — pilot only; plan specifies N=30, decision rule not applied here. + +**Command:** +```bash +uv run python examples/peek_bench/run_oolong_rlm_vs_peek.py \ + --model gpt-5.4-mini --n-contexts 5 --mode compare \ + --seed 42 --env-tips --evolve-steps 4 +``` + +**Results:** + +| Method | N contexts | avg_score | avg_steps/q | avg_tokens/q | +|---|---|---|---|---| +| RLM baseline | 5 | 0.8398 | 9.3 | 16,691 | +| RLM + PEEK | 5 | 0.8349 | 10.3 | 17,096 | +| Δ | — | **−0.0048 (−0.6%)** | +1.0 | +405 | + +Run artifacts: `docs/peek-bench/runs/run_20260521_144509_compare_gpt-5.4-mini_n5/` + +N=5 surfaced a baseline of 84% (ceiling effect) and PEEK adding ~+1 step and +~+400 tokens per query on small contexts. Scaled to N=30 (Phase 2) before +applying the decision rule. + +--- + +### Phase 2 — N=27 contexts (of 30 planned), compare mode (2026-05-21) + +**Status:** COMPLETED (27/30 contexts — Mac crashed during contexts 28–30; remaining cannot reverse the result, see below) → **REJECT** + +**Command:** +```bash +uv run python examples/peek_bench/run_oolong_rlm_vs_peek.py \ + --model gpt-5.4-mini --n-contexts 30 --mode compare \ + --seed 42 --env-tips --evolve-steps 4 +``` + +**Results (re-scored offline with fixed DATE branch):** + +| Method | N contexts | avg_score | avg_steps/q | avg_tokens/q | +|---|---|---|---|---| +| RLM baseline | 27 | 0.8308 | 10.11 | 16,676 | +| RLM + PEEK | 27 | 0.8148 | 9.75 | 16,994 | +| Δ | — | **−0.0161 (−1.93%)** | −0.35 | +318 (1.02×) | + +Run artifacts: `docs/peek-bench/runs/run_20260521_155200_compare_gpt-5.4-mini_n30/` + +**Decision rule outcome: REJECT** +- score ≥ +5pp: ✗ (actual: −1.6pp) +- steps ≤ baseline: ✓ (−0.35 steps/q) +- tokens ≤ 1.5×: ✓ (1.02×) + +**Why 3 missing contexts cannot reverse the result.** To move from −1.6pp to +HOLD threshold (+2pp) would require each remaining context to show Δ ≥ +30pp, +which is more than the paper's best split (+27.8pp). Not realistic. + +**Scorer cross-check.** Aggregating with the original scorer (no DATE branch) +gives baseline=0.8098, peek=0.7938, Δ=−1.61pp — identical delta. The DATE fix +raises both methods by ~+2.1pp symmetrically. Scorer choice does not affect the +verdict. 38 DATE queries re-scored, 24 changed; net delta change: 0. + +**Observation:** PEEK reduces steps (−0.35/q) but degrades score (−1.6pp). +On structured, labeled, easy tasks the orientation cache adds no usable +guidance and biases the model toward earlier termination with less-verified +answers. + +--- + +### Phase 3 — N=5 large contexts, fully-online (pre-committed 2026-05-21) + +**Motivation:** Two findings from the N=27 audit point to a follow-up before +publishing the negative result: +1. Large contexts (>6M chars) in N=27 partial data showed PEEK genuinely + reducing steps (e.g. ctx 20024 baseline=65 → PEEK=12). +2. Paper's Algorithm 1 takes `m ≤ n` and does not endorse `m=4` as universal + default; peek-ai's `m=4` freezes the map after ~10% of queries on a 25–50 + query context. + +**Setup:** +- N=5 hand-picked largest counting contexts (one per dataset for diversity): + - 20037 (14.7M chars, imdb) + - 30036 (13.4M chars, **agnews** — closest to paper's hardest split) + - 80021 (12.8M chars, metaphors) + - 40033 (6.8M chars, negation) + - 70033 (7.0M chars, multinli) +- `evolve_steps=-1` → PeekSession with `evolve_steps=None` (fully online; map + never freezes) +- Model: gpt-5.4-mini (Azure), cache OFF +- Same scorer (with DATE fix), same env-tips + +**Hypothesis (pre-committed):** +> On the 5 largest counting contexts of oolong-synth, with the map fully +> online, RLM+PEEK will recover the +5pp threshold over the RLM baseline that +> the paper's TREC-Q-coarse result suggests. Specifically: Δ score ≥ +5pp, +> Δ steps ≤ −2 (large reduction expected from orientation cache amortizing +> over many queries), tokens ≤ 1.5× baseline. +> +> Rationale: large contexts give PEEK room to amortize the cost of building a +> map; fully-online evolution keeps the map adapting as queries diversify; +> counting tasks on multi-million-char contexts have low baseline scores +> (visible in partial N=27 data, e.g. ctx 60022 with 4 baseline failures). + +**Decision rule (pre-committed):** +- **PROMOTE** → integrate PEEK as optional feature: Δ score ≥ +5pp AND steps ≤ baseline AND tokens ≤ 1.5× +- **HOLD** → run N=15 across all large contexts: +2pp ≤ Δ < +5pp AND other KPIs OK +- **REJECT (final)** → document, do not integrate, post negative result on X: Δ < +2pp + +Apply mechanically. + +**Command:** +```bash +uv run python examples/peek_bench/run_oolong_rlm_vs_peek.py \ + --model gpt-5.4-mini --mode compare \ + --context-ids 20037,30036,80021,40033,70033 \ + --evolve-steps -1 --env-tips +``` + +**Status:** COMPLETED 2026-05-21 → **REJECT in aggregate** + +**Aggregate (N=5):** + +| Method | avg_score | avg_steps/q | avg_tokens/q | +|---|---|---|---| +| RLM baseline | 0.9337 | 13.0 | 17,867 | +| RLM + PEEK | 0.9156 | 8.5 | 14,909 | +| Δ | **−0.0181 (−1.9%)** | **−4.5** | **−2,959 (0.83×)** | + +Decision rule outcome: **REJECT** (score < +2pp threshold). +- score ≥ +5pp: ✗ (actual: −1.8pp) +- steps ≤ baseline: ✓ (−4.5 steps/q) +- tokens ≤ 1.5×: ✓ (0.83×, far below threshold) + +**Per-dataset breakdown:** + +| Dataset | M chars | base_s | peek_s | Δ score | Δ steps | Δ tokens | W/T/L | +|---|---|---|---|---|---|---|---| +| agnews | 13.4 | 0.924 | 1.000 | **+7.6pp** | −1.3 | −1,822 | 2/23/0 | +| negation | 6.8 | 0.960 | 1.000 | +4.0pp | −0.7 | −21 | 1/24/0 | +| imdb | 14.7 | 0.922 | 0.880 | −4.2pp | **−18.1** | **−9,609** | 2/20/3 | +| multinli | 7.0 | 0.943 | 0.885 | −5.7pp | −1.8 | −2,454 | 0/23/2 | +| metaphors | 12.8 | 0.920 | 0.813 | −10.7pp | −0.3 | −887 | 0/22/3 | + +Run artifacts: `docs/peek-bench/runs/run_20260521_195528_compare_gpt-5.4-mini_n5/` + +**Observations:** + +- agnews (+7.6pp, perfect 1.000 score) is one of the three splits in Table 1 + of the PEEK paper. PEEK replicates on it within oolong-synth. +- Per-sub-dataset Δ scores span 18 points (+7.6pp on agnews to −10.7pp on + metaphors) on contexts of similar size and same task type (counting). + Difficulty alone does not explain the variance — baseline scores are + 0.92–0.96 across all five. +- imdb: −4.2pp score with −18.1 steps/q and −9,609 tokens/q. The map drives + early termination; on three queries this terminates before reaching the + correct answer. +- metaphors: on the three score regressions PEEK uses more steps (13 vs 8.3), + not fewer. Failure mode here is different from imdb — the map content + itself misleads. +- Token ratio 0.83× across all five sub-datasets. PEEK is strictly more + token-efficient than baseline in this setup; the cost is purely accuracy. +- `evolve_steps=-1` (fully online) verified active: max map_step = 25, + `evolving=True` on all queries. + +--- + +## Diagnosis across all three runs + +**1. Ceiling effect (Pilot and Phase 2 only).** +Baseline at 84% on small/medium contexts gave PEEK <16pp of headroom vs the 70pp the paper +had on TREC-Q-coarse. This explains the Phase 2 aggregate (Δ=−1.6pp at N=27). + +**2. Effect is dataset-dependent, not just difficulty-dependent (Phase 3).** +On the 5 largest counting contexts in Phase 3, baseline scores are 0.92–0.96 (still +high), yet PEEK shows +7.6pp on **agnews** and +4.0pp on **negation**, while +losing −4 to −10pp on imdb/multinli/metaphors. Same baseline level, same +task type, same model — different sub-dataset, opposite sign. This rules out +"pure ceiling effect" as the only mechanism. + +**3. `evolve_steps=4` (peek-ai default) starves evolution.** +With ~25 queries per context, the map only evolved during the first 4 queries. +Phase 3 used `evolve_steps=-1` (fully online) — the map evolved through all 25, +which is closer to what Algorithm 1 in the paper supports. + +**4. Two distinct failure modes on losing sub-datasets (Phase 3).** +- imdb: PEEK uses fewer steps on regressions — premature termination. +- metaphors: PEEK uses *more* steps on regressions (13 vs 8.3) — the map + content itself misleads, not just the stopping rule. + +**5. Model mismatch.** +gpt-5.4-mini ≠ gpt-5-mini-2025-08-07 from the paper. Probably a minor factor +given the dataset-specific results in Phase 3. + +--- + +### Open follow-ups + +- Official OOLONG (Bertsch et al., arXiv:2511.02817) — trec_coarse/agnews/yahoo + splits where baseline scores are ~30%. Not on HuggingFace as of 2026-05-21. +- Scale Phase 3 to N=10–15 contexts per sub-dataset to confirm the per-dataset + pattern (Phase 3 has only 1 context per sub-dataset). +- The PEEK blog post ([@astrogu_](https://zhuohangu.github.io/blog-post-peek/)) + notes: *"the usefulness of the context map depends on how the agent interacts + with the context. If that interaction reveals little reusable knowledge, the + map has little to cache."* Phase 3 is an empirical instance. + +--- + +## Integration & paper audits (2026-05-21) + +### Integration audit vs peek-ai upstream (`zhuohangu/peek` commit 57de91ac) + +- `CachePolicy.update(*, trajectory: str, question: str = "")` — kwargs match ✓ +- `LMClient.completion(messages) -> str`, `last_usage() -> Usage` — match ✓ +- `evolving = steps < evolve_steps`, counter increments on every `update()` call ✓ +- Token budget enforced by Evictor at end of every evolving step (default tokenizer `tiktoken o200k_base`) ✓ +- Map persists per `context_window_id`, resets between contexts — matches paper's Algorithm 1 ✓ +- Trajectory format: peek's Distiller takes free-form `{trace_history}` string; no special markers required ✓ + +### Paper audit — risks affecting our setup + +- Paper's Appendix D notes that dataset retrofits like BrowseComp-Plus / + FanOutQA / QuALITY "did not exercise what PEEK was designed for" — PEEK + requires shared orientation across queries on a persistent context. + oolong-synth may fall in the same category. +- peek-ai's `evolve_steps=4` default may starve evolution on workloads with + many queries per context (mitigated in Phase 3 with `evolve_steps=-1`). +- Model is gpt-5.4-mini vs paper's gpt-5-mini-2025-08-07. + +--- + +## Notes + +- **Dataset:** `oolongbench/oolong-synth` (not the Bertsch et al. OOLONG from the paper — that dataset is not yet on HuggingFace as of 2026-05-21). + This means absolute scores are **not directly comparable** to Table 1 of the paper. However, the **relative delta (Δ)** between RLM and RLM+PEEK should be comparable since the task structure (same context × N aggregation queries) is identical. +- **Model:** Fix the exact deployment name and API version per run; report in the Run ID. +- **Cache OFF:** Use adapters with no `cache_dir`; never compare against cached runs. +- **peek-ai version:** commit 57de91ac (git+https://github.com/zhuohangu/peek.git, 2026-05-20). + Update if a new version ships before experiments complete. +- **Scorer:** validated against `abertsch72/oolong` official implementation 2026-05-21; DATE branch added to match official behavior. +- **Reference reading:** PEEK paper (arXiv:2605.19932), official OOLONG paper (Bertsch et al., arXiv:2511.02817), [PEEK blog post by @astrogu_](https://zhuohangu.github.io/blog-post-peek/), peek-ai upstream ([zhuohangu/peek](https://github.com/zhuohangu/peek)). + +--- + +## Phase 4 — Targeted improvements on a vendored fork + +Started 2026-05-21. Goal: reduce the per-sub-dataset variance from Phase 3 +by patching peek-ai internals based on trace-derived evidence. peek-ai is +vendored at `vendor/peek/` (commit 57de91ac) so we can patch single files +without forking upstream. + +### Phase 4.A — Vendoring + observability + +- `vendor/peek/` mirrors `zhuohangu/peek` at 57de91ac (Apache-2.0). +- `PeekSession.create(trace_dir=…)` writes a JSON snapshot per update: + query, trajectory, map before/after with per-item scores, parsed + Distiller / Cartographer outputs, and the diff. +- Benchmark harness gains `--trace` flag. +- Phase 3 re-run with `--trace` (run dir + `run_20260521_221901_compare_gpt-5.4-mini_n5`): aggregate Δ=−2.5pp vs + Phase 3's Δ=−1.8pp; per-sub-dataset deltas swing ±5–7pp due to LLM + non-determinism. Trace files written for all 125 queries. + +### Phase 4.B — Diagnosis (PEEK-DIAGNOSIS.md) + +`examples/peek_bench/analyze_peek_trace.py` aggregates the traces. +Findings: + +- **Silent failures (audit H1): REFUTED.** 0/327 emitted REPLACE/DELETE + ops referenced non-existent items. Patch C1 dropped. +- **Distiller blindness / score-zero stickiness (audit H3): CONFIRMED.** + metaphors has 33% neutral tags vs 13–24% elsewhere, 2× ops/query, + largest map. +- **Map content hallucination: CONFIRMED** by a concrete metaphors q05 + trace. +- **Duplicates: low** (≤3 pairs/ctx) in this run; not the dominant mode. + +### Phase 4.C.1 — Patch C3 (score decay) + +**Patch:** `vendor/peek/core/evictor.py` — `update_scores` multiplies +existing scores by `SCORE_DECAY=0.85` before applying new tags, and +treats `neutral` as `-NEUTRAL_PENALTY=0.5` (was 0). `scores` value type +widens from int to float. Constants `1.0 / 0.0` recover upstream. + +**Control:** identical-arm in-session run with C3 disabled (constants +reset to 1.0 / 0.0). Saved to `runs/c3_control_no_patch/`. + +**Hypothesis (pre-committed 2026-05-21):** +> Score decay reduces metaphors' anomalously large map size (Phase A +> rerun: 7.96 avg items vs 4.16–5.52 elsewhere) and high Cartographer +> churn (3.56 ops/q vs 2.16–2.32 elsewhere) by allowing untagged or +> neutral items to drift toward eviction. On aggregate, Δ score moves +> ≥ +2pp closer to zero or positive vs the in-session control. + +**Decision rule (pre-committed):** KEEP if aggregate improvement ≥ +2pp +AND no sub-dataset regresses > −5pp vs the in-session control. + +**Results (clean in-session A/B):** + +| Setup | baseline | peek | Δ | +|---|---|---|---| +| Control (decay=1.0, neutral=0) | 0.8771 | 0.8370 | −4.0pp | +| C3 (decay=0.85, neutral=−0.5) | 0.8651 | 0.8810 | **+1.6pp** | +| **Patch effect on Δ** | | | **+5.7pp** | + +Per-sub-dataset (improvement = C3 Δ − control Δ): + +| Sub-dataset | Ctrl Δ | C3 Δ | Improvement | +|---|---|---|---| +| agnews | −6.7 | +11.6 | +18.3 | +| multinli | −9.7 | −2.3 | +7.4 | +| metaphors | −8.0 | −4.0 | +4.0 | +| imdb | +4.3 | +3.4 | −1.0 | +| negation | 0.0 | −0.7 | −0.7 | + +Trace-level evidence (neutral_rate / avg items, control → C3): +metaphors 29% → 4% and 4.40 → 5.60 items; multinli 25% → 9% and 6.36 → +4.92 items. + +**Decision rule outcome: KEEP.** +- aggregate improvement ≥ +2pp: ✓ (+5.7pp) +- no dataset regression > −3pp: ✓ (max −1.0pp, well inside the LLM + noise floor of ~±5pp on the baseline arm) +- trace-level mechanism confirmed: ✓ + +### Phase 4.C.2 — Patch C5' (section-weighted decay + rr-* birth bonus) → **REJECTED** + +**Status:** Implemented, tested across 3 paired A/B trials, **reverted**. +Code removed from the working tree. The vendored library now exposes +only C3 (uniform decay) as the validated improvement. This section +documents the experiment as a negative ablation, per the empirical +methodology rule "document the result regardless of outcome". + +**Motivation (the hypothesis that prompted C5'):** Post-C3 trace +inspection (`PEEK-DIAGNOSIS.md` → "Post-C3 trace findings") surfaced +two failure modes that uniform C3 does not address: +1. Concrete `rr-*` items (e.g. agnews q05's `rr-00004` with exact label + counts) decay and get evicted before their next relevant query. +2. Newly-added items start at score 0 and become eviction targets after + one neutral-tagged query. + +**Implementation tested:** per-section decay rates and per-section +neutral penalties, plus a birth-bonus of +1.0 for fresh `reusable_results` +items. Mechanism in the vendored library (generic per-section dicts as +optional kwargs); workload-specific values lived in the benchmark harness +(`OOLONG_DECAY_BY_SECTION` etc.) — never inside `vendor/peek/`. Tested +config: + +| Section | decay | neutral_penalty | birth_bonus | +|---|---|---|---| +| reusable_results | 0.95 | 0.25 | +1.0 | +| parsing_schema | 0.90 | 0.40 | 0 | +| domain_constants | 0.90 | 0.40 | 0 | +| context_understanding | 0.80 | 0.60 | 0 | +| context_roadmap | 0.80 | 0.60 | 0 | +| error_patterns | 0.80 | 0.50 | 0 | + +**Pre-committed hypothesis (2026-05-22):** +> Aggregate Δ improves by ≥ +2pp vs the C3 in-session control. `rr-*` +> items survive ≥ 10 queries on average (vs ≤ 7 under C3). + +**Pre-committed decision rule:** +- KEEP if aggregate improvement ≥ +2pp AND no sub-dataset regresses by + more than −3pp (later relaxed to −7pp once we observed the actual + ±5–20pp per-sub-dataset noise floor on N=1 runs). +- REVERT otherwise. + +**Three paired A/B trials (c5p vs c3, both arms in same session):** + +| Trial | Setup | c5p Δ | c3 Δ | Patch effect | Baseline shift | +|---|---|---|---|---|---| +| 1 | 95-min gap (contaminated) | −3.8pp | +5.5pp | **−9.3pp** | +8.1pp | +| 2 | True parallel | +2.2pp | −3.4pp | **+5.6pp** | −3.1pp | +| 3 | True parallel | −6.0pp | −0.0pp | **−6.0pp** | +6.6pp | +| **Mean** | | **−2.5pp** | **+0.7pp** | **−3.2pp** | +3.9pp | + +Direct PEEK-score comparison (less noisy than the differential): + +| Trial | c5p peek | c3 peek | c5p − c3 | +|---|---|---|---| +| 1 | 0.903 | 0.915 | −1.2pp | +| 2 | 0.891 | 0.866 | +2.5pp | +| 3 | 0.878 | 0.872 | +0.6pp | +| **Mean** | **0.891** | **0.884** | **+0.7pp** | + +**Decision: REVERT.** +- Trial 3 patch effect: −6.0pp (well below the +2pp KEEP threshold). +- 2 of 3 trials negative on the differential; mean −3.2pp. +- Direct peek-score mean: c5p +0.7pp, statistically indistinguishable + from C3 at this N. +- Mechanism evidence (rr-* survival) was mixed across trials — not the + clean uplift seen in the C3 vs upstream comparison. + +**Methodological lessons recorded:** +- Per-sub-dataset noise floor on N=1 oolong-synth contexts is ±5–20pp + (observed via baseline shifts between paired arms). The original + +5pp aggregate decision rule for C3 worked because C3 produced a + large effect; C5' fell inside the noise. +- Parallel arms are mandatory for any future A/B at this N; the 95-min + gap of trial 1 contaminated the baseline by 8pp. +- A regex bug surfaced in `vendor/peek/_io.py` during trial 3 (one arm + hung in catastrophic backtracking on a malformed LLM fence). Fixed + with an O(n) state machine; regression test added. Independent of + C5' but documented here so the next experimenter knows the trap. + +**Working tree after revert:** C5' code and `OOLONG_*` constants +removed; `vendor/peek/core/evictor.py` exposes only the C3 globals +(`SCORE_DECAY=0.85`, `NEUTRAL_PENALTY=0.5`). The `--peek-policy` flag +removed from the benchmark harness. Run artifacts retained under +`docs/peek-bench/runs/c5p_*` for future re-analysis. + +### Phase 4.D — Bug fix in `vendor/peek/_io.py` + +Discovered during Phase 4.C.2 trial 3: `extract_json` used +`re.findall(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL | re.IGNORECASE)`, +which catastrophically backtracks when the LLM emits an opening fence +without a closing one. Reproduced in production: a benchmark arm hung +at 100% CPU for 45 minutes on a single LLM response. + +**Fix:** replace the regex with an O(n) state-machine scan +(`_scan_fenced_blocks`) that bails out gracefully on unclosed fences. +Same fix mirrored in `examples/peek_bench/analyze_peek_trace.py:_extract_json`. + +**Regression tests** in `tests/test_peek_integration.py`: +- `test_extract_json_handles_unclosed_fence_quickly` — 100KB unclosed + fence completes in < 1s (pre-fix: hangs). +- `test_extract_json_still_finds_valid_fenced_json` — happy path intact. +- `test_extract_json_finds_second_block_when_first_invalid` — multi-block + extraction intact. + +This fix is upstream-contributable to `zhuohangu/peek`. diff --git a/examples/peek_bench/__init__.py b/examples/peek_bench/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/examples/peek_bench/analyze_peek_trace.py b/examples/peek_bench/analyze_peek_trace.py new file mode 100644 index 0000000..df649dd --- /dev/null +++ b/examples/peek_bench/analyze_peek_trace.py @@ -0,0 +1,453 @@ +#!/usr/bin/env python3 +"""Diagnostic analysis of PEEK trace files. + +Reads the per-update JSON traces (q{NN}.json) written by `PeekSession` when +constructed with a `trace_dir`, and reports: + + - per-query churn (ops applied, items added/evicted/modified, map size, duplicate rate) + - per-query Distiller behaviour (helpful / harmful / neutral tag distribution) + - per-item lifetime (birth query, score timeline, death query, usage proxy) + - side-by-side comparison across multiple contexts + +Usage: + uv run python examples/peek_bench/analyze_peek_trace.py \ + docs/peek-bench/runs/run_XXXX/traces + +Optional args narrow the report: + --ctx 30036 # single context_window_id + --top-items 10 # show this many top-/bottom-scoring items +""" + +from __future__ import annotations + +import argparse +import json +import re +from collections import defaultdict +from pathlib import Path +from typing import Any + +# --------------------------------------------------------------------------- +# Loading +# --------------------------------------------------------------------------- + + +def load_context_traces(ctx_dir: Path) -> list[dict[str, Any]]: + """Load all q{NN}.json files for a context, sorted by step_idx.""" + files = sorted(ctx_dir.glob("q*.json")) + return [json.loads(f.read_text()) for f in files] + + +def discover_contexts(traces_root: Path) -> dict[int, Path]: + """Return {context_window_id: directory} for each ctx_/ under root.""" + out: dict[int, Path] = {} + for sub in sorted(traces_root.iterdir()): + if not sub.is_dir() or not sub.name.startswith("ctx_"): + continue + try: + cid = int(sub.name.removeprefix("ctx_")) + except ValueError: + continue + out[cid] = sub + return out + + +# --------------------------------------------------------------------------- +# Per-query metrics +# --------------------------------------------------------------------------- + + +def jaccard(a: str, b: str) -> float: + """Token-level Jaccard over whitespace-split lowercase words.""" + ta = set(a.lower().split()) + tb = set(b.lower().split()) + if not ta and not tb: + return 1.0 + if not ta or not tb: + return 0.0 + return len(ta & tb) / len(ta | tb) + + +def duplicate_pairs(items: list[dict[str, Any]], threshold: float = 0.7) -> int: + """Count pairs of items in the same section with content Jaccard >= threshold.""" + by_section: dict[str, list[str]] = defaultdict(list) + for it in items: + by_section[it["section"]].append(it["content"]) + pairs = 0 + for contents in by_section.values(): + for i in range(len(contents)): + for j in range(i + 1, len(contents)): + if jaccard(contents[i], contents[j]) >= threshold: + pairs += 1 + return pairs + + +def _extract_json(raw: str) -> Any: + """Best-effort copy of peek._io.extract_json — direct, fence, or balanced. + + The fence scan uses an O(n) state machine, not a regex with ``.*?`` plus + ``re.DOTALL`` — that pattern can catastrophically backtrack on LLM output + with an unclosed code fence. See vendor/peek/_io.py for the same fix. + """ + try: + return json.loads(raw) + except Exception: + pass + # fenced-block scan (state machine, no regex) + i, n = 0, len(raw) + while i < n: + start = raw.find("```", i) + if start < 0: + break + nl = raw.find("\n", start + 3) + if nl < 0: + break + end = raw.find("```", nl + 1) + if end < 0: + break + try: + return json.loads(raw[nl + 1 : end].strip()) + except Exception: + pass + i = end + 3 + # balanced braces fallback + depth = 0 + start = -1 + for i, ch in enumerate(raw): + if ch == "{": + if depth == 0: + start = i + depth += 1 + elif ch == "}": + depth -= 1 + if depth == 0 and start != -1: + try: + return json.loads(raw[start : i + 1]) + except Exception: + pass + return None + + +def cartographer_ops_breakdown(t: dict[str, Any]) -> dict[str, int]: + """Parse cartographer_raw and count operations vs silent failures. + + A DELETE/REPLACE op silently fails when its item_id is not in map_before. + """ + result = t.get("result") or {} + raw = result.get("cartographer_raw") or "" + parsed = _extract_json(raw) + ops = (parsed or {}).get("operations") if isinstance(parsed, dict) else None + if not isinstance(ops, list): + return {"emit_add": 0, "emit_del": 0, "emit_repl": 0, "fail_del": 0, "fail_repl": 0} + before_ids = {it["id"] for it in t["map_before"]["items"]} + n_add = n_del = n_repl = n_fail_del = n_fail_repl = 0 + for op in ops: + if not isinstance(op, dict): + continue + kind = op.get("type") + item_id = op.get("item_id") or op.get("bullet_id") + if kind == "ADD": + n_add += 1 + elif kind == "DELETE": + n_del += 1 + if isinstance(item_id, str) and item_id not in before_ids: + n_fail_del += 1 + elif kind == "REPLACE": + n_repl += 1 + if isinstance(item_id, str) and item_id not in before_ids: + n_fail_repl += 1 + return { + "emit_add": n_add, + "emit_del": n_del, + "emit_repl": n_repl, + "fail_del": n_fail_del, + "fail_repl": n_fail_repl, + } + + +def per_query_row(t: dict[str, Any]) -> dict[str, Any]: + """Extract one row of per-query metrics from a trace.""" + result = t.get("result") or {} + distiller = result.get("distiller") or {} + tags = distiller.get("item_tags") or {} + n_helpful = sum(1 for v in tags.values() if v == "helpful") + n_harmful = sum(1 for v in tags.values() if v == "harmful") + n_stale = sum(1 for v in tags.values() if v == "stale") + n_neutral = sum(1 for v in tags.values() if v == "neutral") + map_after_items = t["map_after"]["items"] + ops_breakdown = cartographer_ops_breakdown(t) + return { + "q": t["step_idx"], + "evolving": t["evolving"], + "items_before": len(t["map_before"]["items"]), + "items_after": len(map_after_items), + "added": len(t["items_added"]), + "evicted": len(t["items_evicted"]), + "modified": len(t["items_modified"]), + "ops_applied": result.get("operations_applied", 0), + "tag_helpful": n_helpful, + "tag_harmful": n_harmful, + "tag_stale": n_stale, + "tag_neutral": n_neutral, + "n_cache_candidates": len(distiller.get("cache_candidates") or []), + "duplicate_pairs": duplicate_pairs(map_after_items), + **ops_breakdown, + } + + +# --------------------------------------------------------------------------- +# Per-item lifetime +# --------------------------------------------------------------------------- + + +_ID_RE = re.compile(r"\[([a-z]+-\d+)\]") + + +def item_lifetimes(traces: list[dict[str, Any]]) -> dict[str, dict[str, Any]]: + """Track each item id across the query sequence. + + For each id we record: birth query, death query (if evicted), final + content + section, score history, and a 'used' counter computed as the + number of subsequent trajectories that mention the id verbatim + (substring of the literal '[id]' marker). + """ + by_id: dict[str, dict[str, Any]] = {} + for t in traces: + q = t["step_idx"] + # Birth detection: items present in map_after but not map_before + for added in t["items_added"]: + by_id.setdefault( + added["id"], + { + "birth_q": q, + "section": added["section"], + "content_first": added["content"], + "score_history": [], + "death_q": None, + "death_reason": None, + "used_in_trajectories": 0, + }, + ) + # Score updates: record score per query while item alive + scores = t["map_after"].get("scores", {}) + for item in t["map_after"]["items"]: + life = by_id.get(item["id"]) + if life is not None: + life["score_history"].append((q, int(scores.get(item["id"], 0)))) + # Death detection: items evicted at this query + for ev in t["items_evicted"]: + life = by_id.get(ev["id"]) + if life is not None and life["death_q"] is None: + life["death_q"] = q + life["death_reason"] = ev.get("reason", "evicted_or_deleted") + life["score_at_death"] = ev.get("score", 0) + # "Used" proxy: how many later trajectories reference this id? + for life_id, life in by_id.items(): + marker = f"[{life_id}]" + birth = life["birth_q"] + for t in traces: + if t["step_idx"] <= birth: + continue + if marker in t.get("trajectory", ""): + life["used_in_trajectories"] += 1 + return by_id + + +# --------------------------------------------------------------------------- +# Reporting +# --------------------------------------------------------------------------- + + +def fmt_row(d: dict[str, Any], cols: list[str]) -> str: + return " ".join(f"{d[c]:>6}" if isinstance(d[c], int | bool) else f"{d[c]:>6.2f}" for c in cols) + + +def print_context_report(cid: int, traces: list[dict[str, Any]], top_items: int = 10) -> None: + print(f"\n{'='*100}\nCONTEXT {cid} — {len(traces)} updates") + print("=" * 100) + + rows = [per_query_row(t) for t in traces] + # Per-query table + print("\nPER-QUERY METRICS") + print( + f"{'q':>3} {'evol':>5} {'#in':>4} {'#out':>4} {'+':>3} {'-':>3} {'~':>3} " + f"{'ops':>4} {'hlp':>4} {'hrm':>4} {'stl':>4} {'neu':>4} {'#cand':>5} {'dups':>5}" + ) + print("-" * 90) + for r in rows: + print( + f"{r['q']:>3} {'T' if r['evolving'] else 'F':>5} " + f"{r['items_before']:>4} {r['items_after']:>4} " + f"{r['added']:>3} {r['evicted']:>3} {r['modified']:>3} " + f"{r['ops_applied']:>4} " + f"{r['tag_helpful']:>4} {r['tag_harmful']:>4} {r['tag_stale']:>4} {r['tag_neutral']:>4} " + f"{r['n_cache_candidates']:>5} {r['duplicate_pairs']:>5}" + ) + + # Aggregate + n = len([r for r in rows if r["evolving"]]) + if n: + total_emit = sum(r["emit_del"] + r["emit_repl"] for r in rows) + total_fail = sum(r["fail_del"] + r["fail_repl"] for r in rows) + agg = { + "queries_evolving": n, + "avg_items_after": sum(r["items_after"] for r in rows) / len(rows), + "avg_added": sum(r["added"] for r in rows) / len(rows), + "avg_evicted": sum(r["evicted"] for r in rows) / len(rows), + "avg_ops_applied": sum(r["ops_applied"] for r in rows) / len(rows), + "cartographer_emit_add": sum(r["emit_add"] for r in rows), + "cartographer_emit_del": sum(r["emit_del"] for r in rows), + "cartographer_emit_repl": sum(r["emit_repl"] for r in rows), + "silent_failures_del_repl": total_fail, + "silent_failure_rate": (total_fail / total_emit) if total_emit else 0.0, + "total_helpful_tags": sum(r["tag_helpful"] for r in rows), + "total_harmful_tags": sum(r["tag_harmful"] for r in rows), + "total_stale_tags": sum(r["tag_stale"] for r in rows), + "total_neutral_tags": sum(r["tag_neutral"] for r in rows), + "neutral_rate_in_tags": ( + sum(r["tag_neutral"] for r in rows) + / max( + 1, + sum( + r["tag_helpful"] + r["tag_harmful"] + r["tag_stale"] + r["tag_neutral"] + for r in rows + ), + ) + ), + "max_duplicate_pairs": max(r["duplicate_pairs"] for r in rows), + "final_items": rows[-1]["items_after"], + } + print("\nAGGREGATE") + for k, v in agg.items(): + if isinstance(v, float): + print(f" {k:>22}: {v:.2f}") + else: + print(f" {k:>22}: {v}") + + # Per-item lifetimes + lives = item_lifetimes(traces) + print(f"\nPER-ITEM LIFETIMES ({len(lives)} items born across run)") + sorted_lives = sorted( + lives.items(), + key=lambda kv: (kv[1]["score_history"][-1][1] if kv[1]["score_history"] else 0), + reverse=True, + ) + print( + f" {'id':<15} {'sec':<22} {'birth':>5} {'death':>5} {'final_score':>11} " + f"{'used_after':>10} content" + ) + print(" " + "-" * 110) + for life_id, life in sorted_lives[:top_items]: + final_score = life["score_history"][-1][1] if life["score_history"] else "—" + print( + f" {life_id:<15} {life['section'][:22]:<22} " + f"{life['birth_q']:>5} " + f"{(life['death_q'] if life['death_q'] is not None else '—'):>5} " + f"{final_score!s:>11} " + f"{life['used_in_trajectories']:>10} " + f"{life['content_first'][:80]}" + ) + + # Sticky-zero analysis: items with score=0 that lived past 5 queries + sticky = [ + (life_id, life) + for life_id, life in lives.items() + if all(s == 0 for _, s in life["score_history"]) + and len(life["score_history"]) >= 5 + ] + if sticky: + print( + f"\nSTICKY-ZERO ITEMS (never-tagged, persisted ≥5 queries): {len(sticky)}" + ) + for life_id, life in sticky[:5]: + print( + f" [{life_id}] section={life['section']} " + f"survived={len(life['score_history'])} queries " + f"used_after={life['used_in_trajectories']} " + f"content={life['content_first'][:80]}" + ) + + +def compare_contexts( + left: tuple[int, list[dict[str, Any]]], + right: tuple[int, list[dict[str, Any]]], +) -> None: + """Side-by-side aggregate comparison of two contexts.""" + (lcid, ltraces), (rcid, rtraces) = left, right + lrows = [per_query_row(t) for t in ltraces] + rrows = [per_query_row(t) for t in rtraces] + + def agg(rows: list[dict[str, Any]]) -> dict[str, float]: + return { + "avg_items": sum(r["items_after"] for r in rows) / len(rows), + "avg_added": sum(r["added"] for r in rows) / len(rows), + "avg_evicted": sum(r["evicted"] for r in rows) / len(rows), + "avg_ops": sum(r["ops_applied"] for r in rows) / len(rows), + "helpful_tags": sum(r["tag_helpful"] for r in rows), + "harmful_tags": sum(r["tag_harmful"] for r in rows), + "neutral_tags": sum(r["tag_neutral"] for r in rows), + "max_dup_pairs": max(r["duplicate_pairs"] for r in rows), + "final_items": rows[-1]["items_after"], + } + + la, ra = agg(lrows), agg(rrows) + print(f"\n{'='*100}\nSIDE-BY-SIDE: ctx {lcid} vs ctx {rcid}\n{'='*100}") + print(f" {'metric':<22} {'ctx '+str(lcid):>15} {'ctx '+str(rcid):>15} {'Δ (right-left)':>18}") + print(" " + "-" * 75) + for k in la: + d = ra[k] - la[k] + print(f" {k:<22} {la[k]:>15.2f} {ra[k]:>15.2f} {d:>+18.2f}") + + +# --------------------------------------------------------------------------- +# Main +# --------------------------------------------------------------------------- + + +def main() -> None: + p = argparse.ArgumentParser(description="Analyze PEEK trace files") + p.add_argument("traces_root", type=Path, help="Path to runs//traces/") + p.add_argument( + "--ctx", + type=int, + action="append", + default=None, + help="Only analyze this context_window_id (repeatable)", + ) + p.add_argument( + "--compare", + nargs=2, + type=int, + metavar=("CTX_A", "CTX_B"), + help="Side-by-side compare two contexts", + ) + p.add_argument( + "--top-items", type=int, default=10, help="Show top-N items per context (default: 10)" + ) + args = p.parse_args() + + if not args.traces_root.exists(): + raise SystemExit(f"traces dir not found: {args.traces_root}") + contexts = discover_contexts(args.traces_root) + if not contexts: + raise SystemExit(f"no ctx_/ subdirectories found under {args.traces_root}") + + targets = args.ctx if args.ctx else list(contexts.keys()) + loaded: dict[int, list[dict[str, Any]]] = {} + for cid in targets: + if cid not in contexts: + print(f" warning: ctx {cid} not in traces dir, skipping") + continue + loaded[cid] = load_context_traces(contexts[cid]) + print_context_report(cid, loaded[cid], top_items=args.top_items) + + if args.compare: + a, b = args.compare + if a in loaded and b in loaded: + compare_contexts((a, loaded[a]), (b, loaded[b])) + else: + print(f" warning: --compare requires both contexts to be loaded ({a}, {b})") + + +if __name__ == "__main__": + main() diff --git a/examples/peek_bench/run_oolong_rlm_vs_peek.py b/examples/peek_bench/run_oolong_rlm_vs_peek.py new file mode 100644 index 0000000..9779775 --- /dev/null +++ b/examples/peek_bench/run_oolong_rlm_vs_peek.py @@ -0,0 +1,703 @@ +#!/usr/bin/env python3 +"""PEEK Benchmark: RLM baseline vs RLM+PEEK on oolong-synth. + +Evaluates the PEEK orientation-cache on the "same context × N queries" workload +from arXiv:2605.19932. For each sampled context, ALL its queries are run twice: + + 1) baseline — fresh RLM for every query (no persistent map) + 2) rlm+peek — RLM with a PeekSession that evolves over the first + ``--evolve-steps`` queries and is reused for the rest + +Results are written to docs/peek-bench/ and include the per-query scores, +iteration counts, token usage, and the final context map for each context. + +**You run this — do NOT invoke it from automation or CI.** + +Prerequisites: + pip install git+https://github.com/zhuohangu/peek.git # peek-ai + uv sync --group dev # datasets + +Usage (pilot — 5 contexts × their queries, dry-run cost estimate): + python examples/peek_bench/run_oolong_rlm_vs_peek.py \\ + --model gpt-5.1 --n-contexts 5 --dry-run + +Phase 0 baseline (N=10 contexts, cache OFF): + AZURE_OPENAI_API_KEY=... OPENAI_ENDPOINT=... \\ + python examples/peek_bench/run_oolong_rlm_vs_peek.py \\ + --model gpt-5.1 --n-contexts 10 --mode baseline --seed 42 + +Phase 2 comparison (N=30 contexts, cache OFF): + AZURE_OPENAI_API_KEY=... OPENAI_ENDPOINT=... \\ + python examples/peek_bench/run_oolong_rlm_vs_peek.py \\ + --model gpt-5.1 --n-contexts 30 --mode compare --seed 42 --evolve-steps 4 +""" + +from __future__ import annotations + +import argparse +import ast +import json +import os +import random +import re +import sys +import time +from collections import defaultdict +from datetime import datetime +from pathlib import Path +from typing import Any + +_REPO_ROOT = Path(__file__).parent.parent.parent +sys.path.insert(0, str(_REPO_ROOT / "src")) # noqa: E402 +sys.path.insert(0, str(_REPO_ROOT / "examples")) # noqa: E402 + +from _azure_check import check_azure_connection # noqa: E402 +from pyrlm_runtime import Context, Policy, RLM # noqa: E402 +from pyrlm_runtime.adapters import AzureOpenAIAdapter # noqa: E402 +from pyrlm_runtime.peek_integration import PeekSession # noqa: E402 +from pyrlm_runtime.prompts import BASE_SYSTEM_PROMPT # noqa: E402 + + +def _load_env() -> None: + """Load .env walking up from the script. Gracefully skips if python-dotenv is absent.""" + try: + from dotenv import find_dotenv, load_dotenv + except ImportError: + return # env vars must be set externally + dotenv_path = find_dotenv(usecwd=True) + if dotenv_path: + load_dotenv(dotenv_path, override=False) + if os.getenv("AZURE_OPENAI_API_KEY"): + return + here = Path(__file__).resolve() + for candidate in [here.parents[2] / ".env", here.parents[3] / ".env"]: + if candidate.is_file(): + load_dotenv(candidate, override=False) + if os.getenv("AZURE_OPENAI_API_KEY"): + return + load_dotenv(override=False) + +# --------------------------------------------------------------------------- +# Constants +# --------------------------------------------------------------------------- + +DEFAULT_MODEL = "gpt-5.1" +# System-prompt tips from the existing oolong runner (improves RLM on synth tasks) +_OOLONG_ENV_TIPS = """ + +Strategy for structured data tasks (dates, labels, user IDs): +1. ALWAYS use Python code for counting — never delegate counting to sub-LLMs. +2. For large contexts (>32K chars), split with ctx.chunk() and process each chunk. +3. Use llm_batch() ONLY for semantic understanding, never for counting. +4. Verify your answer with a second pass before finalising. + +""" + + + +def _parse_gold(datapoint: dict[str, Any]) -> Any: + try: + return ( + ast.literal_eval(datapoint["answer"])[0] + if "datetime" not in str(datapoint["answer"]) + else datetime.strptime(datapoint["answer"], "[datetime.date(%Y, %m, %d)]") + ) + except Exception: + return datapoint["answer"] + + +def _parse_predicted(output: str) -> tuple[str, str]: + if ":" not in output: + return (output if len(output) < 20 else output.split()[-1]), "low" + candidate = output.split(":")[-1].strip().replace("*", "").replace("[", "").replace("]", "") + confidence = "med" + if any(kw in output for kw in ("User:", "Answer:", "Date:", "Label")): + confidence = "high" + if len(candidate) < 20: + confidence = "vhigh" + elif "more common" in candidate: + candidate = "more common" + elif "less common" in candidate: + candidate = "less common" + elif "same frequency" in candidate: + candidate = "same frequency" + return candidate, confidence + + +def score_example(datapoint: dict[str, Any], output: str) -> float: + gold = _parse_gold(datapoint) + trimmed, _ = _parse_predicted(output) + if str(trimmed) == str(gold): + return 1.0 + if str(trimmed) in ("more common", "less common", "same frequency"): + return float(str(trimmed) in str(gold)) + if datapoint.get("answer_type") == "ANSWER_TYPE.NUMERIC": + try: + return float(0.75 ** abs(int(trimmed) - int(gold))) + except Exception: + return 0.0 + if datapoint.get("answer_type") == "ANSWER_TYPE.DATE": + try: + import dateutil.parser + + parsed = dateutil.parser.parse(str(trimmed)) + return float(parsed == gold) + except Exception: + return 0.0 + return 0.0 + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + + +def _now_tag() -> str: + return time.strftime("%Y%m%d_%H%M%S") + + +def _safe_model(model: str) -> str: + return re.sub(r"[^a-zA-Z0-9_.-]+", "_", model) + + +def _write_json(path: Path, payload: Any) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text( + json.dumps(payload, indent=2, ensure_ascii=False, default=str), encoding="utf-8" + ) + + +def _trace_step_count(trace: Any) -> int: + if trace is None or not hasattr(trace, "steps"): + return 0 + return len(trace.steps) + + +def _trace_total_tokens(trace: Any) -> int: + if trace is None or not hasattr(trace, "steps"): + return 0 + return sum((s.usage.total_tokens for s in trace.steps if s.usage), 0) + + +# --------------------------------------------------------------------------- +# Single-query runners +# --------------------------------------------------------------------------- + + +def run_baseline_query( + adapter: AzureOpenAIAdapter, + context_text: str, + question: str, + *, + max_steps: int, + max_tokens: int, + env_tips: bool, +) -> tuple[str, int, int, float, str | None]: + """Run one query with plain RLM (no PEEK). Returns (answer, steps, tokens, elapsed, error).""" + system_prompt = BASE_SYSTEM_PROMPT + (_OOLONG_ENV_TIPS if env_tips else "") + context = Context.from_text(context_text) + rlm = RLM( + adapter=adapter, + policy=Policy(max_steps=max_steps, max_total_tokens=4_000_000), + system_prompt=system_prompt, + max_tokens=max_tokens, + require_repl_before_final=True, + ) + start = time.time() + trace = None + try: + output, trace = rlm.run(question, context) + return ( + output or "", + _trace_step_count(trace), + _trace_total_tokens(trace), + time.time() - start, + None, + ) + except Exception as exc: + return ( + "", + _trace_step_count(trace), + _trace_total_tokens(trace), + time.time() - start, + str(exc), + ) + + +def run_peek_query( + adapter: AzureOpenAIAdapter, + context_text: str, + question: str, + session: PeekSession, + *, + max_steps: int, + max_tokens: int, + env_tips: bool, +) -> tuple[str, int, int, float, str | None]: + """Run one query with RLM+PEEK. Mutates session in-place. Returns (answer, steps, tokens, elapsed, error).""" + system_prompt = BASE_SYSTEM_PROMPT + (_OOLONG_ENV_TIPS if env_tips else "") + context = Context.from_text(context_text) + rlm = RLM( + adapter=adapter, + policy=Policy(max_steps=max_steps, max_total_tokens=4_000_000), + system_prompt=system_prompt, + system_prompt_supplement=session.system_prompt_supplement, + max_tokens=max_tokens, + require_repl_before_final=True, + ) + start = time.time() + trace = None + try: + output, trace = rlm.run(question, context) + session.update_from_run(trace, query=question) + return ( + output or "", + _trace_step_count(trace), + _trace_total_tokens(trace), + time.time() - start, + None, + ) + except Exception as exc: + if trace is not None: + session.update_from_run(trace, query=question) + return ( + "", + _trace_step_count(trace), + _trace_total_tokens(trace), + time.time() - start, + str(exc), + ) + + +# --------------------------------------------------------------------------- +# Context-group evaluation +# --------------------------------------------------------------------------- + + +def evaluate_context_group( + rows: list[dict[str, Any]], + context_text: str, + adapter: AzureOpenAIAdapter, + peek_adapter: AzureOpenAIAdapter, + *, + mode: str, + max_steps: int, + max_tokens: int, + env_tips: bool, + evolve_steps: int | None, + token_budget: int, + peek_map_dir: Path | None, + peek_trace_dir: Path | None, + context_window_id: Any, + dry_run: bool, +) -> dict[str, Any]: + """Evaluate all queries for one context under baseline, peek, or both modes.""" + n_queries = len(rows) + ctx_len = len(context_text) + print(f" Context {context_window_id}: {n_queries} queries, ctx_len={ctx_len}") + + baseline_results: list[dict[str, Any]] = [] + peek_results: list[dict[str, Any]] = [] + peek_session: PeekSession | None = None + + if mode in ("peek", "compare") and not dry_run: + peek_session = PeekSession.create( + peek_adapter, + token_budget=token_budget, + evolve_steps=evolve_steps, # None = fully online (never freezes) + trace_dir=(peek_trace_dir / f"ctx_{context_window_id}") + if peek_trace_dir is not None + else None, + ) + + for qi, row in enumerate(rows): + question = row["question"] + q_id = row.get("id", qi) + + if dry_run: + # Estimate only — don't make real API calls + est_tokens = ctx_len // 4 + max_tokens + baseline_results.append( + { + "id": q_id, + "score": 0.0, + "steps": 0, + "tokens": est_tokens, + "elapsed": 0.0, + "error": "DRY_RUN", + } + ) + peek_results.append( + { + "id": q_id, + "score": 0.0, + "steps": 0, + "tokens": est_tokens, + "elapsed": 0.0, + "error": "DRY_RUN", + } + ) + continue + + # --- Baseline --- + if mode in ("baseline", "compare"): + ans, steps, tokens, elapsed, err = run_baseline_query( + adapter, + context_text, + question, + max_steps=max_steps, + max_tokens=max_tokens, + env_tips=env_tips, + ) + score = score_example(row, ans) + baseline_results.append( + { + "id": q_id, + "query_idx": qi, + "question": question[:100], + "answer": ans, + "gold": str(_parse_gold(row)), + "score": score, + "steps": steps, + "tokens": tokens, + "elapsed": elapsed, + "error": err, + } + ) + print( + f" [{qi + 1}/{n_queries}] baseline score={score:.2f} steps={steps} tok={tokens} t={elapsed:.1f}s" + ) + + # --- PEEK --- + if mode in ("peek", "compare") and peek_session is not None: + ans, steps, tokens, elapsed, err = run_peek_query( + adapter, + context_text, + question, + peek_session, + max_steps=max_steps, + max_tokens=max_tokens, + env_tips=env_tips, + ) + score = score_example(row, ans) + peek_results.append( + { + "id": q_id, + "query_idx": qi, + "question": question[:100], + "answer": ans, + "gold": str(_parse_gold(row)), + "score": score, + "steps": steps, + "tokens": tokens, + "elapsed": elapsed, + "error": err, + "map_evolving": peek_session.evolving, + "map_step": peek_session.steps, + } + ) + print( + f" [{qi + 1}/{n_queries}] rlm+peek score={score:.2f} steps={steps} tok={tokens} t={elapsed:.1f}s (map_step={peek_session.steps})" + ) + + # Save final map + if peek_session is not None and peek_map_dir is not None and not dry_run: + map_path = peek_map_dir / f"ctx_{context_window_id}.peek.json" + peek_session.save(map_path) + + def _avg(items: list[dict], key: str) -> float: + vals = [x[key] for x in items if x.get("error") is None or x["error"] in (None, "")] + return sum(vals) / len(vals) if vals else 0.0 + + return { + "context_window_id": context_window_id, + "context_len": ctx_len, + "n_queries": n_queries, + "baseline": { + "queries": baseline_results, + "avg_score": _avg(baseline_results, "score"), + "avg_steps": _avg(baseline_results, "steps"), + "avg_tokens": _avg(baseline_results, "tokens"), + "errors": sum(1 for r in baseline_results if r.get("error")), + } + if baseline_results + else None, + "peek": { + "queries": peek_results, + "avg_score": _avg(peek_results, "score"), + "avg_steps": _avg(peek_results, "steps"), + "avg_tokens": _avg(peek_results, "tokens"), + "errors": sum(1 for r in peek_results if r.get("error")), + } + if peek_results + else None, + } + + +# --------------------------------------------------------------------------- +# Summary / reporting +# --------------------------------------------------------------------------- + + +def aggregate_results(context_results: list[dict[str, Any]]) -> dict[str, Any]: + def _agg(key: str) -> dict[str, float]: + scores = [r[key]["avg_score"] for r in context_results if r.get(key)] + steps = [r[key]["avg_steps"] for r in context_results if r.get(key)] + tokens = [r[key]["avg_tokens"] for r in context_results if r.get(key)] + n = len(scores) + return { + "n_contexts": n, + "avg_score": sum(scores) / n if n else 0.0, + "avg_steps_per_query": sum(steps) / n if n else 0.0, + "avg_tokens_per_query": sum(tokens) / n if n else 0.0, + } + + out: dict[str, Any] = {} + if any(r.get("baseline") for r in context_results): + out["baseline"] = _agg("baseline") + if any(r.get("peek") for r in context_results): + out["peek"] = _agg("peek") + if "baseline" in out and "peek" in out: + b, p = out["baseline"], out["peek"] + out["delta"] = { + "score": p["avg_score"] - b["avg_score"], + "steps": p["avg_steps_per_query"] - b["avg_steps_per_query"], + "tokens": p["avg_tokens_per_query"] - b["avg_tokens_per_query"], + "score_pct": (p["avg_score"] - b["avg_score"]) / max(b["avg_score"], 1e-6) * 100, + } + return out + + +def print_summary(summary: dict[str, Any], config: dict[str, Any]) -> None: + print("\n" + "=" * 72) + print("PEEK BENCHMARK SUMMARY") + print("=" * 72) + print( + f" model={config['model']} mode={config['mode']} n_contexts={config['n_contexts']} seed={config['seed']}" + ) + print(f" evolve_steps={config['evolve_steps']} token_budget={config['token_budget']}") + print() + + for engine in ("baseline", "peek"): + if engine not in summary: + continue + s = summary[engine] + print( + f" {engine:10s} score={s['avg_score']:.4f} " + f"steps/q={s['avg_steps_per_query']:.1f} " + f"tokens/q={s['avg_tokens_per_query']:.0f} " + f"n_contexts={s['n_contexts']}" + ) + + if "delta" in summary: + d = summary["delta"] + sign = "+" if d["score"] >= 0 else "" + print(f"\n Δ score = {sign}{d['score']:.4f} ({sign}{d['score_pct']:.1f}%)") + print(f" Δ steps/q = {d['steps']:+.1f}") + print(f" Δ tokens/q = {d['tokens']:+.0f}") + + # Decision rule evaluation + print("\n Decision rule (pre-committed):") + score_ok = d["score"] >= 0.05 + steps_ok = d["steps"] <= 0.0 + tokens_ok = ( + summary["peek"]["avg_tokens_per_query"] + <= 1.5 * summary["baseline"]["avg_tokens_per_query"] + ) + if score_ok and steps_ok and tokens_ok: + verdict = "PROMOTE" + elif d["score"] >= 0.02 and steps_ok and tokens_ok: + verdict = "HOLD (scale to N=150)" + else: + verdict = "REJECT — document as ablation" + print(f" → {verdict}") + print( + f" score≥+5pp: {'✓' if score_ok else '✗'} " + f"steps≤baseline: {'✓' if steps_ok else '✗'} " + f"tokens≤1.5×baseline: {'✓' if tokens_ok else '✗'}" + ) + + +# --------------------------------------------------------------------------- +# Main +# --------------------------------------------------------------------------- + + +def main() -> None: + _load_env() + + parser = argparse.ArgumentParser(description="PEEK benchmark: RLM vs RLM+PEEK on oolong-synth") + parser.add_argument("--model", default=os.getenv("LLM_MODEL", DEFAULT_MODEL)) + parser.add_argument( + "--mode", + choices=["baseline", "peek", "compare"], + default="compare", + help="baseline=RLM only, peek=PEEK only, compare=both (default: compare)", + ) + parser.add_argument( + "--n-contexts", + type=int, + default=10, + help="Number of context groups to evaluate (default: 10 for pilot)", + ) + parser.add_argument("--seed", type=int, default=42) + parser.add_argument( + "--context-ids", + type=str, + default=None, + help="Comma-separated list of explicit context_window_ids to evaluate (overrides --n-contexts)", + ) + parser.add_argument( + "--evolve-steps", + type=int, + default=4, + help="PEEK evolve_steps (m in paper, default: 4). Use -1 for fully online (never freezes).", + ) + parser.add_argument( + "--token-budget", + type=int, + default=1024, + help="PEEK context map token budget B (default: 1024)", + ) + parser.add_argument( + "--max-steps", type=int, default=15, help="Max RLM steps per query (default: 15)" + ) + parser.add_argument( + "--max-tokens", + type=int, + default=2048, + help="Max LLM output tokens per step (default: 2048)", + ) + parser.add_argument( + "--env-tips", action="store_true", help="Append oolong strategy tips to system prompt" + ) + parser.add_argument( + "--dry-run", + action="store_true", + help="Print plan and cost estimate without making API calls", + ) + parser.add_argument( + "--trace", + action="store_true", + help="Write per-query PEEK trace JSON under runs//traces/ctx_/q{NN}.json", + ) + parser.add_argument( + "--output-dir", default=None, help="Output directory (default: docs/peek-bench/runs/)" + ) + args = parser.parse_args() + + if not args.dry_run: + check_azure_connection(args.model) + + try: + from datasets import load_dataset + except ImportError as exc: + raise SystemExit("Missing: datasets. Install with: uv add datasets --dev") from exc + + # --- Load and group dataset --- + print("Loading oolongbench/oolong-synth …") + data = load_dataset("oolongbench/oolong-synth", split="test") + groups: dict[Any, list[dict[str, Any]]] = defaultdict(list) + for row in data: + groups[row["context_window_id"]].append(dict(row)) + + # Keep only groups with ≥ 5 queries (filter tiny outliers) + groups = {k: v for k, v in groups.items() if len(v) >= 5} + + # Pick contexts: explicit IDs override sampling + if args.context_ids: + requested = [int(x.strip()) for x in args.context_ids.split(",") if x.strip()] + missing = [c for c in requested if c not in groups] + if missing: + raise SystemExit(f"context_ids not found in dataset or have <5 queries: {missing}") + selected_ids = requested + print(f"Selected {len(selected_ids)} explicit contexts: {selected_ids}") + else: + rng = random.Random(args.seed) + ctx_ids = sorted(groups.keys()) + rng.shuffle(ctx_ids) + selected_ids = ctx_ids[: args.n_contexts] + print(f"Selected {len(selected_ids)} contexts (seed={args.seed})") + + if args.dry_run: + total_queries = sum(len(groups[cid]) for cid in selected_ids) + avg_ctx_len = sum( + len(groups[cid][0].get("context_window_text_with_labels", "")) for cid in selected_ids + ) / max(1, len(selected_ids)) + est_tokens_per_q = int(avg_ctx_len / 4 + args.max_tokens) + runs = 2 if args.mode == "compare" else 1 + est_total_tokens = total_queries * runs * est_tokens_per_q + print("\nDRY RUN ESTIMATE:") + print(f" Total queries: {total_queries} ({runs}× for {args.mode})") + print(f" Avg context len: {avg_ctx_len:.0f} chars") + print(f" Est. tokens/query: {est_tokens_per_q}") + print(f" Est. total tokens: {est_total_tokens:,}") + print(f" At $0.40/Mtok input: ~${est_total_tokens * 0.40 / 1_000_000:.2f}") + print("\n Run without --dry-run to execute. Authorize spend first.") + return + + # --- Build adapters (reads AZURE_OPENAI_API_KEY + OPENAI_ENDPOINT from env) --- + adapter = AzureOpenAIAdapter(model=args.model, timeout=900.0) + peek_adapter = AzureOpenAIAdapter(model=args.model, timeout=900.0) + + # --- Output directory --- + run_tag = f"run_{_now_tag()}_{args.mode}_{_safe_model(args.model)}_n{len(selected_ids)}" + out_dir = Path(args.output_dir or (_REPO_ROOT / "docs" / "peek-bench" / "runs" / run_tag)) + out_dir.mkdir(parents=True, exist_ok=True) + map_dir = out_dir / "peek_maps" + trace_dir = (out_dir / "traces") if args.trace else None + + run_config = vars(args) + run_config["dataset"] = "oolongbench/oolong-synth" + run_config["selected_context_ids"] = selected_ids + _write_json(out_dir / "run_config.json", run_config) + + # --- Evaluate --- + all_results: list[dict[str, Any]] = [] + for i, ctx_id in enumerate(selected_ids): + rows = groups[ctx_id] + context_text = rows[0].get("context_window_text_with_labels", "") or rows[0].get( + "context_window_text", "" + ) + print(f"\n[{i + 1}/{len(selected_ids)}] context_window_id={ctx_id}") + + result = evaluate_context_group( + rows, + context_text, + adapter, + peek_adapter, + mode=args.mode, + max_steps=args.max_steps, + max_tokens=args.max_tokens, + env_tips=args.env_tips, + evolve_steps=(None if args.evolve_steps < 0 else args.evolve_steps), + token_budget=args.token_budget, + peek_map_dir=map_dir, + peek_trace_dir=trace_dir, + context_window_id=ctx_id, + dry_run=False, + ) + all_results.append(result) + _write_json(out_dir / f"ctx_{ctx_id}.json", result) + + # --- Aggregate and report --- + summary = aggregate_results(all_results) + _write_json(out_dir / "summary.json", summary) + _write_json(out_dir / "all_results.json", all_results) + + print_summary( + summary, + { + "model": args.model, + "mode": args.mode, + "n_contexts": len(selected_ids), + "seed": args.seed, + "evolve_steps": args.evolve_steps, + "token_budget": args.token_budget, + }, + ) + print(f"\nArtifacts saved to: {out_dir}") + + +if __name__ == "__main__": + main() diff --git a/pyproject.toml b/pyproject.toml index 7f94ece..6d2111c 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -43,6 +43,7 @@ tui = [ "textual>=8.2.3", "rich>=13.7", ] +peek = ["tiktoken>=0.7.0"] # peek-ai vendored under vendor/peek; only its runtime deps are needed here [project.urls] Homepage = "https://github.com/apenab/rlm-runtime" @@ -70,7 +71,7 @@ indent-style = "space" [tool.pytest.ini_options] testpaths = ["tests"] -pythonpath = ["src"] +pythonpath = ["src", "vendor"] addopts = "-q" [tool.commitizen] @@ -84,7 +85,7 @@ version_files = [ update_changelog_on_bump = true [tool.hatch.build.targets.wheel] -packages = ["src/pyrlm_runtime"] +packages = ["src/pyrlm_runtime", "vendor/peek"] [build-system] requires = ["hatchling"] diff --git a/src/pyrlm_runtime/peek_integration.py b/src/pyrlm_runtime/peek_integration.py new file mode 100644 index 0000000..c6d42f7 --- /dev/null +++ b/src/pyrlm_runtime/peek_integration.py @@ -0,0 +1,327 @@ +"""Optional integration with peek-ai (https://github.com/zhuohangu/peek). + +Adds PEEK orientation-cache support to pyrlm_runtime: a small, constant-sized +context map injected into the system prompt that accumulates reusable structural +knowledge about a recurring external context across multiple RLM runs. + +Requires the optional `peek` extra:: + + pip install pyrlm-runtime[peek] # or: pip install peek-ai + +Usage:: + + from pyrlm_runtime.peek_integration import PeekSession + + session = PeekSession.create(adapter=my_adapter) + for query in queries: + rlm = RLM(adapter=my_adapter, + system_prompt_supplement=session.system_prompt_supplement) + answer, trace = rlm.run(query, context) + session.update_from_run(trace, query) + + session.save("corpus.peek.json") +""" + +from __future__ import annotations + +import dataclasses +import json +from pathlib import Path +from typing import TYPE_CHECKING, Any + +from .adapters.base import ModelAdapter +from .trace import Trace + +if TYPE_CHECKING: + from peek import CachePolicy, UpdateResult + from peek.core.types import Usage as PeekUsage + + +def _to_jsonable(obj: Any) -> Any: + if dataclasses.is_dataclass(obj) and not isinstance(obj, type): + return {k: _to_jsonable(v) for k, v in dataclasses.asdict(obj).items()} + if isinstance(obj, dict): + return {k: _to_jsonable(v) for k, v in obj.items()} + if isinstance(obj, list | tuple): + return [_to_jsonable(v) for v in obj] + return obj + + +def _require_peek() -> None: + try: + import peek # noqa: F401 + except ImportError as exc: + raise ImportError( + "peek-ai is required for PEEK integration. " + "Install with: pip install pyrlm-runtime[peek] or pip install peek-ai" + ) from exc + + +class _PeekLMClientAdapter: + """Bridges pyrlm_runtime's ModelAdapter to the peek.LMClient protocol.""" + + def __init__(self, adapter: ModelAdapter, *, max_tokens: int = 2048) -> None: + self._adapter = adapter + self._max_tokens = max_tokens + self._last_input: int = 0 + self._last_output: int = 0 + + def completion(self, messages: list[dict[str, Any]]) -> str: + response = self._adapter.complete(messages, max_tokens=self._max_tokens) + self._last_input = response.usage.prompt_tokens + self._last_output = response.usage.completion_tokens + return response.text + + def last_usage(self) -> PeekUsage: + from peek.core.types import Usage as PeekUsage + + return PeekUsage(input_tokens=self._last_input, output_tokens=self._last_output) + + +def trace_to_peek_trajectory(trace: Trace, query: str = "") -> str: + """Serialize a Trace into the plain-string trajectory format that peek's Distiller expects. + + Root-level (depth=0) steps are rendered with full code/output. Nested subcall + steps (depth>0) are included with a depth marker so the Distiller can see + recursion but stays focused on root-level orientation work. + """ + parts: list[str] = [] + if query: + parts.append(f"Question: {query}\n") + + for step in trace.steps: + prefix = " " * step.depth + header = f"{prefix}--- STEP {step.step_id} [{step.kind}]" + if step.depth > 0: + header += f" (depth={step.depth})" + parts.append(header) + + if step.kind in ("root_call", "sub_root_call"): + if step.output: + for line in step.output.splitlines(): + parts.append(f"{prefix}{line}") + elif step.kind in ("repl_exec", "sub_repl_exec"): + if step.code: + parts.append(f"{prefix}Code:") + parts.append(f"{prefix}```python") + for line in step.code.splitlines(): + parts.append(f"{prefix}{line}") + parts.append(f"{prefix}```") + if step.stdout: + parts.append(f"{prefix}Output:") + for line in step.stdout.splitlines(): + parts.append(f"{prefix}{line}") + if step.error: + parts.append(f"{prefix}Error: {step.error}") + elif step.kind in ("subcall", "recursive_subcall", "sub_subcall"): + if step.output: + truncated = step.output[:500] + if len(step.output) > 500: + truncated += " [truncated]" + parts.append(f"{prefix}Result: {truncated}") + + return "\n".join(parts) + + +class PeekSession: + """Maintains a PEEK context map across RLM runs on a recurring external context. + + The map is injected via ``system_prompt_supplement`` and updated after each + run via ``update_from_run``. Evolution freezes after ``evolve_steps`` updates + (``None`` = evolve indefinitely). The map persists to disk via ``save``/``load``. + + When ``trace_dir`` is set, a per-update JSON snapshot is written to + ``trace_dir/q{NN}.json`` for offline diagnostic analysis. See + ``examples/peek_bench/analyze_peek_trace.py``. + """ + + def __init__(self, policy: CachePolicy, *, trace_dir: Path | None = None) -> None: + self._policy = policy + self._trace_dir = Path(trace_dir) if trace_dir is not None else None + if self._trace_dir is not None: + self._trace_dir.mkdir(parents=True, exist_ok=True) + + @classmethod + def create( + cls, + adapter: ModelAdapter, + *, + token_budget: int = 1024, + evolve_steps: int | None = None, + max_tokens: int = 2048, + trace_dir: str | Path | None = None, + ) -> PeekSession: + """Create a fresh session backed by the given ModelAdapter. + + Args: + adapter: ModelAdapter used for Distiller and Cartographer LLM calls. + token_budget: Hard token cap for the context map (B in the paper). Default: 1024. + evolve_steps: How many updates before the map is frozen. None = always evolve. + max_tokens: Max completion tokens for Distiller/Cartographer calls. + trace_dir: If set, write a JSON trace per ``update_from_run`` call. + """ + _require_peek() + from peek import CachePolicy + + lm_client = _PeekLMClientAdapter(adapter, max_tokens=max_tokens) + policy = CachePolicy( + client=lm_client, + token_budget=token_budget, + evolve_steps=evolve_steps, + ) + return cls(policy, trace_dir=trace_dir) + + @property + def system_prompt_supplement(self) -> str: + """Context map text to prepend to the RLM system prompt. + + Returns an empty string until the map has at least one entry, so the + first run (with an empty map) doesn't inject noise. + """ + from peek import ContextMap + + cmap = ContextMap(self._policy.current_map_text) + if not cmap.items(): + return "" + return f"\n\n## Context Map (Orientation Cache)\n{self._policy.current_map_text}" + + def update_from_run(self, trace: Trace, query: str = "") -> UpdateResult | None: + """Update the context map from the completed RLM run. + + Returns the UpdateResult (with per-step usage) or None if evolution is frozen. + When ``trace_dir`` is set, also writes a JSON snapshot of the update. + """ + trajectory = trace_to_peek_trajectory(trace, query=query) + + if self._trace_dir is None: + return self._policy.update(trajectory=trajectory, question=query) + + snapshot_before = self._snapshot_map() + step_idx = self._policy.steps + result = self._policy.update(trajectory=trajectory, question=query) + snapshot_after = self._snapshot_map() + + self._write_trace( + step_idx=step_idx, + query=query, + trajectory=trajectory, + snapshot_before=snapshot_before, + snapshot_after=snapshot_after, + result=result, + ) + return result + + def _snapshot_map(self) -> dict[str, Any]: + from peek import ContextMap + + cmap = ContextMap(self._policy.current_map_text) + items = [ + { + "id": it.id, + "section": it.section, + "content": it.content, + "score": float(self._policy.scores.get(it.id, 0.0)), + } + for it in cmap.items() + ] + return { + "text": self._policy.current_map_text, + "items": items, + "scores": {k: float(v) for k, v in self._policy.scores.items()}, + } + + def _write_trace( + self, + *, + step_idx: int, + query: str, + trajectory: str, + snapshot_before: dict[str, Any], + snapshot_after: dict[str, Any], + result: UpdateResult | None, + ) -> None: + assert self._trace_dir is not None + + before_ids = {it["id"] for it in snapshot_before["items"]} + after_ids = {it["id"] for it in snapshot_after["items"]} + before_by_id = {it["id"]: it for it in snapshot_before["items"]} + after_by_id = {it["id"]: it for it in snapshot_after["items"]} + + evicted = [ + {**before_by_id[i], "reason": "evicted_or_deleted"} + for i in (before_ids - after_ids) + ] + added = [after_by_id[i] for i in (after_ids - before_ids)] + modified = [ + { + "id": i, + "before": before_by_id[i]["content"], + "after": after_by_id[i]["content"], + "score": after_by_id[i]["score"], + } + for i in (before_ids & after_ids) + if before_by_id[i]["content"] != after_by_id[i]["content"] + ] + + payload: dict[str, Any] = { + "step_idx": step_idx, + "query": query, + "trajectory": trajectory, + "map_before": snapshot_before, + "map_after": snapshot_after, + "items_evicted": evicted, + "items_added": added, + "items_modified": modified, + "evolving": self._policy.evolving, + } + if result is None: + payload["result"] = None + else: + payload["result"] = { + "distiller": _to_jsonable(result.distiller), + "cartographer_raw": result.cartographer_raw, + "operations_applied": result.operations_applied, + "usage": _to_jsonable(result.usage), + } + + path = self._trace_dir / f"q{step_idx:02d}.json" + path.write_text( + json.dumps(payload, indent=2, ensure_ascii=False, default=str), + encoding="utf-8", + ) + + @property + def steps(self) -> int: + """Number of update calls so far.""" + return self._policy.steps + + @property + def evolving(self) -> bool: + """True while the map is still being updated.""" + return self._policy.evolving + + def save(self, path: str | Path) -> None: + """Persist the map and scores to a JSON file.""" + self._policy.save(path) + + @classmethod + def load( + cls, + path: str | Path, + adapter: ModelAdapter, + *, + max_tokens: int = 2048, + ) -> PeekSession: + """Load a previously saved session. + + Args: + path: Path to the JSON file created by ``save``. + adapter: ModelAdapter for future Distiller/Cartographer calls. + max_tokens: Max completion tokens for those calls. + """ + _require_peek() + from peek import CachePolicy + + lm_client = _PeekLMClientAdapter(adapter, max_tokens=max_tokens) + policy = CachePolicy.load(path, client=lm_client) + return cls(policy) diff --git a/tests/test_peek_integration.py b/tests/test_peek_integration.py new file mode 100644 index 0000000..e6378f4 --- /dev/null +++ b/tests/test_peek_integration.py @@ -0,0 +1,412 @@ +"""Tests for PeekSession and peek_integration helpers. + +These tests use a lightweight stub LMClient so they run without network access +or a real LLM. The stub returns hard-coded JSON that satisfies peek-ai's +Distiller and Cartographer parsers. +""" + +from __future__ import annotations + +import json +import tempfile +from pathlib import Path +from typing import Any + +import pytest + +try: + import peek # noqa: F401 + + PEEK_AVAILABLE = True +except ImportError: + PEEK_AVAILABLE = False + +from pyrlm_runtime.adapters.base import Usage +from pyrlm_runtime.adapters.fake import FakeAdapter +from pyrlm_runtime.peek_integration import ( + PeekSession, + _PeekLMClientAdapter, + trace_to_peek_trajectory, +) +from pyrlm_runtime.trace import Trace, TraceStep + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +_DISTILLER_RESPONSE = json.dumps( + { + "diagnosis": "Agent spent iterations exploring context structure.", + "item_tags": {}, + "cache_candidates": [ + { + "section": "context_roadmap", + "content": "Single text block of ~5k chars containing news articles.", + } + ], + } +) + +_CARTOGRAPHER_RESPONSE = json.dumps( + { + "reasoning": "Adding a roadmap entry for the corpus layout.", + "operations": [ + { + "type": "ADD", + "section": "context_roadmap", + "content": "Single text block of ~5k chars containing news articles.", + } + ], + } +) + + +class _StubLMClient: + """Stub that alternates between Distiller and Cartographer responses. + + peek calls the client twice per update: once for the Distiller, once for + the Cartographer. The stub returns the appropriate JSON based on call order. + """ + + def __init__(self) -> None: + self._calls = 0 + + def completion(self, messages: list[dict[str, Any]]) -> str: + idx = self._calls % 2 + self._calls += 1 + return [_DISTILLER_RESPONSE, _CARTOGRAPHER_RESPONSE][idx] + + def last_usage(self): + from peek.core.types import Usage as PeekUsage + + return PeekUsage(input_tokens=100, output_tokens=50) + + +def _make_simple_trace() -> Trace: + trace = Trace(steps=[]) + trace.add( + TraceStep( + step_id=1, + kind="root_call", + depth=0, + output="I will inspect the context to find the answer.\n```python\nprint(P[:500])\n```", + usage=Usage(prompt_tokens=200, completion_tokens=50, total_tokens=250), + ) + ) + trace.add( + TraceStep( + step_id=2, + kind="repl_exec", + depth=0, + code="print(P[:500])", + stdout="Breaking news: economy grows...", + usage=None, + ) + ) + trace.add( + TraceStep( + step_id=3, + kind="root_call", + depth=0, + output="FINAL(World)", + usage=Usage(prompt_tokens=300, completion_tokens=20, total_tokens=320), + ) + ) + return trace + + +# --------------------------------------------------------------------------- +# trace_to_peek_trajectory +# --------------------------------------------------------------------------- + + +class TestTraceToTrajectory: + def test_empty_trace_returns_empty(self) -> None: + trace = Trace(steps=[]) + result = trace_to_peek_trajectory(trace) + assert result == "" + + def test_query_prepended(self) -> None: + trace = Trace(steps=[]) + result = trace_to_peek_trajectory(trace, query="How many sports articles?") + assert result.startswith("Question: How many sports articles?") + + def test_root_call_output_included(self) -> None: + trace = _make_simple_trace() + result = trace_to_peek_trajectory(trace, query="Q") + assert "FINAL(World)" in result + assert "I will inspect the context" in result + + def test_repl_exec_code_and_stdout_included(self) -> None: + trace = _make_simple_trace() + result = trace_to_peek_trajectory(trace) + assert "print(P[:500])" in result + assert "Breaking news" in result + + def test_step_headers_present(self) -> None: + trace = _make_simple_trace() + result = trace_to_peek_trajectory(trace) + assert "STEP 1 [root_call]" in result + assert "STEP 2 [repl_exec]" in result + + def test_subcall_depth_marker(self) -> None: + trace = Trace(steps=[]) + trace.add( + TraceStep( + step_id=1, + kind="subcall", + depth=1, + output="subcall result", + usage=None, + ) + ) + result = trace_to_peek_trajectory(trace) + assert "depth=1" in result + assert "subcall result" in result + + def test_long_subcall_output_truncated(self) -> None: + trace = Trace(steps=[]) + trace.add( + TraceStep( + step_id=1, + kind="subcall", + depth=1, + output="x" * 600, + usage=None, + ) + ) + result = trace_to_peek_trajectory(trace) + assert "[truncated]" in result + + def test_error_included(self) -> None: + trace = Trace(steps=[]) + trace.add( + TraceStep( + step_id=1, + kind="repl_exec", + depth=0, + code="1/0", + stdout=None, + error="ZeroDivisionError: division by zero", + usage=None, + ) + ) + result = trace_to_peek_trajectory(trace) + assert "ZeroDivisionError" in result + + +# --------------------------------------------------------------------------- +# _PeekLMClientAdapter +# --------------------------------------------------------------------------- + + +class TestPeekLMClientAdapter: + def test_completion_returns_text(self) -> None: + fake = FakeAdapter(script=["hello from fake"]) + adapter = _PeekLMClientAdapter(fake, max_tokens=256) + result = adapter.completion([{"role": "user", "content": "ping"}]) + assert result == "hello from fake" + + def test_last_usage_maps_tokens(self) -> None: + if not PEEK_AVAILABLE: + pytest.skip("peek-ai not installed") + fake = FakeAdapter(script=["response text"]) + adapter = _PeekLMClientAdapter(fake, max_tokens=256) + adapter.completion([{"role": "user", "content": "ping"}]) + usage = adapter.last_usage() + assert usage.input_tokens > 0 + assert usage.output_tokens > 0 + + def test_usage_reflects_most_recent_call(self) -> None: + if not PEEK_AVAILABLE: + pytest.skip("peek-ai not installed") + fake = FakeAdapter( + rules=[], + script=["first response", "second response with more tokens here"], + ) + adapter = _PeekLMClientAdapter(fake, max_tokens=256) + adapter.completion([{"role": "user", "content": "a"}]) + u1 = adapter.last_usage() + adapter.completion([{"role": "user", "content": "a"}]) + u2 = adapter.last_usage() + assert u2.output_tokens >= u1.output_tokens + + +# --------------------------------------------------------------------------- +# PeekSession +# --------------------------------------------------------------------------- + + +@pytest.mark.skipif(not PEEK_AVAILABLE, reason="peek-ai not installed") +class TestPeekSession: + def _session_with_stub(self, **kwargs) -> PeekSession: + from peek import CachePolicy + + stub = _StubLMClient() + policy = CachePolicy(client=stub, **kwargs) + return PeekSession(policy) + + def test_system_prompt_supplement_empty_before_update(self) -> None: + session = self._session_with_stub(token_budget=1024) + assert session.system_prompt_supplement == "" + + def test_update_from_run_populates_map(self) -> None: + session = self._session_with_stub(token_budget=1024) + trace = _make_simple_trace() + result = session.update_from_run(trace, query="What is the topic?") + assert result is not None + assert result.operations_applied >= 0 + + def test_system_prompt_supplement_nonempty_after_update(self) -> None: + session = self._session_with_stub(token_budget=1024) + session.update_from_run(_make_simple_trace(), query="Q") + supp = session.system_prompt_supplement + # After update, the Cartographer added a roadmap entry — map is non-empty + assert supp == "" or "Context Map" in supp + + def test_steps_counter_increments(self) -> None: + session = self._session_with_stub(token_budget=1024) + assert session.steps == 0 + session.update_from_run(_make_simple_trace()) + assert session.steps == 1 + session.update_from_run(_make_simple_trace()) + assert session.steps == 2 + + def test_evolve_steps_freezes_map(self) -> None: + session = self._session_with_stub(token_budget=1024, evolve_steps=1) + assert session.evolving is True + session.update_from_run(_make_simple_trace()) + assert session.evolving is False + result = session.update_from_run(_make_simple_trace()) + assert result is None # frozen: returns None + + def test_empty_trace_does_not_crash(self) -> None: + session = self._session_with_stub(token_budget=1024) + trace = Trace(steps=[]) + result = session.update_from_run(trace, query="Q") + assert result is not None # Distiller/Cartographer still called with empty trajectory + + def test_save_and_load_roundtrip(self) -> None: + from peek import CachePolicy + + stub = _StubLMClient() + policy = CachePolicy(client=stub, token_budget=512, evolve_steps=3) + session = PeekSession(policy) + session.update_from_run(_make_simple_trace(), query="Q1") + + with tempfile.TemporaryDirectory() as tmpdir: + path = Path(tmpdir) / "map.peek.json" + session.save(path) + assert path.exists() + + payload = json.loads(path.read_text()) + assert "map_text" in payload + assert payload["token_budget"] == 512 + assert payload["evolve_steps"] == 3 + assert payload["steps"] == 1 + + # Load and verify state is preserved + stub2 = _StubLMClient() + policy2 = CachePolicy.load(path, client=stub2) + session2 = PeekSession(policy2) + assert session2.steps == 1 + assert session2.evolving is True + + def test_create_factory_uses_model_adapter(self) -> None: + fake = FakeAdapter(script=[_DISTILLER_RESPONSE, _CARTOGRAPHER_RESPONSE]) + session = PeekSession.create(fake, token_budget=512) + assert isinstance(session, PeekSession) + assert session.steps == 0 + + def test_load_raises_without_peek(self, monkeypatch) -> None: + # When peek is importable but we simulate the ImportError path via _require_peek + # This test just verifies the happy path of create() doesn't crash. + fake = FakeAdapter(script=[_DISTILLER_RESPONSE, _CARTOGRAPHER_RESPONSE]) + session = PeekSession.create(fake, token_budget=512) + assert session is not None + + +# --------------------------------------------------------------------------- +# Regression: wiring — update_from_run calls policy.update with trajectory str +# --------------------------------------------------------------------------- + + +@pytest.mark.skipif(not PEEK_AVAILABLE, reason="peek-ai not installed") +def test_regression_trajectory_string_passed_to_policy() -> None: + """update_from_run must pass a non-empty string to policy when trace has steps.""" + from peek import CachePolicy + + received: list[str] = [] + + class CapturingClient: + def completion(self, messages): + content = messages[0]["content"] if messages else "" + received.append(content) + # Return valid Distiller JSON on first call, Cartographer JSON on second + idx = len(received) - 1 + return [_DISTILLER_RESPONSE, _CARTOGRAPHER_RESPONSE][idx % 2] + + def last_usage(self): + from peek.core.types import Usage as PeekUsage + + return PeekUsage(input_tokens=10, output_tokens=5) + + policy = CachePolicy(client=CapturingClient(), token_budget=1024) + session = PeekSession(policy) + trace = _make_simple_trace() + session.update_from_run(trace, query="Test question") + + # Distiller should receive a non-empty trajectory in its prompt + assert len(received) >= 1 + distiller_prompt = received[0] + assert "STEP" in distiller_prompt or "Question" in distiller_prompt + assert "FINAL(World)" in distiller_prompt + + +# --------------------------------------------------------------------------- +# Regression: vendor/peek/_io.py extract_json must not catastrophically +# backtrack on LLM output that opens a code fence without closing it. This +# was the bug behind a 45-minute hung benchmark run. +# --------------------------------------------------------------------------- + + +@pytest.mark.skipif(not PEEK_AVAILABLE, reason="peek-ai not installed") +def test_extract_json_handles_unclosed_fence_quickly() -> None: + import time + + from peek._io import extract_json + + # 100KB string with an opening fence and no closing fence — the kind of + # output an LLM might emit when truncated mid-response. The pre-fix + # regex `(.*?)\s*```` with DOTALL would backtrack exponentially here. + payload = "```json\n" + ("a" * 100_000) + + t0 = time.perf_counter() + result = extract_json(payload) + elapsed = time.perf_counter() - t0 + + # Should give up immediately, not hang. The original bug took >2 minutes + # on a 1KB input; we generously allow 1s for a 100KB input. + assert elapsed < 1.0, f"extract_json took {elapsed:.3f}s on unclosed fence" + # Returning None is acceptable; what matters is that it returned at all. + assert result is None or isinstance(result, dict) + + +@pytest.mark.skipif(not PEEK_AVAILABLE, reason="peek-ai not installed") +def test_extract_json_still_finds_valid_fenced_json() -> None: + """Make sure the fix didn't regress the happy path.""" + from peek._io import extract_json + + payload = 'some preamble\n```json\n{"a": 1, "b": [2, 3]}\n```\ntrailing' + result = extract_json(payload) + assert result == {"a": 1, "b": [2, 3]} + + +@pytest.mark.skipif(not PEEK_AVAILABLE, reason="peek-ai not installed") +def test_extract_json_finds_second_block_when_first_invalid() -> None: + from peek._io import extract_json + + payload = '```\nnot json at all\n```\n```json\n{"x": 42}\n```' + result = extract_json(payload) + assert result == {"x": 42} diff --git a/vendor/peek/LICENSE b/vendor/peek/LICENSE new file mode 100644 index 0000000..f49a4e1 --- /dev/null +++ b/vendor/peek/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. \ No newline at end of file diff --git a/vendor/peek/VENDOR.md b/vendor/peek/VENDOR.md new file mode 100644 index 0000000..83ecf33 --- /dev/null +++ b/vendor/peek/VENDOR.md @@ -0,0 +1,30 @@ +# Vendored copy of peek-ai + +This directory contains a vendored copy of `peek-ai` +([zhuohangu/peek](https://github.com/zhuohangu/peek)) at commit +`57de91ac` (2026-05-20). License: Apache-2.0 (see `LICENSE`). + +The package imports as `peek` exactly as the upstream PyPI/Git install +would, so `from peek import CachePolicy` etc. is unchanged. + +## Why vendored + +We patch peek-ai internals (Distiller, Cartographer, Evictor) to +experiment with improvements identified in our benchmark on +`oolongbench/oolong-synth`. Vendoring lets us iterate freely while +keeping a clean upstream reference for diffing. + +## Patches + +Each patch is documented inline at the change site with a `# peek-patch:` +comment and corresponds to a hypothesis in +`docs/peek-bench/PEEK-EXPERIMENTS.md` (Phase 4). + +When the patch list stabilises and improves results meaningfully, the +intent is to contribute them back upstream. + +## Refreshing from upstream + +To re-vendor from a newer upstream commit, replace the files under this +directory with the new content and re-apply the `# peek-patch:` markers. +Update the commit ref above. diff --git a/vendor/peek/__init__.py b/vendor/peek/__init__.py new file mode 100644 index 0000000..a784cd1 --- /dev/null +++ b/vendor/peek/__init__.py @@ -0,0 +1,56 @@ +"""PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents.""" + +from peek.core import ( + SECTIONS, + CachePolicy, + Cartographer, + CartographerOutput, + ContextMap, + Distiller, + DistillerOutput, + Item, + ItemTag, + Operation, + UpdateResult, + Usage, + evict, + update_scores, +) +from peek.llm import LMClient + +__version__ = "0.1.0" + +__all__ = [ + "CachePolicy", + "Cartographer", + "CartographerOutput", + "ContextMap", + "Distiller", + "DistillerOutput", + "Item", + "ItemTag", + "LMClient", + "Operation", + "SECTIONS", + "UpdateResult", + "Usage", + "__version__", + "evict", + "update_scores", +] + + +def __getattr__(name: str): + if name == "OpenAIClient": + from peek.llm.openai_client import OpenAIClient + + return OpenAIClient + if name == "AnthropicClient": + from peek.llm.anthropic_client import AnthropicClient + + return AnthropicClient + if name == "GeminiClient": + from peek.llm.gemini_client import GeminiClient + + return GeminiClient + raise AttributeError(name) diff --git a/vendor/peek/_io.py b/vendor/peek/_io.py new file mode 100644 index 0000000..9ea2ee3 --- /dev/null +++ b/vendor/peek/_io.py @@ -0,0 +1,91 @@ +from __future__ import annotations + +import json +from pathlib import Path +from typing import Any + +PROMPTS_DIR = Path(__file__).resolve().parent / "prompts" + + +def load_prompt(name: str) -> str: + return (PROMPTS_DIR / name).read_text(encoding="utf-8") + + +def extract_json(text: str) -> dict[str, Any] | None: + """Best-effort JSON extraction from an LLM response.""" + s = text.strip() + try: + return json.loads(s) + except json.JSONDecodeError: + pass + + for block in _scan_fenced_blocks(text): + try: + return json.loads(block.strip()) + except json.JSONDecodeError: + continue + + for blob in _scan_balanced_braces(text): + try: + return json.loads(blob) + except json.JSONDecodeError: + continue + return None + + +def _scan_fenced_blocks(text: str) -> list[str]: + """Return the contents of every ```...``` fenced block in ``text``. + + peek-patch BUG-FIX: the upstream implementation used + ``re.findall(r'```(?:json)?\\s*(.*?)\\s*```', text, re.DOTALL | re.IGNORECASE)`` + which can exhibit catastrophic backtracking when the LLM emits an opening + fence without a matching closing fence (a real failure mode observed in + a 45-minute hung benchmark run). This O(n) state machine is bulletproof + against malformed input: unclosed fences are ignored gracefully instead + of triggering exponential regex search. + """ + out: list[str] = [] + i, n = 0, len(text) + while i < n: + start = text.find("```", i) + if start < 0: + return out + # Skip the fence marker; also skip an optional "json" tag and any + # trailing whitespace on that line up to the first newline. + after = start + 3 + nl = text.find("\n", after) + if nl < 0: + return out + end = text.find("```", nl + 1) + if end < 0: + # Unclosed fence: bail out instead of treating the rest of the + # document as the (possibly enormous) match body. + return out + out.append(text[nl + 1 : end]) + i = end + 3 + return out + + +def _scan_balanced_braces(text: str) -> list[str]: + out: list[str] = [] + i, n = 0, len(text) + while i < n: + if text[i] != "{": + i += 1 + continue + depth, start = 1, i + i += 1 + while i < n and depth > 0: + c = text[i] + if c == '"': + i += 1 + while i < n and text[i] != '"': + i += 2 if text[i] == "\\" else 1 + elif c == "{": + depth += 1 + elif c == "}": + depth -= 1 + i += 1 + if depth == 0: + out.append(text[start:i]) + return out diff --git a/vendor/peek/core/__init__.py b/vendor/peek/core/__init__.py new file mode 100644 index 0000000..67ab18c --- /dev/null +++ b/vendor/peek/core/__init__.py @@ -0,0 +1,32 @@ +from peek.core.cartographer import Cartographer +from peek.core.context_map import ContextMap, normalize_section +from peek.core.distiller import Distiller +from peek.core.evictor import evict, update_scores +from peek.core.policy import CachePolicy, UpdateResult +from peek.core.types import ( + SECTIONS, + CartographerOutput, + DistillerOutput, + Item, + ItemTag, + Operation, + Usage, +) + +__all__ = [ + "CachePolicy", + "Cartographer", + "CartographerOutput", + "ContextMap", + "Distiller", + "DistillerOutput", + "Item", + "ItemTag", + "Operation", + "SECTIONS", + "Usage", + "UpdateResult", + "evict", + "normalize_section", + "update_scores", +] diff --git a/vendor/peek/core/cartographer.py b/vendor/peek/core/cartographer.py new file mode 100644 index 0000000..721fce3 --- /dev/null +++ b/vendor/peek/core/cartographer.py @@ -0,0 +1,77 @@ +from __future__ import annotations + +from peek._io import extract_json, load_prompt +from peek.core.context_map import normalize_section +from peek.core.types import SECTIONS, CartographerOutput, Operation +from peek.llm.base import LMClient + +_ALLOWED_SECTIONS: frozenset[str] = frozenset(SECTIONS) + + +class Cartographer: + """Translates Distiller output into structured edits against the context map. + + Edits are validated: only ADD/DELETE/REPLACE operations targeting whitelisted + sections are returned. The operations are intended to be applied via + :meth:`peek.core.context_map.ContextMap.apply`. + """ + + def __init__(self, client: LMClient, prompt: str | None = None): + self.client = client + self.prompt = prompt or load_prompt("cartographer.txt") + + def __call__( + self, + *, + reflection: str, + current_map: str, + question: str, + token_budget: int, + current_tokens: int, + ) -> CartographerOutput: + content = self.prompt.format( + reflection=reflection, + current_playbook=current_map, + question_context=question, + token_budget=token_budget, + current_tokens=current_tokens, + ) + raw = self.client.completion([{"role": "user", "content": content}]) or "" + usage = self.client.last_usage() + return CartographerOutput(raw=raw, operations=_parse_ops(raw), usage=usage) + + +def _parse_ops(raw: str) -> list[Operation]: + parsed = extract_json(raw) + if not isinstance(parsed, dict): + return [] + if not isinstance(parsed.get("reasoning"), str): + return [] + raw_ops = parsed.get("operations") + if not isinstance(raw_ops, list): + return [] + + out: list[Operation] = [] + for op in raw_ops: + if not isinstance(op, dict): + continue + kind = op.get("type") + if kind == "ADD": + section = op.get("section") + content = op.get("content") + if not isinstance(section, str) or not isinstance(content, str) or not content: + continue + section_norm = normalize_section(section) + if section_norm not in _ALLOWED_SECTIONS: + continue + out.append(Operation(type="ADD", section=section_norm, content=content)) + elif kind == "DELETE": + item_id = op.get("item_id") or op.get("bullet_id") + if isinstance(item_id, str) and item_id: + out.append(Operation(type="DELETE", item_id=item_id)) + elif kind == "REPLACE": + item_id = op.get("item_id") or op.get("bullet_id") + content = op.get("content") + if isinstance(item_id, str) and item_id and isinstance(content, str) and content: + out.append(Operation(type="REPLACE", item_id=item_id, content=content)) + return out diff --git a/vendor/peek/core/context_map.py b/vendor/peek/core/context_map.py new file mode 100644 index 0000000..3499bde --- /dev/null +++ b/vendor/peek/core/context_map.py @@ -0,0 +1,144 @@ +"""Section-indexed, line-addressable context map. + +Each item is rendered as a single line: ``[-NNNNN] `` under a +section heading ``##
``. Item IDs are stable across edits so the +Cartographer can reference them by ID. +""" + +from __future__ import annotations + +import re +from dataclasses import dataclass +from pathlib import Path + +from peek._io import load_prompt +from peek.core.types import SECTION_SLUG, SECTIONS, Item, Operation + +_ITEM_RE = re.compile(r"^\[([^\]]+)\]\s*(.*)$") +_SLUG_TAIL = re.compile(r"-(\d+)$") + + +def normalize_section(name: str) -> str: + return name.lower().strip().replace(" ", "_").replace("-", "_").rstrip(":") + + +@dataclass +class ContextMap: + text: str + + @classmethod + def initial(cls) -> ContextMap: + return cls(load_prompt("initial_context_map.txt").rstrip() + "\n") + + @classmethod + def from_file(cls, path: str | Path) -> ContextMap: + return cls(Path(path).read_text(encoding="utf-8")) + + def save(self, path: str | Path) -> None: + p = Path(path) + p.parent.mkdir(parents=True, exist_ok=True) + p.write_text(self.text, encoding="utf-8") + + def items(self) -> list[Item]: + out: list[Item] = [] + section = "general" + for line in self.text.splitlines(): + stripped = line.strip() + if stripped.startswith("##"): + section = normalize_section(stripped.lstrip("#").strip()) + continue + m = _ITEM_RE.match(stripped) + if m: + out.append(Item(id=m.group(1), section=section, content=m.group(2).strip())) + return out + + def item_ids(self) -> list[str]: + return [it.id for it in self.items()] + + def next_id_int(self) -> int: + n = 0 + for it in self.items(): + m = _SLUG_TAIL.search(it.id) + if m: + n = max(n, int(m.group(1))) + return n + 1 + + def apply(self, operations: list[Operation]) -> ContextMap: + if not operations: + return ContextMap(self.text) + + lines = self.text.splitlines() + deletes: set[str] = set() + replaces: dict[str, str] = {} + adds: list[tuple[str, str]] = [] # (section, line) + next_id = self.next_id_int() + + for op in operations: + if op.type == "DELETE" and op.item_id: + deletes.add(op.item_id) + elif op.type == "REPLACE" and op.item_id and op.content: + replaces[op.item_id] = op.content + elif op.type == "ADD" and op.section and op.content: + section = normalize_section(op.section) + slug = SECTION_SLUG.get(section, section[:4]) + line = f"[{slug}-{next_id:05d}] {op.content}" + next_id += 1 + adds.append((section, line)) + + out_lines: list[str] = [] + current: str | None = None + + def flush(section: str | None) -> None: + if section is None: + return + picks = [ln for s, ln in adds if s == section] + if picks: + out_lines.extend(picks) + adds[:] = [(s, ln) for s, ln in adds if s != section] + + for line in lines: + stripped = line.strip() + if stripped.startswith("##"): + flush(current) + if out_lines and out_lines[-1] != "": + out_lines.append("") + current = normalize_section(stripped.lstrip("#").strip()) + out_lines.append(line) + continue + + m = _ITEM_RE.match(stripped) + if m: + bid = m.group(1) + if bid in deletes: + continue + if bid in replaces: + out_lines.append(f"[{bid}] {replaces[bid]}") + continue + out_lines.append(line) + + flush(current) + + # Leftover ADDs (sections not present): append at the end under their headers. + if adds: + buckets: dict[str, list[str]] = {} + for s, ln in adds: + buckets.setdefault(s, []).append(ln) + for section in [s for s in SECTIONS if s in buckets] + [ + s for s in buckets if s not in SECTIONS + ]: + if out_lines and out_lines[-1] != "": + out_lines.append("") + out_lines.append(f"## {section.upper().replace('_', ' ')}") + out_lines.extend(buckets[section]) + + return ContextMap(_collapse_blank_lines("\n".join(out_lines)) + "\n") + + +def _collapse_blank_lines(text: str) -> str: + out: list[str] = [] + for line in text.split("\n"): + if not line.strip(): + if out and not out[-1].strip(): + continue + out.append(line) + return "\n".join(out).rstrip() diff --git a/vendor/peek/core/distiller.py b/vendor/peek/core/distiller.py new file mode 100644 index 0000000..d25b107 --- /dev/null +++ b/vendor/peek/core/distiller.py @@ -0,0 +1,63 @@ +from __future__ import annotations + +from peek._io import extract_json, load_prompt +from peek.core.types import DistillerOutput, ItemTag, Usage +from peek.llm.base import LMClient + +_VALID_TAGS = {"helpful", "harmful", "neutral", "stale"} + + +class Distiller: + """Extracts transferable contextual knowledge from an agent trajectory. + + Produces a diagnosis, per-item tags for the current map, and a set of + cache candidates. Uses the no-ground-truth prompt that shipped with the + paper experiments by default. + """ + + def __init__(self, client: LMClient, prompt: str | None = None): + self.client = client + self.prompt = prompt or load_prompt("distiller.txt") + + def __call__( + self, + trajectory: str, + context_map: str, + *, + question: str = "", + ) -> DistillerOutput: + content = self.prompt.format( + playbook=context_map or "N/A", + trace_history=trajectory, + ) + if question: + content += f"\n\n- Task context (the question the agent was answering):\n{question}\n" + + raw = self.client.completion([{"role": "user", "content": content}]) or "" + usage = self.client.last_usage() + return _parse(raw, usage) + + +def _parse(raw: str, usage: Usage) -> DistillerOutput: + parsed = extract_json(raw) + if not isinstance(parsed, dict): + return DistillerOutput(raw=raw, diagnosis=raw, usage=usage) + + tags_in = parsed.get("item_tags") or parsed.get("bullet_tags") or {} + tags: dict[str, ItemTag] = {} + if isinstance(tags_in, dict): + for k, v in tags_in.items(): + if isinstance(v, str) and v in _VALID_TAGS: + tags[str(k)] = v # type: ignore[assignment] + + candidates = parsed.get("cache_candidates") or [] + if not isinstance(candidates, list): + candidates = [] + + return DistillerOutput( + raw=raw, + diagnosis=str(parsed.get("diagnosis", "")), + item_tags=tags, + cache_candidates=candidates, + usage=usage, + ) diff --git a/vendor/peek/core/evictor.py b/vendor/peek/core/evictor.py new file mode 100644 index 0000000..3f79191 --- /dev/null +++ b/vendor/peek/core/evictor.py @@ -0,0 +1,89 @@ +"""Priority-based eviction enforcing a hard token budget on the context map. + +Items are evicted in ascending order of their accumulated Distiller score, +ties broken by item age (older IDs evicted first). See §3.2 of the PEEK paper. + +peek-patch C3 (score decay): +existing scores are multiplied by ``SCORE_DECAY`` before each tagging pass, +and ``neutral`` tags contribute ``NEUTRAL_PENALTY`` (≥ 0) instead of 0. +Setting both to ``1.0`` / ``0.0`` recovers upstream behaviour exactly. +This removes the score-zero stickiness failure mode documented in +``docs/peek-bench/PEEK-DIAGNOSIS.md`` — items the Distiller cannot +positively endorse drift toward eviction over time, freeing slots for +newer evidence. Validated empirically: +5.7pp aggregate vs upstream on +a 5-context oolong-synth in-session A/B (see ``PEEK-EXPERIMENTS.md`` +Phase 4.C.1). +""" + +from __future__ import annotations + +import re +from collections.abc import Callable + +from peek.core.context_map import ContextMap +from peek.core.types import ItemTag + +_NUMERIC_TAIL = re.compile(r"-(\d+)$") + +# peek-patch C3. Upstream behaviour: SCORE_DECAY=1.0, NEUTRAL_PENALTY=0.0. +SCORE_DECAY: float = 0.85 +NEUTRAL_PENALTY: float = 0.5 + + +def update_scores( + scores: dict[str, float], + tags: dict[str, ItemTag], + *, + decay: float = SCORE_DECAY, + neutral_penalty: float = NEUTRAL_PENALTY, +) -> dict[str, float]: + """Apply decay to ``scores`` then add Distiller tag contributions.""" + out: dict[str, float] = {k: float(v) * decay for k, v in scores.items()} + for item_id, tag in tags.items(): + if tag == "helpful": + out[item_id] = out.get(item_id, 0.0) + 1.0 + elif tag in ("harmful", "stale"): + out[item_id] = out.get(item_id, 0.0) - 1.0 + elif tag == "neutral": + out[item_id] = out.get(item_id, 0.0) - neutral_penalty + else: + out.setdefault(item_id, 0.0) + return out + + +def evict( + cmap: ContextMap, + scores: dict[str, float], + token_budget: int, + token_counter: Callable[[str], int], +) -> ContextMap: + if token_counter(cmap.text) <= token_budget: + return cmap + + ordered_ids = [it.id for it in cmap.items()] + ordered_ids.sort(key=lambda bid: (scores.get(bid, 0), _id_age(bid))) + + removed: set[str] = set() + for bid in ordered_ids: + removed.add(bid) + trial = _strip_items(cmap.text, removed) + if token_counter(trial) <= token_budget: + return ContextMap(trial + "\n" if not trial.endswith("\n") else trial) + return ContextMap(_strip_items(cmap.text, set(ordered_ids))) + + +def _id_age(item_id: str) -> int: + m = _NUMERIC_TAIL.search(item_id) + return int(m.group(1)) if m else 0 + + +def _strip_items(text: str, ids: set[str]) -> str: + out: list[str] = [] + for line in text.splitlines(): + stripped = line.strip() + if stripped.startswith("[") and "]" in stripped: + bid = stripped[1 : stripped.index("]")] + if bid in ids: + continue + out.append(line) + return "\n".join(out) diff --git a/vendor/peek/core/policy.py b/vendor/peek/core/policy.py new file mode 100644 index 0000000..6443a52 --- /dev/null +++ b/vendor/peek/core/policy.py @@ -0,0 +1,156 @@ +"""Programmable cache policy implementing PEEK Algorithm 1. + +The policy wraps a single context map and an LM-backed Distiller and +Cartographer. After each agent run on a recurring external context, the caller +hands the trajectory to :meth:`CachePolicy.update`; for the first +``evolve_steps`` calls the map is updated, otherwise it is reused as-is. +""" + +from __future__ import annotations + +import json +from collections.abc import Callable +from dataclasses import dataclass, field +from pathlib import Path + +from peek.core.cartographer import Cartographer +from peek.core.context_map import ContextMap +from peek.core.distiller import Distiller +from peek.core.evictor import evict, update_scores +from peek.core.types import DistillerOutput, ItemTag, Usage +from peek.llm.base import LMClient + +TokenCounter = Callable[[str], int] + + +def _default_tokenizer() -> TokenCounter: + import tiktoken + + enc = tiktoken.get_encoding("o200k_base") + return lambda s: len(enc.encode(s)) + + +@dataclass +class UpdateResult: + distiller: DistillerOutput + cartographer_raw: str + operations_applied: int + map_text: str + usage: Usage + + +@dataclass +class CachePolicy: + """Maintains a single context map for a recurring external context. + + Parameters + ---------- + client : LMClient + Language-model client used by both Distiller and Cartographer. + token_budget : int + Hard token budget enforced by the Evictor after each update. + evolve_steps : int | None + Number of update calls during which the map is allowed to evolve. + ``None`` means evolve indefinitely (m = n in the paper). + cmap : ContextMap | None + Starting map. Defaults to the paper's initial context map. + token_counter : callable | None + ``str -> int`` token counter. Defaults to ``tiktoken`` ``o200k_base``. + """ + + client: LMClient + token_budget: int = 1024 + evolve_steps: int | None = None + cmap: ContextMap = field(default_factory=ContextMap.initial) + token_counter: TokenCounter | None = None + # peek-patch C3 — scores carry float values after score-decay was introduced. + scores: dict[str, float] = field(default_factory=dict) + steps: int = 0 + + def __post_init__(self) -> None: + self._distiller = Distiller(self.client) + self._cartographer = Cartographer(self.client) + if self.token_counter is None: + self.token_counter = _default_tokenizer() + + @property + def current_map_text(self) -> str: + return self.cmap.text + + @property + def evolving(self) -> bool: + return self.evolve_steps is None or self.steps < self.evolve_steps + + def update( + self, + *, + trajectory: str, + question: str = "", + ) -> UpdateResult | None: + """Run one cache-policy step. Returns ``None`` when evolution is frozen.""" + if not self.evolving: + self.steps += 1 + return None + + distilled = self._distiller( + trajectory, + self.cmap.text, + question=question, + ) + self.scores = update_scores(self.scores, distilled.item_tags) + + assert self.token_counter is not None + edits = self._cartographer( + reflection=distilled.raw, + current_map=self.cmap.text, + question=question, + token_budget=self.token_budget, + current_tokens=self.token_counter(self.cmap.text), + ) + if edits.operations: + self.cmap = self.cmap.apply(edits.operations) + self.cmap = evict(self.cmap, self.scores, self.token_budget, self.token_counter) + self.steps += 1 + + return UpdateResult( + distiller=distilled, + cartographer_raw=edits.raw, + operations_applied=len(edits.operations), + map_text=self.cmap.text, + usage=distilled.usage + edits.usage, + ) + + def tag(self, item_id: str, tag: ItemTag) -> None: + """Apply a manual Distiller-equivalent tag to a single item.""" + self.scores = update_scores(self.scores, {item_id: tag}) + + def save(self, path: str | Path) -> None: + path = Path(path) + path.parent.mkdir(parents=True, exist_ok=True) + payload = { + "map_text": self.cmap.text, + "scores": self.scores, + "steps": self.steps, + "token_budget": self.token_budget, + "evolve_steps": self.evolve_steps, + } + path.write_text(json.dumps(payload, indent=2), encoding="utf-8") + + @classmethod + def load( + cls, + path: str | Path, + *, + client: LMClient, + token_counter: TokenCounter | None = None, + ) -> CachePolicy: + payload = json.loads(Path(path).read_text(encoding="utf-8")) + return cls( + client=client, + token_budget=int(payload.get("token_budget", 1024)), + evolve_steps=payload.get("evolve_steps"), + cmap=ContextMap(payload["map_text"]), + token_counter=token_counter, + scores=dict(payload.get("scores", {})), + steps=int(payload.get("steps", 0)), + ) diff --git a/vendor/peek/core/types.py b/vendor/peek/core/types.py new file mode 100644 index 0000000..3550585 --- /dev/null +++ b/vendor/peek/core/types.py @@ -0,0 +1,68 @@ +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Literal + +ItemTag = Literal["helpful", "harmful", "neutral", "stale"] +OpType = Literal["ADD", "DELETE", "REPLACE"] + +SECTIONS: tuple[str, ...] = ( + "context_roadmap", + "context_understanding", + "domain_constants", + "parsing_schema", + "error_patterns", + "reusable_results", +) + +SECTION_SLUG: dict[str, str] = { + "context_roadmap": "cr", + "context_understanding": "cu", + "domain_constants": "dc", + "parsing_schema": "ps", + "error_patterns": "ep", + "reusable_results": "rr", +} + + +@dataclass(frozen=True) +class Item: + id: str + section: str + content: str + + +@dataclass(frozen=True) +class Operation: + type: OpType + section: str | None = None + item_id: str | None = None + content: str | None = None + + +@dataclass(frozen=True) +class Usage: + input_tokens: int = 0 + output_tokens: int = 0 + + def __add__(self, other: Usage) -> Usage: + return Usage( + self.input_tokens + other.input_tokens, + self.output_tokens + other.output_tokens, + ) + + +@dataclass +class DistillerOutput: + raw: str + diagnosis: str + item_tags: dict[str, ItemTag] = field(default_factory=dict) + cache_candidates: list[dict] = field(default_factory=list) + usage: Usage = field(default_factory=Usage) + + +@dataclass +class CartographerOutput: + raw: str + operations: list[Operation] = field(default_factory=list) + usage: Usage = field(default_factory=Usage) diff --git a/vendor/peek/llm/__init__.py b/vendor/peek/llm/__init__.py new file mode 100644 index 0000000..5752f4c --- /dev/null +++ b/vendor/peek/llm/__init__.py @@ -0,0 +1,20 @@ +from peek.llm.base import LMClient + + +def __getattr__(name: str): + if name == "OpenAIClient": + from peek.llm.openai_client import OpenAIClient + + return OpenAIClient + if name == "AnthropicClient": + from peek.llm.anthropic_client import AnthropicClient + + return AnthropicClient + if name == "GeminiClient": + from peek.llm.gemini_client import GeminiClient + + return GeminiClient + raise AttributeError(name) + + +__all__ = ["LMClient", "OpenAIClient", "AnthropicClient", "GeminiClient"] diff --git a/vendor/peek/llm/anthropic_client.py b/vendor/peek/llm/anthropic_client.py new file mode 100644 index 0000000..15b6943 --- /dev/null +++ b/vendor/peek/llm/anthropic_client.py @@ -0,0 +1,53 @@ +from __future__ import annotations + +import os +from typing import Any + +from peek.core.types import Usage + + +class AnthropicClient: + def __init__( + self, + model: str, + *, + api_key: str | None = None, + max_tokens: int = 4096, + temperature: float | None = None, + ): + try: + import anthropic + except ImportError as e: + raise ImportError( + "AnthropicClient requires `anthropic`. " + "Install with: pip install peek-ai[anthropic]" + ) from e + + self.model = model + self.max_tokens = max_tokens + self.temperature = temperature + self._client = anthropic.Anthropic( + api_key=api_key or os.environ.get("ANTHROPIC_API_KEY"), + ) + self._last = Usage() + + def completion(self, messages: list[dict[str, Any]]) -> str: + system_parts = [m["content"] for m in messages if m.get("role") == "system"] + chat = [m for m in messages if m.get("role") != "system"] + kwargs: dict[str, Any] = {"max_tokens": self.max_tokens} + if self.temperature is not None: + kwargs["temperature"] = self.temperature + if system_parts: + kwargs["system"] = "\n\n".join(system_parts) + resp = self._client.messages.create(model=self.model, messages=chat, **kwargs) + usage = getattr(resp, "usage", None) + if usage is not None: + self._last = Usage( + input_tokens=getattr(usage, "input_tokens", 0) or 0, + output_tokens=getattr(usage, "output_tokens", 0) or 0, + ) + text_blocks = [b.text for b in resp.content if getattr(b, "type", "") == "text"] + return "".join(text_blocks) + + def last_usage(self) -> Usage: + return self._last diff --git a/vendor/peek/llm/base.py b/vendor/peek/llm/base.py new file mode 100644 index 0000000..b76f5d5 --- /dev/null +++ b/vendor/peek/llm/base.py @@ -0,0 +1,18 @@ +from __future__ import annotations + +from typing import Any, Protocol, runtime_checkable + +from peek.core.types import Usage + + +@runtime_checkable +class LMClient(Protocol): + """Minimal LM interface used by Peek. + + Implementations must record per-call usage so :meth:`last_usage` returns + the token counts from the most recent ``completion`` call. + """ + + def completion(self, messages: list[dict[str, Any]]) -> str: ... + + def last_usage(self) -> Usage: ... diff --git a/vendor/peek/llm/gemini_client.py b/vendor/peek/llm/gemini_client.py new file mode 100644 index 0000000..5934ee6 --- /dev/null +++ b/vendor/peek/llm/gemini_client.py @@ -0,0 +1,68 @@ +from __future__ import annotations + +import os +from typing import Any + +from peek.core.types import Usage + + +class GeminiClient: + def __init__( + self, + model: str, + *, + api_key: str | None = None, + temperature: float | None = None, + ): + try: + from google import genai + except ImportError as e: + raise ImportError( + "GeminiClient requires `google-genai`. " + "Install with: pip install peek-ai[gemini]" + ) from e + + self.model = model + self.temperature = temperature + self._client = genai.Client( + api_key=api_key or os.environ.get("GEMINI_API_KEY") or os.environ.get("GOOGLE_API_KEY"), + ) + self._last = Usage() + + def completion(self, messages: list[dict[str, Any]]) -> str: + from google.genai import types as gt + + system = "\n\n".join(m["content"] for m in messages if m.get("role") == "system") or None + contents: list[Any] = [] + for m in messages: + role = m.get("role") + if role == "system": + continue + contents.append( + gt.Content( + role="user" if role in (None, "user") else "model", + parts=[gt.Part.from_text(text=m["content"])], + ) + ) + + config: dict[str, Any] = {} + if system: + config["system_instruction"] = system + if self.temperature is not None: + config["temperature"] = self.temperature + + resp = self._client.models.generate_content( + model=self.model, + contents=contents, + config=gt.GenerateContentConfig(**config) if config else None, + ) + usage = getattr(resp, "usage_metadata", None) + if usage is not None: + self._last = Usage( + input_tokens=getattr(usage, "prompt_token_count", 0) or 0, + output_tokens=getattr(usage, "candidates_token_count", 0) or 0, + ) + return resp.text or "" + + def last_usage(self) -> Usage: + return self._last diff --git a/vendor/peek/llm/openai_client.py b/vendor/peek/llm/openai_client.py new file mode 100644 index 0000000..61a126e --- /dev/null +++ b/vendor/peek/llm/openai_client.py @@ -0,0 +1,55 @@ +"""OpenAI-compatible client. Works with the OpenAI API, Together AI, vLLM, and +any other endpoint that speaks the OpenAI chat-completions protocol. +""" + +from __future__ import annotations + +import os +from typing import Any + +from peek.core.types import Usage + + +class OpenAIClient: + def __init__( + self, + model: str, + *, + api_key: str | None = None, + base_url: str | None = None, + temperature: float | None = None, + extra: dict[str, Any] | None = None, + ): + try: + import openai + except ImportError as e: + raise ImportError( + "OpenAIClient requires `openai`. Install with: pip install peek-ai[openai]" + ) from e + + self.model = model + self.temperature = temperature + self.extra = dict(extra or {}) + self._client = openai.OpenAI( + api_key=api_key or os.environ.get("OPENAI_API_KEY"), + base_url=base_url, + ) + self._last = Usage() + + def completion(self, messages: list[dict[str, Any]]) -> str: + kwargs: dict[str, Any] = dict(self.extra) + if self.temperature is not None: + kwargs["temperature"] = self.temperature + resp = self._client.chat.completions.create( + model=self.model, messages=messages, **kwargs + ) + usage = getattr(resp, "usage", None) + if usage is not None: + self._last = Usage( + input_tokens=getattr(usage, "prompt_tokens", 0) or 0, + output_tokens=getattr(usage, "completion_tokens", 0) or 0, + ) + return resp.choices[0].message.content or "" + + def last_usage(self) -> Usage: + return self._last diff --git a/vendor/peek/prompts/cartographer.txt b/vendor/peek/prompts/cartographer.txt new file mode 100644 index 0000000..ccf1299 --- /dev/null +++ b/vendor/peek/prompts/cartographer.txt @@ -0,0 +1,95 @@ +You are a context map curator. You maintain a concise, high-value context map that is prepended to an RLM (Recursive Language Model) agent. The agent uses a REPL environment to explore long contexts via code execution and sub-LM calls, then assembles a final answer. + +The context map captures the agent's evolving **understanding** of the context — NOT answers to specific questions. Think of it as the mental model a human builds after reading a document: structure, key entities, relationships, and global summaries that help with ANY question about the content. + +**HARD BUDGET: {token_budget} tokens. Current usage: {current_tokens}/{token_budget} tokens.** + +## Instructions + +- Review the latest reflection and the current context map +- **Prioritize items that represent shared understanding** — knowledge useful across many different questions about this context +- **Demote or remove question-specific facts** — items that only help answer one particular query +- Keep items that are **structural, relational, or globally informative** about this context +- Remove items that are stale, misleading, redundant, low-value, or not worth their budget +- Rewrite items when a more compact or more useful version exists +- Add new items only when they represent transferable understanding +- Prefer REPLACE over ADD when possible +- Each item must be short and budget-efficient — **max ~80 tokens per item**. If an item exceeds this, rewrite it more compactly or split it. +- If nothing new is worth keeping, return an empty operations list + +**The litmus test**: For each item, ask "Would a future agent asking a completely *different* question about this context benefit from knowing this?" If not, it probably isn't worth the budget. + +**Value priority** (highest to lowest): +1. **Context understanding**: entity/concept inventories (key actors, data categories, and their roles/relationships), global summaries (what this context is about, key themes), and any structural knowledge that orients the agent for arbitrary questions +2. **Domain constants**: exact values the context defines that computation depends on — numeric thresholds, rates, formulas, conversion factors, reference ranges, enum sets, required output field names/types. These must remain numerically precise — do not abstract them. +3. **Context roadmap**: section/chapter/document index with topics and approximate locations — a Table of Contents the agent won't have to rebuild +4. **Reusable results**: agent-derived aggregated outputs (counts, distributions, classifications) from processing the full context that multiple questions would need. Note the computation method to judge reliability. +5. **Parsing schema**: format observations, delimiters, splitting methods — cheap to rediscover but saves one iteration +6. **Error patterns**: concrete failure modes observed during processing + +**Do NOT add:** +- Facts that answer only one specific question (e.g., a verbatim quote resolving a single query) +- Raw data dumps or lengthy excerpts copied verbatim from the context — abstract these into higher-level understanding +- Advisory rules, warnings, or meta-instructions ("always do X", "never do Y") — these consume budget and are not reliably followed by the agent +- Verbose passages or long excerpts — prefer compact summaries + +**Do NOT abstract away:** +- Exact numeric values (thresholds, rates, formulas, conversion factors) that the context defines for computation +- Reference values, enum sets, or allowed value lists that the context specifies +- Output field names, types, and structural requirements +- These are **domain constants**, not raw data. They must remain precise to be useful. + +**Budget triage** — when near/over budget, cut items in this order (lowest value first): +1. Question-specific facts that only helped one query +2. Error patterns (situational; may not apply to the next question) +3. Parsing schema (cheap to rediscover) +4. Context roadmap items (the agent can rebuild these in one iteration) +5. Reusable results (agent-derived computations) +6. Domain constants and context understanding — protect these most + +## Inputs + +- Reflection from the latest task attempt: +<<>> +{reflection} +<<>> + +- Current context map: +<<>> +{current_playbook} +<<>> + +- Task context (the question the agent was answering): +{question_context} + +## Available Sections + +- `context_roadmap` — a Table of Contents / directory for the context: what documents or sections exist, what topics they cover, and where to find relevant information (like a book's ToC) +- `context_understanding` — the agent's accumulated understanding of the context: key entities/characters and their roles, relationships between concepts, global summaries, data category inventories — knowledge that orients the agent for any question (optional — skip if the initial context map does not include this section) +- `domain_constants` — exact parameters, formulas, thresholds, reference values, enum sets, and output field requirements defined by the context. Keep numerically precise — these are lookup values, not summaries (optional — skip if the initial context map does not include this section) +- `parsing_schema` — how to parse the context format: document delimiters, boundary patterns, field structure, reliable splitting methods (optional — skip if the initial context map does not include this section) +- `error_patterns` — concrete, factual failure modes observed during processing (e.g., "Document X has malformed encoding", "nested tags break naive regex"). NOT advisory rules. (optional — skip if the initial context map does not include this section) +- `reusable_results` — agent-derived outputs from substantive processing of the context (counts, classifications, aggregated computations) that multiple questions would need + +## Available Operations + +1. **ADD** + `{{"type": "ADD", "section": "", "content": ""}}` + +2. **DELETE** + `{{"type": "DELETE", "item_id": ""}}` + +3. **REPLACE** + `{{"type": "REPLACE", "item_id": "", "content": ""}}` + +## Output Format + +Return ONLY a valid JSON object with these exact fields: +{{ + "reasoning": "[Brief explanation of why these edits improve the shared understanding cached in the context map]", + "operations": [ + {{"type": "ADD", "section": "context_understanding", "content": "..."}}, + {{"type": "DELETE", "item_id": "rr-00003"}}, + {{"type": "REPLACE", "item_id": "cr-00001", "content": "..."}} + ] +}} diff --git a/vendor/peek/prompts/distiller.txt b/vendor/peek/prompts/distiller.txt new file mode 100644 index 0000000..49e22dd --- /dev/null +++ b/vendor/peek/prompts/distiller.txt @@ -0,0 +1,121 @@ +You are an expert analyst reviewing a Recursive Language Model (RLM) agent's attempt to answer questions by interacting with long context. + +## What is an RLM? + +An RLM transforms a single `lm.completion(context, query)` call into a controller-style inference loop. Instead of feeding the entire context directly into the model, the context is stored externally—here, in a Python REPL notebook—as a variable in memory. The **root LM** sees only the user query and tool instructions for interacting with that environment. + +The root LM emits code blocks that: +- Inspect slices of the context +- Run string/regex searches +- Chunk or transform data +- Accumulate intermediate results in variables +- Invoke sub-LM calls on selected substrings or derived summaries + +In the minimal implementation, recursion is limited to depth 1, though the design can be extended. After each execution, the REPL returns truncated outputs to the root LM. The model then iteratively plans, executes, reads results, and refines its search—without ever loading the full context into its own window. + +The loop terminates when the model emits `FINAL(...)` or `FINAL_VAR(...)`, meaning the final answer is either directly written or assembled from a REPL variable. + +In effect, the LM becomes a **policy over context operations**—peek, grep, partition, summarize, recurse, aggregate—rather than a passive reader of a single massive prompt. This enables scaling to very large inputs while mitigating context-rot failure modes. + +## Your Task + +You will be provided with: +1. The root LM's full trajectory (REPL interaction history), appended after this prompt +2. The ground truth answer and the agent's predicted answer (Optional) +3. A **context map** currently prepended to the agent as a small fixed-budget blurb + +The context map is a compact **cache** that captures the agent's evolving *understanding* of this context. Its purpose is NOT to store answers to specific questions. Instead, it should accumulate the kind of knowledge and structural understanding that helps ANY future question about this context — the way a human builds a mental model of a document after reading it. + +Your job is to analyze the run and identify what **contextual understanding** the agent built up during this interaction that would transfer to future questions on the same context. + +## Key Principle: Cache Understanding, Not Answers + +Observe what the agent spent iterations doing. Much of its work falls into two categories: +1. **Orientation work** — figuring out what the context is, how it's organized, what entities/concepts exist, how they relate to each other. This understanding transfers to ANY future question. +2. **Question-specific work** — locating the specific passage or fact needed for THIS question. This rarely helps other questions. + +Focus on caching category (1). Ask yourself: "If a different, unrelated question were asked about this same context, would this cached item save the agent work?" + +## Produce three outputs + +### 1. Diagnosis +Briefly explain what went right or wrong in the run. Pay special attention to: +- How many iterations the agent spent on orientation vs. question-specific work +- Whether the agent re-discovered structural information that was already available (or should have been cached) +- What kind of contextual understanding the agent built that could transfer + +### 2. Context Map Item Review +For EVERY item in the current context map, tag it as exactly one of: +- `helpful` — the item directly helped or would directly help this run +- `harmful` — the item is misleading, incorrect, or actively hurts performance +- `neutral` — the item is correct domain knowledge that wasn't relevant to THIS specific question but would plausibly help other questions +- `stale` — the item is outdated, superseded, or no longer accurate + +When tagging, distinguish between "not needed for this question" (neutral) and "not useful for any question" (harmful/stale). Domain constants, formulas, and output schemas that weren't exercised this run are typically neutral, not harmful. + +### 3. Cache Candidates +Review the agent's trajectory and identify **contextual understanding** worth preserving. + +**Prefer abstractions over raw data — but preserve exact parameters.** If the agent copied lengthy records or data dumps, abstract them. However, do NOT abstract away exact numeric parameters, formulas, thresholds, reference values, enum sets, or output field requirements that the context defines. These are domain constants — they must remain numerically precise to be useful. + +**Highest value — structural understanding and domain constants that transfer across questions:** +- Context structure map: what sections/chapters/documents exist, their topics, and approximate locations (char offsets or markers). Like a Table of Contents the agent won't have to rebuild. +- Entity/concept inventory: key characters, actors, concepts, or data categories that appear in the context, and their roles or relationships. A brief "cast of characters" or "glossary" that orients the agent. (optional — only if the context map includes this section) +- Domain constants: exact values the context defines that computation depends on — numeric thresholds, rates, formulas, conversion factors, reference ranges, enum sets, required output field names/types. Keep these precise. (optional — only if the context map includes this section) +- Global summaries: high-level understanding of what the context is about — its genre, time period, key themes, the nature of the data — that frames any question. (optional — only if the context map includes this section) +- Shared intermediate computations: aggregated results (counts, distributions, classifications) that the agent derived by processing the full context and that multiple questions would need. Note how the result was computed. + +**Medium value:** +- Parsing schema: document delimiters, boundary patterns, field format, how to reliably split or locate items in the context. (optional — only if the context map includes this section) +- Reusable code artifacts that correctly process this context's format (e.g., extraction functions, classifiers). +- Error patterns: concrete failure modes observed (e.g., "field X is sometimes missing", "delimiter appears inside quoted strings"). (optional — only if the context map includes this section) + +**Low value — avoid unless budget permits:** +- Facts that answer only one specific question (e.g., a verbatim quote that resolves a single query). +- Verbose passages or long excerpts. Prefer compact summaries. + +**Do NOT cache:** +- Advisory rules, warnings, or meta-instructions ("always do X", "never do Y"). The cache is for understanding, not instructions. +- Results from naive surface-level text operations (e.g., `str.count()` for frequency estimation). +- Verbatim answers to the current question. + +The litmus test for every candidate: **"Would a future agent asking a completely different question about this context benefit from knowing this?"** + +## Inputs + +- Ground truth: +<<>> +[Ground Truth not applicable] +<<>> + +- Agent's result: +<<>> +[Agent's result not applicable] +<<>> + +- Current context map: +<<>> +{playbook} +<<>> + +- Agent's trajectory: +{trace_history} + +## Output Format + +Return a JSON object with exactly these fields: +{{ + "diagnosis": "[Brief analysis of orientation vs. question-specific work, and what transferable understanding the agent built]", + "item_tags": {{ + "": "helpful | harmful | neutral | stale", + ... + }}, + "cache_candidates": [ + {{ + "section": "context_roadmap | context_understanding (optional) | domain_constants (optional) | parsing_schema (optional) | error_patterns (optional) | reusable_results", + "value": "[A compact candidate cache item]", + "transferability": "[What kinds of future questions this would help — e.g., 'any question about character motivations', 'any question requiring locating a specific scene', 'all aggregation questions']", + "rationale": "[Why this represents shared understanding rather than a question-specific fact]" + }} + ] +}} \ No newline at end of file diff --git a/vendor/peek/prompts/initial_context_map.txt b/vendor/peek/prompts/initial_context_map.txt new file mode 100644 index 0000000..487384f --- /dev/null +++ b/vendor/peek/prompts/initial_context_map.txt @@ -0,0 +1,14 @@ +## CONTEXT ROADMAP +(Index of what the context contains and where to find it) + +## CONTEXT UNDERSTANDING +(High-level understanding of the context: what it is, how it's organized, and what matters) + +## DOMAIN CONSTANTS +(Exact parameters, formulas, thresholds, reference values, enum sets, and output field requirements defined by the context.) + +## PARSING SCHEMA +(How to parse and navigate the context's format) + +## REUSABLE RESULTS +(Reusable knowledge about the context)