explore(agent-wiki): trajectory-derived wiki — skills, builder, experiments by vinodmut · Pull Request #268 · AgentToolkit/altk-evolve

vinodmut · 2026-06-10T05:54:54Z

Related to #256 — this is a prototype of offline trajectory-mining + consolidation ("dreaming"): reviewing saved trajectories to extract, consolidate, deduplicate, and curate memory outside the main task loop, with an auditable record of what changed.

What this is

An exploration in turning agent trajectories into a reusable, evidence-grounded wiki that future agents consult before acting — plus the experiments measuring whether it helps. Everything lives self-contained under explorations/agent-wiki/.

The core idea: after an agent finishes a task, distill its trajectory into wiki pages — episodic summaries, atomic guidelines, themed cluster pages, and executable skills — each linked back to the trajectory that produced it. A future agent, pointed at the wiki's AGENTS.md, retrieves the pages relevant to its task and applies them instead of re-deriving the recipe.

How this maps to #256 ("dreaming")

#256 asks for	provided here
extract useful memories from raw trajectories after the fact	`agent-wiki-summarize` / `-extract-guidelines` / `-synthesize-skill` (retroactive + batch ingest)
consolidate duplicate / overlapping guidelines	`agent-wiki-consolidate-guidelines` → cluster pages
promote repeated observations; detect stale / redundant entities	delete-on-promote (`--archive-covered`), recall roll-up, priority tiers
auditable summary of what changed and why	`_audit.log` + provenance back-links on every page
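
The delete-on-promote and audit rows above can be pictured with a minimal sketch. It is not the behavior of `build_agent_wiki.py` or of `--archive-covered`. The `guidelines/` and `archive/` directory names, the `archive_covered` helper, and the one-JSON-object-per-line shape of `_audit.log` are all assumptions made for illustration.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def archive_covered(wiki: Path, skill_slug: str, covered: list[str]) -> list[str]:
    """Move guideline pages covered by a skill into archive/ and log each move.

    Hypothetical layout: <wiki>/guidelines/<id>.md, <wiki>/archive/, <wiki>/_audit.log.
    """
    archive = wiki / "archive"
    archive.mkdir(exist_ok=True)
    moved = []
    with (wiki / "_audit.log").open("a", encoding="utf-8") as log:
        for gid in covered:
            src = wiki / "guidelines" / f"{gid}.md"
            if not src.exists():
                continue  # already archived or never written
            src.rename(archive / src.name)
            # One JSON object per line records what changed and why (assumed format).
            entry = {"op": "archive", "page": gid, "reason": f"covered by skill {skill_slug}"}
            log.write(json.dumps(entry) + "\n")
            moved.append(gid)
    return moved


def demo() -> tuple[list[str], int]:
    """Build a throwaway wiki, archive one covered guideline, count audit lines."""
    with TemporaryDirectory() as tmp:
        wiki = Path(tmp)
        (wiki / "guidelines").mkdir()
        for gid in ("g1", "g2"):
            (wiki / "guidelines" / f"{gid}.md").write_text("# guideline\n", encoding="utf-8")
        moved = archive_covered(wiki, "parse-exif", ["g1", "missing"])
        lines = (wiki / "_audit.log").read_text(encoding="utf-8").splitlines()
        return moved, len(lines)
```

Writing the audit entry in the same loop iteration as the move keeps the log and the on-disk state from drifting apart, which is the property the "auditable summary" row asks for.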

Layout

explorations/agent-wiki/
├── skills/        7 agent-wiki skills + build_agent_wiki.py (reference copy)
├── docs/          design.md (rationale) + schema.md (on-disk format)
├── experiments/   RESULTS-SUMMARY + comparison reports; metrics/ rollups; harness/ scripts
└── wikis/         worked examples: wiki-twobatch {base, skills, both, pruned}

Headline findings (`experiments/RESULTS-SUMMARY.md`)

Wiki vs no wiki: −20% cost, −38% duration, −43% tool calls at unchanged accuracy (16-task A/B).
Skills > guidelines: a skills-only wiki beats a guidelines-only one on cost (−14%) and matches accuracy.
Pointer wording is load-bearing: a strong-imperative CLAUDE.md pointer is read 3/3; a soft one 1/3.
Composition > size: piling guidelines on top of skills gives the worst of the populated wikis; delete-on-promote (archiving skill-covered atomics) beats it, but skills-only stays cheapest.
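
The percentages above read as signed changes relative to a baseline arm. A minimal sketch of that arithmetic follows. It assumes the change is computed on aggregate values, which the summary does not state, and the inputs are placeholders rather than measurements from the experiments.

```python
def relative_change(baseline: float, treatment: float) -> float:
    """Return the signed percentage change of treatment relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (treatment - baseline) / baseline * 100.0


# Placeholder values, not results from RESULTS-SUMMARY.md.
print(round(relative_change(10.0, 8.0)))  # -20
```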

Scope / data note

These are benchmark-derived example wikis (a synthetic 16-task file-format corpus). Raw per-trial sandbox transcripts and any wikis built from internal trajectory corpora are intentionally excluded — only metric rollups, narrative reports, and the benchmark-derived wikis are included. Source links in wiki frontmatter are shown in the generic trajectories/<session-id>.json form. The skills are a standalone reference copy, not wired into a plugin loader.

Summary by CodeRabbit

New Features
- Introduced agent-wiki exploration: a system for converting agent task trajectories into curated, evidence-grounded wikis containing skills, guidelines, and episode summaries for knowledge reuse and improved task performance.
Documentation
- Added comprehensive design documents, schema references, and usage guides for agent-wiki workflows, including experimental results from multiple wiki retrieval strategies.
Chores
- Updated linting and secret-scanning configurations to exclude generated wiki content from checks.

Adds explorations/agent-wiki/: the agent-wiki skill family, builder, design + schema docs, the wiki-helps experiment reports, and benchmark-derived example wikis, all under one tree suitable for a public PR.

Contents:
- skills/ 7 agent-wiki skills + build_agent_wiki.py (reference copy, not plugin-wired)
- docs/ design.md + schema.md
- experiments/ RESULTS-SUMMARY + twobatch comparison reports + pruned-index-hypothesis; metrics/ rollups (no raw transcripts); harness/ runner + compare scripts
- wikis/ wiki-terminalbench-bob + the twobatch arms (base / skills / both / pruned-corrected)

Public-safety scrub:
- Excluded all raw per-trial sandbox transcripts (kept only metric rollups + narrative reports).
- Excluded wikis built from internal corpora (procedural-design, consult-meta, iterative, retroactive, simple-claude, test-paired, claude) and the build-pattern comparison that ran on them; §3-4 of RESULTS-SUMMARY reduced to a portable-finding note.
- Rewrote all source-path frontmatter to the generic trajectories/<session-id>.json form; genericized internal example names and the benchmark-data dir convention in skills/docs.
- Leak gate (benchmark-data / internal corpus + wiki names / org paths) passes with zero hits across the tree.

Branched off main; diff touches only explorations/agent-wiki/. Builder catalog + comparison scripts verified runnable from the new location.

Removes the terminal-bench example wiki from the exploration. Repoints the README reading-order + layout to wiki-twobatch-skills, fixes the docs that attributed worked examples to it (schema.md now points at the wiki-twobatch arms; example index rows retagged), and corrects stale relative links the docs carried from the original tree (../plugin-source → ../skills, ../WIKIS.md removed, ../experiments/wiki-build-comparison.md → RESULTS-SUMMARY §3–4, design.md/schema.md cross-links to renamed filenames). Skill example paths (consult, ingest) repointed off the removed wiki. Remaining wikis: wiki-twobatch {base, skills, both, pruned}. All intra-doc relative links resolve; leak gate clean.

coderabbitai · 2026-06-10T05:55:03Z

📝 Walkthrough

This PR introduces the agent-wiki exploration framework: a system for distilling agent trajectories into a reusable, evidence-grounded knowledge base comprising episodic summaries, atomic guidelines, consolidated clusters, executable skills, and cross-task comparisons. It includes comprehensive design/schema documentation, multi-arm experimental validation across 17 tasks and 8 experiments, and operational skills guiding end-to-end ingestion and querying workflows, plus configuration updates excluding the exploration from lint/scan tooling.

Changes

Agent-Wiki Framework Design & Schema

Layer / File(s)	Summary
Framework Overview & Rationale `explorations/agent-wiki/README.md`	High-level introduction explaining the exploration goal (post-task distillation into linked pages), repository layout (skills/docs/experiments/wikis), recommended reading order, and scope constraints (benchmark-derived synthetic corpus, no raw transcripts).
Design Principles & Architecture `explorations/agent-wiki/docs/design.md`	Specifies mandatory provenance, five page kinds with retrieval preference order, guidance on procedural vs declarative structure, consolidation and delete-on-promote lifecycle, recall-time discipline, end-to-end build pipeline with execution/ordering constraints between passes, and experimental evidence summary with cross-experiment findings.
On-Disk Schema & Contracts `explorations/agent-wiki/docs/schema.md`	Complete reference documenting directory layout, filename suffixes, YAML frontmatter for each page kind, index/config/audit artifact structures, linking rules (forward authored, reverse recomputed), promotion/archival lifecycle mechanics with worked examples, and coverage inference for skill-driven archival.

Experimental Validation & Result Analysis

Layer / File(s)	Summary
Experiment Harness & Task Suite `explorations/agent-wiki/experiments/harness/experiment_wiki_consult.py`, `wiki_consult_tasks.yaml`	Main orchestrator creating per-trial workspaces, executing Dockerized agent sessions against wikis, parsing stream-json output to detect AGENTS.md and guideline access, and computing outcome matches. Includes 17 prompt-driven tasks covering EXIF/image metadata, codebase exploration, format parsing, and archival/audio analysis, all with explicit outcome matching rules.
Trial Metrics Extraction & Transcript Normalization `explorations/agent-wiki/experiments/harness/extract_trial_metrics.py`, `normalize_stream_json_transcripts.py`	Extracts per-trial metrics (token usage, tool calls, wiki reads) from stream-json transcripts and normalizes diverse transcript formats into OpenAI chat-completion style JSON for downstream analysis.
Result Aggregation & Comparison Reporting `explorations/agent-wiki/experiments/harness/twobatch_compare.py`, `threeway_compare.py`, `fourway_compare.py`, `fiveway_compare.py`	Modular scripts reading JSONL metrics, grouping by task/arm, computing statistics (accuracy, cost, tokens), and rendering Markdown comparison reports with per-family and per-task breakdowns.
Experimental Findings & Reports `explorations/agent-wiki/experiments/RESULTS-SUMMARY.md`, `twobatch-comparison.md`, `twobatch-skills-comparison.md`, `twobatch-fourway-comparison.md`, `twobatch-fiveway-comparison.md`, `pruned-index-hypothesis.md`, `explorations/agent-wiki/experiments/metrics/*`	Comprehensive markdown reports documenting eight multi-arm experiments comparing empty wiki, guidelines-only, skills-only, combined ("both"), and pruned (delete-on-promote) arms. Includes consolidated findings summary, per-experiment write-ups with aggregate/family/per-task tables, metric JSONL data (48–95 records per file), and root-cause analysis of index staleness confound.
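
The aggregation step described in the comparison-report rows (read JSONL metrics, group by arm, compute accuracy and cost) can be sketched as follows. This is not the code in `twobatch_compare.py`. The record fields `arm`, `correct`, and `cost_usd`, the `summarize_by_arm` helper, and the sample rows are hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize_by_arm(jsonl_text: str) -> dict[str, dict[str, float]]:
    """Group per-trial records by arm and compute count, accuracy, and mean cost."""
    groups = defaultdict(list)
    for line in jsonl_text.splitlines():
        if line.strip():  # tolerate blank lines in the metrics file
            record = json.loads(line)
            groups[record["arm"]].append(record)
    return {
        arm: {
            "n": len(records),
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in records),
            "mean_cost": mean(r["cost_usd"] for r in records),
        }
        for arm, records in groups.items()
    }


# Made-up sample rows; field names are assumptions, values are not experiment data.
SAMPLE = "\n".join([
    '{"task": "t1", "arm": "empty", "correct": true, "cost_usd": 0.5}',
    '{"task": "t2", "arm": "empty", "correct": false, "cost_usd": 0.25}',
    '{"task": "t1", "arm": "skills", "correct": true, "cost_usd": 0.25}',
])

summary = summarize_by_arm(SAMPLE)
```

A Markdown report could then be rendered by iterating over `summary` and emitting one table row per arm.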

Operational Skills & Configuration

Layer / File(s)	Summary
Wiki Consultation Skill `explorations/agent-wiki/skills/agent-wiki-consult/SKILL.md`	Defines how agents consult the wiki: resolving wiki root, reading AGENTS.md navigation contract and _index.jsonl index, applying retrieval recipes, and presenting 2–5 ranked guideline matches with rationale.
Guideline Extraction & Skill Synthesis `explorations/agent-wiki/skills/agent-wiki-extract-guidelines/SKILL.md`, `agent-wiki-synthesize-skill/SKILL.md`	Workflows for extracting reusable guideline entities from trajectories (with JSON schema for error/root-cause identification and artifact capture) and promoting successful workflows into executable skill pages (with JSON schema for skill definition, validation, and file/index/audit updates).
Summarization, Consolidation & Task Comparison `explorations/agent-wiki/skills/agent-wiki-summarize/SKILL.md`, `agent-wiki-consolidate-guidelines/SKILL.md`, `agent-wiki-tasks/SKILL.md`, `agent-wiki-ingest/SKILL.md`	Four interconnected skills: summarizing trajectories into episodic pages with recalled-guideline status, consolidating atomic guidelines into themed clusters, comparing cross-trajectory task families with evidence-grounded findings and subtask decomposition, and orchestrating the full pipeline from raw traces through normalization, per-trace processing, consolidation, and catalog refresh.
Configuration Defaults & Templates `explorations/agent-wiki/skills/scripts/_default_agent_wiki_config.yaml`, `_default_agents.md`	Default YAML configuration schema (tags, clusters, tasks, session overrides) and template AGENTS.md contract specifying wiki consultation timing, directory/suffix conventions, retrieval algorithms, provenance chaining, and bootstrap/wrapper behavior.
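
As a rough picture of the consultation step (read `_index.jsonl`, return a short ranked list of matches), here is a minimal sketch. The actual retrieval recipes live in AGENTS.md and are not reproduced here. Ranking by tag overlap, the `id` and `tags` fields, and the `rank_matches` helper are assumptions for illustration only.

```python
import json


def rank_matches(index_jsonl: str, task_terms: set[str], limit: int = 5) -> list[tuple[str, int]]:
    """Rank index rows by how many of their tags overlap the task terms."""
    scored = []
    for line in index_jsonl.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        overlap = len(task_terms & set(row.get("tags", [])))
        if overlap:  # drop rows with no shared tag
            scored.append((row["id"], overlap))
    # Highest overlap first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:limit]


# Made-up index rows with hypothetical ids and tags.
INDEX = "\n".join([
    '{"id": "exif-orientation", "tags": ["exif", "image"]}',
    '{"id": "tar-listing", "tags": ["archive"]}',
    '{"id": "png-chunks", "tags": ["image", "png"]}',
])

matches = rank_matches(INDEX, {"image", "exif"})
```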

Repository Tooling & Configuration

Layer / File(s)	Summary
Repository Configuration Updates `.pre-commit-config.yaml`, `.secrets.baseline`, `pyproject.toml`	Excludes `explorations/agent-wiki/` from detect-secrets scanning (with comments explaining example IDs trigger high-entropy detection), updates .secrets.baseline with adjusted line numbers and file exclusions, and updates Ruff/MyPy in pyproject.toml to exclude generated example-wiki content from linting and type checking.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

visahak
gaodan-fang

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 17.14% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the main addition: a new exploration introducing agent-wiki with skills, builder, and experiments.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

CI (ruff, mypy, detect-secrets) was scanning explorations/agent-wiki/ as project source; this is the first content under explorations/ to carry .py files and high-entropy identifiers.

Fixes, scoped so generated example artifacts are treated like the already-excluded plugin-source/ and examples/ trees:
- ruff: lint + format fixes in the harness scripts + builder; exclude the generated wiki scripts (explorations/agent-wiki/wikis/) via extend-exclude.
- mypy: add explorations/agent-wiki/wikis/ to exclude; add file-local `# mypy: ignore-errors` to the exploration harness + the builder (a verbatim copy of the mypy-excluded plugin-source/ original).
- detect-secrets: exclude explorations/agent-wiki/ in the pre-commit hook and .secrets.baseline; the 53 findings are 12-hex guideline content hashes and session-id UUIDs, not secrets.

No example-wiki content changed (scripts keep their original names).

Fixes failing CI checks: check-formatting, check-linting, check-typing, tekton/pr-code-checks/code-detect-secrets.

Drops explorations/agent-wiki/wikis/ (253 generated files, ~10k lines) from this PR so the diff is the reviewable surface: skills, builder, docs, and the experiment reports/harness (~34 files). The example wikis are machine-generated output; bundling them buried the code and appears to have made CodeRabbit skip deep review (summary only, zero inline findings). The wikis land in a stacked follow-up PR. README/docs still reference wikis/wiki-twobatch-* by path; those links resolve once the follow-up merges. Root-config excludes (ruff/mypy/detect-secrets) are kept: the detect-secrets exclude still covers example content hashes in docs/schema.md, and the wiki excludes become live again when the follow-up lands.

vinodmut · 2026-06-10T07:07:45Z

Split the generated example wikis into a companion PR #269 (merge after this one) so this diff stays focused on the reviewable code — builder, skills, docs, and experiment harness (34 files vs the original 287). This should let CodeRabbit review the code properly.

vinodmut · 2026-06-10T07:09:19Z

@coderabbitai review

coderabbitai · 2026-06-10T07:09:27Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (13)

explorations/agent-wiki/skills/agent-wiki-synthesize-skill/SKILL.md (1)

205-213: ⚡ Quick win

Add language specifier to fenced code block.

The directory structure example should use text or similar language identifier for consistency.

📝 Suggested fix

-```
+```text
 <wiki>/skills/
 ├── _id_index.json                     skill slug → relpath
 ├── index.md                           alphabetical listing (auto-generated)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@explorations/agent-wiki/skills/agent-wiki-synthesize-skill/SKILL.md` around
lines 205 - 213, Update the fenced code block in SKILL.md that shows the
directory tree for "<wiki>/skills/" to include a language specifier (e.g.,
change the opening ``` to ```text) so the block is marked as plain text; locate
the block in the SKILL.md content that begins with the three backticks followed
by the tree and replace the opening fence accordingly to ensure consistent
formatting.

Source: Linters/SAST tools
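
To find other fences with the same problem, a small check along these lines might help. It is a simplified sketch, not the linter that produced this finding: it only recognizes fences made of three backticks, ignores tilde fences and longer fences, and the `unlabeled_fences` helper is hypothetical.

```python
FENCE = "`" * 3  # built indirectly so this example does not contain a literal fence


def unlabeled_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that carry no language tag."""
    missing = []
    inside = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside:
            inside = False  # this fence closes the current block
        else:
            inside = True  # this fence opens a new block
            if stripped == FENCE:
                missing.append(number)
    return missing


# Line 2 opens an unlabeled block; line 5 opens a block labeled "text".
SAMPLE = "\n".join(["intro", FENCE, "tree", FENCE, FENCE + "text", "ok", FENCE])
print(unlabeled_fences(SAMPLE))  # [2]
```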

explorations/agent-wiki/skills/agent-wiki-consult/SKILL.md (2)

53-55: ⚡ Quick win

Add language specifier to fenced code block.

The code block should specify bash as the language for proper syntax highlighting and consistency with the rest of the documentation.

📝 Suggested fix

-```
+```bash
 Read <wiki-root>/AGENTS.md

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>