diff --git a/CHANGELOG.md b/CHANGELOG.md
index b48a7fd..6e2de29 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,32 +2,6 @@

 ## Unreleased

-## [2.1.0](https://github.com/PatrickSys/codebase-context/compare/v1.9.0...v2.1.0) (2026-04-13)
-
-### Features
-
-- **search:** surface chunk intelligence directly in `search_codebase` results, including symbol identity, scope, signature preview, and compact/full response budgeting
-- **map:** upgrade the conventions map with structural skeleton sections and add `map --export` so the compact map can be written to `CODEBASE_MAP.md`
-- **mcp:** rework multi-project routing so one MCP server can serve multiple projects instead of one hardcoded server entry per repo
-- **mcp:** keep explicit `project` as the fallback when the client does not provide enough project context
-- **mcp:** accept repo paths, subproject paths, and file paths as `project` selectors when routing is ambiguous
-
-### Bug Fixes
-
-- **metadata:** require real dependency evidence plus multiple framework indicators before labeling a repo as Next.js or another specialized framework
-- **reranker:** auto-heal corrupted cross-encoder cache entries and surface degraded reranker state in `searchQuality.rerankerStatus`
-- **benchmarks:** harden comparator lanes for cross-platform execution and keep setup failures explicit instead of silently turning them into claims
-- **search:** auto-heal on corrupted index now triggers a background rebuild instead of blocking the search response
-
-### Documentation
-
-- publish the v2.1.0 discovery benchmark rerun with the current gate output: `pending_evidence`, `claimAllowed: false`, `24` frozen tasks, `0.75` average usefulness, and `1822.25` average estimated tokens
-- document the current comparator truth instead of stale assumptions: the public proof still has setup failures plus near-empty comparator outputs on this host, so benchmark win claims remain blocked
-- note the new `searchQuality.tokenEstimate` advisory contract: estimates are based on the final serialized response payload and warnings only appear above the 4K-token threshold
-- simplify the setup story around a roots-first contract: roots-capable multi-project sessions, single-project fallback, and explicit `project` retries
-- clarify that issue #63 fixed the architecture and workspace-aware workflow, but issue #2 is still only partially solved when the client does not provide roots or active-project context
-- remove the repo-local `init` / marker-file story from the public setup guidance
-
 ## [1.9.0](https://github.com/PatrickSys/codebase-context/compare/v1.8.2...v1.9.0) (2026-03-19)
diff --git a/README.md b/README.md
index b8c060a..9ec1d4c 100644
--- a/README.md
+++ b/README.md
@@ -20,7 +20,7 @@ Here's what codebase-context does:

 One tool call returns all of it. Local-first - your code never leaves your machine by default.

-See the [v2.0.0 benchmark](./docs/benchmark.md) for the discovery suite results and current gate truth.
+See the [current discovery benchmark](./docs/benchmark.md) for the checked-in proof results and current gate truth.

 ### What it looks like

@@ -224,7 +224,7 @@ These are the behaviors that make the most difference day-to-day. Copy, trim wha

 ## Links

-- [Benchmark](./docs/benchmark.md) — v2.0.0 discovery suite results and gate truth
+- [Benchmark](./docs/benchmark.md) — current discovery suite results and gate truth
 - [Demo](./docs/demo.md) — real CLI walkthrough
 - [Client Setup](./docs/client-setup.md) — per-client config, HTTP setup, local build testing
 - [Capabilities Reference](./docs/capabilities.md) — tool API, retrieval pipeline, decision card schema
diff --git a/docs/benchmark.md b/docs/benchmark.md
index 1ea836e..4b9a5e8 100644
--- a/docs/benchmark.md
+++ b/docs/benchmark.md
@@ -1,6 +1,6 @@
 # Discovery Benchmark

-This page documents the current public proof slice for `v2.0.0`.
+This page documents the current public discovery proof from the checked-in result artifacts on `master`.
 It is a discovery benchmark, not an implementation-quality benchmark.

 ## Scope
@@ -37,48 +37,44 @@ From `results/gate-evaluation.json`:
 - `claimAllowed`: `false`
 - `totalTasks`: `24`
 - `averageUsefulness`: `0.75`
-- `averageEstimatedTokens`: `903.7083333333334`
+- `averageEstimatedTokens`: `1822.25`
 - `bestExampleUsefulnessRate`: `0.125`

 Repo-level outputs from the same rerun:

 | Repo | Tasks | Avg usefulness | Avg estimated tokens | Best-example usefulness |
 | --- | ---: | ---: | ---: | ---: |
-| `angular-spotify` | 12 | 0.8333 | 1080.6667 | 0.25 |
-| `excalidraw` | 12 | 0.6667 | 726.75 | 0 |
+| `angular-spotify` | 12 | 0.8333 | 2138.4167 | 0.25 |
+| `excalidraw` | 12 | 0.6667 | 1506.0833 | 0 |

 ## Gate Truth

 The gate is intentionally still blocked.

-- The combined suite now covers both public repos.
-- The release claim is still disallowed because comparator evidence remains incomplete.
-- Missing evidence currently includes:
-  - raw Claude Code baseline metrics
-  - GrepAI metrics
-  - jCodeMunch metrics
-  - codebase-memory-mcp metrics
-  - CodeGraphContext metrics
+- The combined suite covers both public repos.
+- `claimAllowed` remains `false` because comparator evidence still does not support a benchmark-win claim.
+- Two comparator lanes now return `status: "ok"`, but both are effectively near-empty on the frozen tasks and contribute `0` average usefulness.
+- Three comparator lanes still fail setup entirely.

 ## Comparator Reality

-The current comparator artifact records setup failures, not benchmark wins.
+The current comparator artifact records incomplete comparator evidence, not benchmark wins.

 | Comparator | Status | Current reason |
 | --- | --- | --- |
-| `codebase-memory-mcp` | `setup_failed` | Installer path still points to the external shell installer |
-| `jCodeMunch` | `setup_failed` | MCP server closes during startup |
+| `codebase-memory-mcp` | `ok` | Runs, but the checked-in artifact still averages `0` usefulness and `5` estimated tokens per task, so it does not yet contribute meaningful benchmark evidence |
+| `jCodeMunch` | `setup_failed` | `MCP error -32000: Connection closed` |
 | `GrepAI` | `setup_failed` | Local Go binary and Ollama model path not present |
-| `CodeGraphContext` | `setup_failed` | MCP server closes during startup |
-| `raw Claude Code` | `setup_failed` | Local `claude` CLI baseline is not installed/authenticated in this environment |
+| `CodeGraphContext` | `setup_failed` | `MCP error -32000: Connection closed` |
+| `raw Claude Code` | `ok` | Runs, but the checked-in artifact still averages `0` usefulness and only `18.5` estimated tokens per task, so it does not yet contribute meaningful benchmark evidence |

-`CodeGraphContext` is explicitly part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start.
+`CodeGraphContext` remains part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start.

 ## Important Limitations

 - This benchmark measures discovery usefulness and payload cost only.
 - It does not measure implementation correctness, patch quality, or end-to-end task completion.
-- Comparator setup is still environment-sensitive, so the gate remains `pending_evidence`.
+- Comparator setup remains environment-sensitive, and the checked-in comparator outputs are still too weak to justify a claim.
 - The reranker cache is currently corrupted on this machine. During the proof rerun, search fell back to original ordering after `Protobuf parsing failed` while still completing the harness.
- `averageFirstRelevantHit` remains `null` in the current gate output because this compact response surface does not expose a comparable ranked-hit metric across the incomplete comparator set. diff --git a/docs/capabilities.md b/docs/capabilities.md index 67b5ef1..03db7ca 100644 --- a/docs/capabilities.md +++ b/docs/capabilities.md @@ -75,7 +75,7 @@ Shared selector inputs: | Tool | Input | Output | | ----------------------- | ------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets`, shared `project`/`project_directory` | Ranked results (`file`, `summary`, `score`, `type`, `trend`, `patternWarning`, `relationships`, `hints`) + `searchQuality` + decision card (`ready`, `nextAction`, `patterns`, `bestExample`, `impact`, `whatWouldHelp`) when `intent="edit"`. Hints capped at 3 per category. | +| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets`, shared `project`/`project_directory` | Compact mode returns a bounded result set with `file`, `summary`, `score`, lightweight structural metadata (`symbol`, `symbolKind`, `scope`, `signaturePreview`), and `searchQuality` (`status`, `confidence`, optional `hint`, `tokenEstimate`, `warning`, `rerankerStatus`). Full mode adds richer relationships plus chunk-level `imports`, `exports`, and `complexity`. When `intent="edit"`, a decision card is returned with `ready`, `patterns`, `bestExample`, `impact`, and `whatWouldHelp`. 
| | `get_team_patterns` | optional `category`, shared `project`/`project_directory` | Pattern frequencies, trends, golden files, conflicts | | `get_symbol_references` | `symbol`, optional `limit`, shared `project`/`project_directory` | Concrete symbol usage evidence: `usageCount` + top usage snippets + `confidence` + `isComplete`. `confidence: "syntactic"` means static/source-based only (no runtime or dynamic dispatch). When Tree-sitter + file content are available, comments and string literals are excluded from the scan — the count reflects real identifier nodes only. Replaces the removed `get_component_usage`. | | `remember` | `type`, `category`, `memory`, `reason`, shared `project`/`project_directory` | Persists to `.codebase-context/memory.json` | @@ -184,6 +184,8 @@ Ordered by execution: 9. **Symbol-level deduplication** — within each `symbolPath` group, keep only the highest-scoring chunk (prevents duplicate methods from same class clogging results). 10. **Stage-2 reranking** — cross-encoder (`Xenova/ms-marco-MiniLM-L-6-v2`) triggers when the score between the top files are very close. CPU-only, top-10 bounded. 11. **Result enrichment** — compact type (`componentType:layer`), pattern momentum (`trend` Rising/Declining only, Stable omitted), `patternWarning`, condensed relationships (`importedByCount`/`hasTests`), structured hints (capped callers/consumers/tests ranked by frequency), scope header for symbol-aware snippets (`// ClassName.methodName`), related memories (capped to 3), search quality assessment with `hint` when low confidence. +12. **Payload budgeting** — final serialized search responses include `searchQuality.tokenEstimate`; warnings only appear above the 4K-token threshold and differ between compact and full mode. +13. **Full-mode chunk metadata** — when available, full-mode results surface chunk-level `imports` (top 5), `exports` (top 5), and cyclomatic `complexity`. ### Defaults