26 changes: 0 additions & 26 deletions CHANGELOG.md
@@ -2,32 +2,6 @@

## Unreleased

## [2.1.0](https://github.com/PatrickSys/codebase-context/compare/v1.9.0...v2.1.0) (2026-04-13)

### Features

- **search:** surface chunk intelligence directly in `search_codebase` results, including symbol identity, scope, signature preview, and compact/full response budgeting
- **map:** upgrade the conventions map with structural skeleton sections and add `map --export` so the compact map can be written to `CODEBASE_MAP.md`
- **mcp:** rework multi-project routing so one MCP server can serve multiple projects instead of one hardcoded server entry per repo
- **mcp:** keep explicit `project` as the fallback when the client does not provide enough project context
- **mcp:** accept repo paths, subproject paths, and file paths as `project` selectors when routing is ambiguous

### Bug Fixes

- **metadata:** require real dependency evidence plus multiple framework indicators before labeling a repo as Next.js or another specialized framework
- **reranker:** auto-heal corrupted cross-encoder cache entries and surface degraded reranker state in `searchQuality.rerankerStatus`
- **benchmarks:** harden comparator lanes for cross-platform execution and keep setup failures explicit instead of silently turning them into claims
- **search:** auto-heal on corrupted index now triggers a background rebuild instead of blocking the search response

### Documentation

- publish the v2.1.0 discovery benchmark rerun with the current gate output: `pending_evidence`, `claimAllowed: false`, `24` frozen tasks, `0.75` average usefulness, and `1822.25` average estimated tokens
- document the current comparator truth instead of stale assumptions: the public proof still has setup failures plus near-empty comparator outputs on this host, so benchmark win claims remain blocked
- note the new `searchQuality.tokenEstimate` advisory contract: estimates are based on the final serialized response payload and warnings only appear above the 4K-token threshold
- simplify the setup story around a roots-first contract: roots-capable multi-project sessions, single-project fallback, and explicit `project` retries
- clarify that issue #63 fixed the architecture and workspace-aware workflow, but issue #2 is still only partially solved when the client does not provide roots or active-project context
- remove the repo-local `init` / marker-file story from the public setup guidance

## [1.9.0](https://github.com/PatrickSys/codebase-context/compare/v1.8.2...v1.9.0) (2026-03-19)


4 changes: 2 additions & 2 deletions README.md
@@ -20,7 +20,7 @@ Here's what codebase-context does:

One tool call returns all of it. Local-first - your code never leaves your machine by default.

See the [v2.0.0 benchmark](./docs/benchmark.md) for the discovery suite results and current gate truth.
See the [current discovery benchmark](./docs/benchmark.md) for the checked-in proof results and current gate truth.

### What it looks like

@@ -224,7 +224,7 @@ These are the behaviors that make the most difference day-to-day. Copy, trim wha

## Links

- [Benchmark](./docs/benchmark.md) — v2.0.0 discovery suite results and gate truth
- [Benchmark](./docs/benchmark.md) — current discovery suite results and gate truth
- [Demo](./docs/demo.md) — real CLI walkthrough
- [Client Setup](./docs/client-setup.md) — per-client config, HTTP setup, local build testing
- [Capabilities Reference](./docs/capabilities.md) — tool API, retrieval pipeline, decision card schema
34 changes: 15 additions & 19 deletions docs/benchmark.md
@@ -1,6 +1,6 @@
# Discovery Benchmark

This page documents the current public proof slice for `v2.0.0`.
This page documents the current public discovery proof from the checked-in result artifacts on `master`.
It is a discovery benchmark, not an implementation-quality benchmark.

## Scope
@@ -37,48 +37,44 @@ From `results/gate-evaluation.json`:
- `claimAllowed`: `false`
- `totalTasks`: `24`
- `averageUsefulness`: `0.75`
- `averageEstimatedTokens`: `903.7083333333334`
- `averageEstimatedTokens`: `1822.25`
- `bestExampleUsefulnessRate`: `0.125`

Repo-level outputs from the same rerun:

| Repo | Tasks | Avg usefulness | Avg estimated tokens | Best-example usefulness |
| --- | ---: | ---: | ---: | ---: |
| `angular-spotify` | 12 | 0.8333 | 1080.6667 | 0.25 |
| `excalidraw` | 12 | 0.6667 | 726.75 | 0 |
| `angular-spotify` | 12 | 0.8333 | 2138.4167 | 0.25 |
| `excalidraw` | 12 | 0.6667 | 1506.0833 | 0 |
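
The updated gate values can be cross-checked against the updated per-repo rows. The sketch below is not part of the benchmark harness; it assumes the gate aggregates are task-weighted means of the repo-level numbers, and the `RepoRow` shape and `weightedMean` helper are hypothetical names introduced only for this check.

```typescript
// Hypothetical row shape for this check; values are copied from the
// updated (second) pair of rows in the table above.
interface RepoRow {
  repo: string;
  tasks: number;
  avgUsefulness: number;
  avgEstimatedTokens: number;
  bestExampleUsefulness: number;
}

const rows: RepoRow[] = [
  {
    repo: "angular-spotify",
    tasks: 12,
    avgUsefulness: 0.8333,
    avgEstimatedTokens: 2138.4167,
    bestExampleUsefulness: 0.25,
  },
  {
    repo: "excalidraw",
    tasks: 12,
    avgUsefulness: 0.6667,
    avgEstimatedTokens: 1506.0833,
    bestExampleUsefulness: 0,
  },
];

// Task-weighted mean of one numeric column (assumed aggregation rule).
function weightedMean(data: RepoRow[], pick: (row: RepoRow) => number): number {
  const taskCount = data.reduce((sum, row) => sum + row.tasks, 0);
  const weighted = data.reduce((sum, row) => sum + pick(row) * row.tasks, 0);
  return weighted / taskCount;
}

const totalTasks = rows.reduce((sum, row) => sum + row.tasks, 0);
const averageUsefulness = weightedMean(rows, (row) => row.avgUsefulness);
const averageEstimatedTokens = weightedMean(rows, (row) => row.avgEstimatedTokens);
const bestExampleUsefulnessRate = weightedMean(rows, (row) => row.bestExampleUsefulness);

console.log({
  totalTasks,
  averageUsefulness: averageUsefulness.toFixed(2),
  averageEstimatedTokens: averageEstimatedTokens.toFixed(2),
  bestExampleUsefulnessRate: bestExampleUsefulnessRate.toFixed(3),
});
```

Under that assumption the rows reproduce `totalTasks: 24`, `averageUsefulness: 0.75`, `averageEstimatedTokens: 1822.25`, and `bestExampleUsefulnessRate: 0.125` to within the rounding of the four-decimal table values.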

## Gate Truth

The gate is intentionally still blocked.

- The combined suite now covers both public repos.
- The release claim is still disallowed because comparator evidence remains incomplete.
- Missing evidence currently includes:
  - raw Claude Code baseline metrics
  - GrepAI metrics
  - jCodeMunch metrics
  - codebase-memory-mcp metrics
  - CodeGraphContext metrics
- The combined suite covers both public repos.
- `claimAllowed` remains `false` because comparator evidence still does not support a benchmark-win claim.
- Two comparator lanes now return `status: "ok"`, but both are effectively near-empty on the frozen tasks and contribute `0` average usefulness.
- Three comparator lanes still fail setup entirely.

## Comparator Reality

The current comparator artifact records setup failures, not benchmark wins.
The current comparator artifact records incomplete comparator evidence, not benchmark wins.

| Comparator | Status | Current reason |
| --- | --- | --- |
| `codebase-memory-mcp` | `setup_failed` | Installer path still points to the external shell installer |
| `jCodeMunch` | `setup_failed` | MCP server closes during startup |
| `codebase-memory-mcp` | `ok` | Runs, but the checked-in artifact still averages `0` usefulness and `5` estimated tokens per task, so it does not yet contribute meaningful benchmark evidence |
| `jCodeMunch` | `setup_failed` | `MCP error -32000: Connection closed` |
| `GrepAI` | `setup_failed` | Local Go binary and Ollama model path not present |
| `CodeGraphContext` | `setup_failed` | MCP server closes during startup |
| `raw Claude Code` | `setup_failed` | Local `claude` CLI baseline is not installed/authenticated in this environment |
| `CodeGraphContext` | `setup_failed` | `MCP error -32000: Connection closed` |
| `raw Claude Code` | `ok` | Runs, but the checked-in artifact still averages `0` usefulness and only `18.5` estimated tokens per task, so it does not yet contribute meaningful benchmark evidence |

`CodeGraphContext` is explicitly part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start.
`CodeGraphContext` remains part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start.

## Important Limitations

- This benchmark measures discovery usefulness and payload cost only.
- It does not measure implementation correctness, patch quality, or end-to-end task completion.
- Comparator setup is still environment-sensitive, so the gate remains `pending_evidence`.
- Comparator setup remains environment-sensitive, and the checked-in comparator outputs are still too weak to justify a claim.
- The reranker cache is currently corrupted on this machine. During the proof rerun, search fell back to original ordering after `Protobuf parsing failed` while still completing the harness.
- `averageFirstRelevantHit` remains `null` in the current gate output because this compact response surface does not expose a comparable ranked-hit metric across the incomplete comparator set.

4 changes: 3 additions & 1 deletion docs/capabilities.md
@@ -75,7 +75,7 @@ Shared selector inputs:

| Tool | Input | Output |
| ----------------------- | ------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets`, shared `project`/`project_directory` | Ranked results (`file`, `summary`, `score`, `type`, `trend`, `patternWarning`, `relationships`, `hints`) + `searchQuality` + decision card (`ready`, `nextAction`, `patterns`, `bestExample`, `impact`, `whatWouldHelp`) when `intent="edit"`. Hints capped at 3 per category. |
| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets`, shared `project`/`project_directory` | Compact mode returns a bounded result set with `file`, `summary`, `score`, lightweight structural metadata (`symbol`, `symbolKind`, `scope`, `signaturePreview`), and `searchQuality` (`status`, `confidence`, optional `hint`, `tokenEstimate`, `warning`, `rerankerStatus`). Full mode adds richer relationships plus chunk-level `imports`, `exports`, and `complexity`. When `intent="edit"`, a decision card is returned with `ready`, `patterns`, `bestExample`, `impact`, and `whatWouldHelp`. |
| `get_team_patterns` | optional `category`, shared `project`/`project_directory` | Pattern frequencies, trends, golden files, conflicts |
| `get_symbol_references` | `symbol`, optional `limit`, shared `project`/`project_directory` | Concrete symbol usage evidence: `usageCount` + top usage snippets + `confidence` + `isComplete`. `confidence: "syntactic"` means static/source-based only (no runtime or dynamic dispatch). When Tree-sitter + file content are available, comments and string literals are excluded from the scan — the count reflects real identifier nodes only. Replaces the removed `get_component_usage`. |
| `remember` | `type`, `category`, `memory`, `reason`, shared `project`/`project_directory` | Persists to `.codebase-context/memory.json` |
@@ -184,6 +184,8 @@ Ordered by execution:
9. **Symbol-level deduplication** — within each `symbolPath` group, keep only the highest-scoring chunk (prevents duplicate methods from same class clogging results).
10. **Stage-2 reranking** — cross-encoder (`Xenova/ms-marco-MiniLM-L-6-v2`) triggers when the scores of the top files are very close. CPU-only, top-10 bounded.
11. **Result enrichment** — compact type (`componentType:layer`), pattern momentum (`trend` Rising/Declining only, Stable omitted), `patternWarning`, condensed relationships (`importedByCount`/`hasTests`), structured hints (capped callers/consumers/tests ranked by frequency), scope header for symbol-aware snippets (`// ClassName.methodName`), related memories (capped to 3), search quality assessment with `hint` when low confidence.
12. **Payload budgeting** — final serialized search responses include `searchQuality.tokenEstimate`; warnings only appear above the 4K-token threshold and differ between compact and full mode.
13. **Full-mode chunk metadata** — when available, full-mode results surface chunk-level `imports` (top 5), `exports` (top 5), and cyclomatic `complexity`.
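
Steps 9 and 12 above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `Chunk` and `SearchQuality` shapes, the function names, the characters-per-token heuristic, the exact threshold value of `4000`, and the warning text are all assumptions. Only the per-`symbolPath` "keep the highest-scoring chunk" rule, the `searchQuality.tokenEstimate` field, and the warn-only-above-4K behavior come from the list. For brevity the estimate here is taken over the serialized results rather than the complete final response.

```typescript
// Hypothetical result shape; field names are assumptions for illustration.
interface Chunk {
  file: string;
  symbolPath: string;
  score: number;
}

interface SearchQuality {
  tokenEstimate: number;
  warning?: string;
}

// Step 9: within each symbolPath group, keep only the highest-scoring chunk.
function dedupeBySymbolPath(chunks: Chunk[]): Chunk[] {
  const best = new Map<string, Chunk>();
  for (const chunk of chunks) {
    const current = best.get(chunk.symbolPath);
    if (current === undefined || chunk.score > current.score) {
      best.set(chunk.symbolPath, chunk);
    }
  }
  // Return survivors in descending score order.
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}

// Assumed reading of the "4K-token threshold".
const TOKEN_WARNING_THRESHOLD = 4000;

// Assumed heuristic: roughly four characters per token.
function estimateTokens(serialized: string): number {
  return Math.ceil(serialized.length / 4);
}

// Step 12: attach a token estimate and warn only above the threshold.
function budgetPayload(results: Chunk[]): { results: Chunk[]; searchQuality: SearchQuality } {
  const tokenEstimate = estimateTokens(JSON.stringify({ results }));
  const searchQuality: SearchQuality = { tokenEstimate };
  if (tokenEstimate > TOKEN_WARNING_THRESHOLD) {
    searchQuality.warning = "Response exceeds the token budget; consider narrowing the query.";
  }
  return { results, searchQuality };
}

const deduped = dedupeBySymbolPath([
  { file: "player.service.ts", symbolPath: "PlayerService.play", score: 0.91 },
  { file: "player.service.ts", symbolPath: "PlayerService.play", score: 0.72 },
  { file: "queue.store.ts", symbolPath: "QueueStore.enqueue", score: 0.8 },
]);
const response = budgetPayload(deduped);
console.log(response.results.length, response.searchQuality);
```

In this example the two `PlayerService.play` chunks collapse to the higher-scoring one, and the small payload stays under the threshold, so no `warning` is attached.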

### Defaults
