
feat(entitystore): TurboVec-backed vector index, scribe-ready #76

Open
VictorGjn wants to merge 5 commits into main from feature/turbovec-semantic-backend


@VictorGjn

Owner

Problem

The brain's semantic layer ranks with per-entry pure-Python cosine loops over a JSON cache. That is fine at today's ~2.6k entities, but the scribes pipeline will push the corpus to chunk-level scale (100k+), where the loop becomes the bottleneck. There is also no index structure, no filtered search, and no plan for memory growth.

Solution

Adds the vector seam the engine carve builds on, dormant until scribes need it:

  • entitystore/scripts/cb_vec.py — TurboQuant IdMapIndex (4-bit, turbovec 0.8.0) when importable; numpy brute-force tier with identical API otherwise. CE_DISABLE_TURBOVEC=1 kill-switch. Persisted collision-checked str→uint64 id map, content-hash invalidation (remove+re-add), allowlist (entity-id subset) filtering for provenance-scoped search, sidecar save/load per corpus.
  • semantic_rank() wiring — serves from sidecars; JSON embedding cache stays the byte-identical source of truth; sidecar failure never blocks the JSON path; provider swap / >10% drift auto-rebuilds.
  • embed_resolve.py — per-entry cosine loop → one numpy float32 matmul top-k (same results, contracts pinned by new rank tests).
  • cb_vec_gate.py — recall gate (recall@10 ≥ 0.98 vs exact float baseline, shrink-vector-store exit codes). Measured on the real company-brain/corpora/syroco corpus (2,593×1,024, 50 queries): recall@10 = 1.0000 on all tiers (pool, rescore, allowlist n=648, numpy fallback), 0 id collisions.
  • Drive-by: fixed the pre-existing ~50% WinError 6 flake in test_lat_mcp.py (stdin=DEVNULL under pytest fd capture).

turbovec stays an optional dependency — 118 entitystore + 329 context-engineering tests green with it installed and with the kill-switch on. Flip-on procedure for scribes: entitystore/docs/turbovec-readiness.md.
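A minimal sketch of how the optional-dependency tiering described above could be decided. The function name `select_backend` and its return values are hypothetical, not taken from `cb_vec.py`; only the `CE_DISABLE_TURBOVEC=1` kill-switch and the `turbovec` package name come from this PR.

```python
import importlib.util
import os


def select_backend(env=None):
    """Return "turbovec" when the optional package is usable, else "numpy".

    Hypothetical helper: the real selection logic in cb_vec.py may differ.
    """
    env = os.environ if env is None else env
    # Kill-switch from the PR description: CE_DISABLE_TURBOVEC=1 forces the fallback.
    if env.get("CE_DISABLE_TURBOVEC") == "1":
        return "numpy"
    # find_spec only locates the package; it does not import it.
    if importlib.util.find_spec("turbovec") is None:
        return "numpy"
    return "turbovec"
```

Passing the environment as a parameter keeps the decision testable in both modes without mutating `os.environ`.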

🤖 Generated with Claude Code

VictorGjn and others added 5 commits June 11, 2026 19:40
…process test

Under pytest fd capture on Windows the inherited stdin handle is invalid
and DuplicateHandle raises WinError 6 (~50% of full-dir runs); pass
stdin=DEVNULL to the --list-tools subprocess. 6/6 green runs post-fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
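For illustration, the fix in the commit above amounts to giving the child process an explicit null stdin instead of letting it inherit the (possibly invalid) parent handle. The sketch below uses a trivial child command as a placeholder; the real test launches the `--list-tools` subprocess, and `run_child` is a hypothetical helper name.

```python
import subprocess
import sys


def run_child(args):
    """Run a Python child process without inheriting the parent's stdin."""
    return subprocess.run(
        [sys.executable, *args],
        stdin=subprocess.DEVNULL,  # avoids duplicating an invalid inherited handle
        capture_output=True,
        text=True,
        timeout=30,
    )


# Placeholder child; the actual test passes --list-tools to the server script.
result = run_child(["-c", "print('ok')"])
```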
…ul top-k

resolve_semantic and resolve_hybrid ranked the cache with a per-entry
pure-Python cosine loop; replace it with a single float32 matmul over a
stacked matrix. Same results (RRF fusion, min_score floor, return shapes
pinned by the new rank tests), ~100x cheaper at corpus scale.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
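A rough sketch of the technique the commit above describes: stack the cached embeddings into one float32 matrix and score every entry with a single matrix-vector product. The function name, signature, and the id-based tie-break are assumptions for illustration; RRF fusion and the exact return shapes of `resolve_semantic` are not reproduced here.

```python
import numpy as np


def top_k_cosine(query, matrix, ids, k=10, min_score=0.0):
    """Rank rows of `matrix` by cosine similarity to `query` (hypothetical helper)."""
    q = np.asarray(query, dtype=np.float32)
    m = np.asarray(matrix, dtype=np.float32)
    # One matmul scores every entry; norms turn dot products into cosines.
    denom = np.linalg.norm(m, axis=1) * np.linalg.norm(q)
    scores = np.divide(
        m @ q, denom, out=np.zeros(len(m), dtype=np.float32), where=denom > 0
    )
    # Descending score, then ascending id so ties at the boundary are deterministic.
    order = sorted(range(len(ids)), key=lambda i: (-scores[i], ids[i]))
    return [(ids[i], float(scores[i])) for i in order[:k] if scores[i] >= min_score]


ranked = top_k_cosine([1.0, 0.0], [[1, 0], [0, 1], [1, 1]], ["a", "b", "c"], k=2)
```

Zero-norm rows score 0.0 rather than raising a division warning, which keeps the `min_score` floor meaningful for empty or degenerate embeddings.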
cb_vec.py is the vector seam for the engine carve: TurboQuant IdMapIndex
(4-bit) when turbovec is importable, numpy brute-force tier with the same
API otherwise (CE_DISABLE_TURBOVEC=1 forces it). Persisted collision-checked
str->uint64 id mapping, content-hash invalidation via remove+re-add,
allowlist (entity-id subset) filtering, sidecar save/load per corpus.

semantic_rank() now serves from the sidecars while the JSON embedding cache
stays the byte-identical source of truth; sidecar failure never blocks the
JSON path; provider/model swap and >10% drift trigger auto-rebuild.

turbovec stays an optional dependency: 97 tests run in both modes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
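To make the id mapping and invalidation rules in the commit above concrete, here is a small sketch under stated assumptions: the hash functions (BLAKE2b truncated to 8 bytes, SHA-256 for content), the `assign_u64` function, and the `TinyStore` class are all hypothetical stand-ins, not the code in `cb_vec.py`.

```python
import hashlib


def assign_u64(entity_id, taken):
    """Map a string id to a uint64, raising if a different id already owns it."""
    digest = hashlib.blake2b(entity_id.encode("utf-8"), digest_size=8).digest()
    u64 = int.from_bytes(digest, "big")
    owner = taken.get(u64)
    if owner is not None and owner != entity_id:
        raise ValueError(f"uint64 collision: {entity_id!r} vs {owner!r}")
    taken[u64] = entity_id
    return u64


class TinyStore:
    """Toy in-memory store illustrating content-hash invalidation."""

    def __init__(self):
        self.u64_to_id = {}  # persisted alongside the index in a real store
        self.hashes = {}     # entity id -> content hash of the embedded text
        self.vectors = {}    # uint64 -> vector

    def upsert(self, entity_id, text, vector):
        """Return True when the index changed, False when the content is unchanged."""
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(entity_id) == content_hash:
            return False
        u64 = assign_u64(entity_id, self.u64_to_id)
        self.vectors.pop(u64, None)       # remove the stale vector, if any
        self.vectors[u64] = list(vector)  # re-add under the same uint64
        self.hashes[entity_id] = content_hash
        return True
```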
cb_vec_gate.py measures recall@10 of the quantized index vs the exact
float baseline and gates at >=0.98 (shrink-vector-store exit-code
conventions: 0 pass / 2 precondition / 3 provenance / 4 gate fail).
Runs on a corpus dir or explicit .npy overrides so it can re-validate
real scribe-written vectors later.

Measured on company-brain/corpora/syroco (2593x1024, 50 queries):
recall@10 = 1.0000 on all tiers (numpy, pool, rescore, allowlist n=648),
identical under CE_DISABLE_TURBOVEC=1, 0 id collisions.
docs/turbovec-readiness.md records the flip-on procedure for scribes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
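The gate logic described in the commit above can be sketched roughly as follows. The exit codes and the 0.98 threshold come from the commit message; the function names and the list-of-lists input format are assumptions, and the provenance check (exit code 3) is not exercised in this sketch.

```python
EXIT_PASS, EXIT_PRECONDITION, EXIT_PROVENANCE, EXIT_GATE_FAIL = 0, 2, 3, 4


def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of exact top-k ids that the approximate index also returned."""
    hits = 0
    total = 0
    for exact, approx in zip(exact_ids, approx_ids):
        truth = set(exact[:k])
        hits += len(truth & set(approx[:k]))
        total += len(truth)
    return hits / total if total else 0.0


def gate(exact_ids, approx_ids, k=10, threshold=0.98):
    """Return an exit code following the conventions listed in the commit message."""
    # Missing or mismatched inputs are a precondition failure, not a gate failure.
    if not exact_ids or len(exact_ids) != len(approx_ids):
        return EXIT_PRECONDITION
    recall = recall_at_k(exact_ids, approx_ids, k)
    return EXIT_PASS if recall >= threshold else EXIT_GATE_FAIL
```

A caller would typically pass the result to `sys.exit()` so CI can distinguish a bad setup from a genuine recall regression.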
…ync, gate hardening

Addresses the 6 major findings from the PR #76 review:

- Sidecar meta v2: data files are token-named and referenced by name from
  meta.json (the atomic commit point) — concurrent savers can no longer
  cross-pair npy/tvim files; orphans and legacy fixed-name files are swept
  after the meta commit. Filename references are validated against path
  traversal at load.
- Fingerprint as data: save() stamps the source fingerprint captured by
  the caller when the JSON cache was read/written (VectorStore.source_fp),
  never a fresh stat — a concurrent JSON rewrite cannot get our stamp.
- Batched VectorStore.apply(removals, upserts): one masked drop + one
  vstack instead of per-entity full-matrix copies; _sync_sidecars now
  syncs in a single batch (was quadratic in churn).
- Gate exit-code contract: malformed .npy / 0-d ids / bad dims now exit
  2/3 with clear messages instead of raw-traceback rc 1; vacuous numpy
  self-comparison check removed.
- Gate reuses cb_vec._assign_u64 and the new cb_vec.over_fetch_k instead
  of divergent clones — it now measures the exact production policy.
- Minors: provider passed on the hot-path sidecar load, two-way tvim
  entry-count check, allow_pickle=False pinned, query shape-mismatch
  warning, deterministic id tie-break at the top-k boundary.
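As an illustration of the sidecar meta v2 idea in the first two items above, the sketch below writes a token-named data file first and then commits by atomically replacing `meta.json`. File names, the meta layout, and both function names are hypothetical; the `.tvim` file and orphan sweeping are omitted for brevity.

```python
import json
import os
import secrets
from pathlib import Path

import numpy as np


def save_sidecar(directory, vectors, source_fp):
    """Write a token-named data file, then commit it by replacing meta.json."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    token = secrets.token_hex(8)
    data_name = f"vectors-{token}.npy"
    np.save(directory / data_name, np.asarray(vectors, dtype=np.float32))
    # source_fp is supplied by the caller, never re-computed from a fresh stat.
    meta = {"version": 2, "data": data_name, "source_fp": source_fp}
    tmp = directory / f"meta-{token}.json.tmp"
    tmp.write_text(json.dumps(meta), encoding="utf-8")
    os.replace(tmp, directory / "meta.json")  # atomic commit point
    return data_name


def load_sidecar(directory):
    """Load the data file named by meta.json, rejecting non-bare filenames."""
    directory = Path(directory)
    meta = json.loads((directory / "meta.json").read_text(encoding="utf-8"))
    name = meta["data"]
    # Path-traversal guard: the reference must be a plain filename.
    if name in ("", ".", "..") or Path(name).name != name:
        raise ValueError(f"unsafe sidecar filename: {name!r}")
    vectors = np.load(directory / name, allow_pickle=False)
    return vectors, meta["source_fp"]
```

Because readers only ever follow the name stored in `meta.json`, two concurrent savers cannot leave a reader with one saver's meta pointing at the other's data.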

entitystore: 140 tests, CE: 329+4, both turbovec and CE_DISABLE_TURBOVEC=1
modes; real-corpus gate (syroco, 2593x1024) recall@10=1.0000, rc=0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
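The batched `VectorStore.apply(removals, upserts)` change from the commit above can be sketched as one boolean mask plus one `vstack`. This is a standalone function over plain arrays with a hypothetical name and signature, not the actual method.

```python
import numpy as np


def apply_batch(ids, matrix, removals, upserts):
    """Apply removals and upserts in one pass.

    ids: list of str, row-aligned with matrix (n, d) float32.
    removals: iterable of ids to drop.
    upserts: dict mapping id -> vector; existing rows for these ids are replaced.
    """
    drop = set(removals) | set(upserts)
    # One masked drop instead of copying the full matrix per removed entity.
    keep = np.array([i not in drop for i in ids], dtype=bool)
    kept_ids = [i for i, flag in zip(ids, keep) if flag]
    kept = matrix[keep]
    if upserts:
        # One vstack appends every new or changed row at once.
        new_rows = np.asarray(list(upserts.values()), dtype=np.float32)
        kept = np.vstack([kept, new_rows])
        kept_ids.extend(upserts.keys())
    return kept_ids, kept
```

With `c` changed entities out of `n`, this does a constant number of full-matrix copies rather than `c` of them, which is where the quadratic cost in churn came from.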
@VictorGjn

Owner Author

@codex review

@chatgpt-codex-connector


To use Codex here, create a Codex account and connect to GitHub.
