feat(entitystore): TurboVec-backed vector index, scribe-ready #76
Open
VictorGjn wants to merge 5 commits into
…process test Under pytest fd capture on Windows the inherited stdin handle is invalid and DuplicateHandle raises WinError 6 (~50% of full-dir runs); pass stdin=DEVNULL to the --list-tools subprocess. 6/6 green runs post-fix. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
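A minimal sketch of the fix described in this commit, assuming the test launches the server in a child process to list its tools. The `list_tools` helper and the stand-in command are hypothetical; only the `stdin=subprocess.DEVNULL` argument comes from the commit message.

```python
import subprocess
import sys


def list_tools(cmd):
    """Run a tool-listing command without inheriting the parent's stdin."""
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # do not duplicate a possibly invalid inherited handle
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        timeout=30,
        check=True,
    )
    return result.stdout.splitlines()


# Hypothetical stand-in for the real "--list-tools" invocation.
demo_cmd = [sys.executable, "-c", "print('resolve'); print('rank')"]
tools = list_tools(demo_cmd)
```

Passing an explicit stdin means the child never depends on whatever handle the test runner left in place, which is the property the commit relies on.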
…ul top-k resolve_semantic and resolve_hybrid ranked the cache with a per-entry pure-Python cosine loop; replace it with a single float32 matmul over a stacked matrix. Same results (RRF fusion, min_score floor, return shapes pinned by the new rank tests), ~100x cheaper at corpus scale. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
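The change from a per-entry loop to one matmul can be sketched as follows. This is an illustration under assumptions, not the code in embed_resolve.py: the function name `topk_cosine`, its parameters, and the tie-break by id are hypothetical.

```python
import numpy as np


def topk_cosine(query, matrix, ids, k=10, min_score=0.0):
    """Rank all rows of `matrix` against `query` with a single float32 matmul."""
    q = np.asarray(query, dtype=np.float32)
    m = np.asarray(matrix, dtype=np.float32)
    # Cosine = dot product divided by the product of the norms.
    denom = np.linalg.norm(m, axis=1) * np.linalg.norm(q)
    scores = np.divide(
        m @ q, denom, out=np.zeros(len(m), dtype=np.float32), where=denom > 0
    )
    # Sort by id first, then stably by descending score, so ties resolve by id.
    id_order = np.argsort(np.asarray(ids), kind="stable")
    by_score = np.argsort(-scores[id_order], kind="stable")
    order = id_order[by_score][:k]
    return [(ids[i], float(scores[i])) for i in order if scores[i] >= min_score]


matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
ids = ["b", "c", "d", "a"]
ranked = topk_cosine([1.0, 0.0], matrix, ids, k=10, min_score=0.5)
```

Here `a` and `b` both score 1.0 and are ordered by id, `d` follows at about 0.707, and `c` falls below the `min_score` floor.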
cb_vec.py is the vector seam for the engine carve: TurboQuant IdMapIndex (4-bit) when turbovec is importable, numpy brute-force tier with the same API otherwise (CE_DISABLE_TURBOVEC=1 forces it). Persisted collision-checked str->uint64 id mapping, content-hash invalidation via remove+re-add, allowlist (entity-id subset) filtering, sidecar save/load per corpus. semantic_rank() now serves from the sidecars while the JSON embedding cache stays the byte-identical source of truth; sidecar failure never blocks the JSON path; provider/model swap and >10% drift trigger auto-rebuild. turbovec stays an optional dependency: 97 tests run in both modes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
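One way to build a collision-checked str->uint64 mapping is sketched below. The hash choice (BLAKE2b truncated to 8 bytes) and the name `assign_u64` are assumptions for illustration; the commit does not say how cb_vec.py derives its ids or how it reacts to a collision.

```python
import hashlib


def assign_u64(entity_ids):
    """Map each string id to a stable uint64, refusing silent collisions."""
    forward = {}   # str id -> uint64
    reverse = {}   # uint64 -> str id, used only to detect collisions
    for eid in entity_ids:
        digest = hashlib.blake2b(eid.encode("utf-8"), digest_size=8).digest()
        u64 = int.from_bytes(digest, "little")
        owner = reverse.setdefault(u64, eid)
        if owner != eid:
            raise ValueError(f"uint64 id collision between {owner!r} and {eid!r}")
        forward[eid] = u64
    return forward


id_map = assign_u64(["entity:alpha", "entity:beta", "entity:gamma"])
```

Because the value depends only on the string, the mapping can be persisted and recomputed consistently across runs.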
cb_vec_gate.py measures recall@10 of the quantized index vs the exact float baseline and gates at >=0.98 (shrink-vector-store exit-code conventions: 0 pass / 2 precondition / 3 provenance / 4 gate fail). Runs on a corpus dir or explicit .npy overrides so it can re-validate real scribe-written vectors later. Measured on company-brain/corpora/syroco (2593x1024, 50 queries): recall@10 = 1.0000 on all tiers (numpy, pool, rescore, allowlist n=648), identical under CE_DISABLE_TURBOVEC=1, 0 id collisions. docs/turbovec-readiness.md records the flip-on procedure for scribes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
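The gate logic can be sketched as follows, assuming per-query result lists from the exact baseline and the quantized index. The function names and constants are hypothetical; the exit-code values and the 0.98 threshold are the ones stated in the commit message. The provenance code is defined for completeness but not exercised here.

```python
EXIT_PASS = 0
EXIT_PRECONDITION = 2
EXIT_PROVENANCE = 3   # listed in the commit message; not triggered in this sketch
EXIT_GATE_FAIL = 4


def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of exact top-k ids that the approximate top-k also returns."""
    hits = 0
    total = 0
    for exact, approx in zip(exact_ids, approx_ids):
        truth = set(exact[:k])
        hits += len(truth & set(approx[:k]))
        total += len(truth)
    return hits / total if total else 0.0


def gate(exact_ids, approx_ids, k=10, threshold=0.98):
    """Return a process exit code following the convention above."""
    if not exact_ids or len(exact_ids) != len(approx_ids):
        return EXIT_PRECONDITION
    recall = recall_at_k(exact_ids, approx_ids, k)
    return EXIT_PASS if recall >= threshold else EXIT_GATE_FAIL


exact = [[1, 2, 3], [4, 5, 6]]
approx = [[1, 2, 3], [4, 5, 9]]
demo_recall = recall_at_k(exact, approx, k=3)
demo_rc = gate(exact, approx, k=3)
```

In this toy input five of six true neighbours are recovered, so the recall is below the threshold and the gate reports a failure.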
…ync, gate hardening Addresses the 6 major findings from the PR #76 review:
 - Sidecar meta v2: data files are token-named and referenced by name from meta.json (the atomic commit point), so concurrent savers can no longer cross-pair npy/tvim files; orphans and legacy fixed-name files are swept after the meta commit. Filename references are validated against path traversal at load.
 - Fingerprint as data: save() stamps the source fingerprint captured by the caller when the JSON cache was read/written (VectorStore.source_fp), never a fresh stat, so a concurrent JSON rewrite cannot get our stamp.
 - Batched VectorStore.apply(removals, upserts): one masked drop + one vstack instead of per-entity full-matrix copies; _sync_sidecars now syncs in a single batch (was quadratic in churn).
 - Gate exit-code contract: malformed .npy / 0-d ids / bad dims now exit 2/3 with clear messages instead of raw-traceback rc 1; vacuous numpy self-comparison check removed.
 - Gate reuses cb_vec._assign_u64 and the new cb_vec.over_fetch_k instead of divergent clones, so it now measures the exact production policy.
 - Minors: provider passed on the hot-path sidecar load, two-way tvim entry-count check, allow_pickle=False pinned, query shape-mismatch warning, deterministic id tie-break at the top-k boundary.
entitystore: 140 tests, CE: 329+4, both turbovec and CE_DISABLE_TURBOVEC=1 modes; real-corpus gate (syroco, 2593x1024) recall@10=1.0000, rc=0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
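The first finding, a token-named data file committed through meta.json, can be sketched as follows. The file-name pattern, the meta keys, and the function names are assumptions; the orphan sweep and the tvim file are left out. What comes from the commit message is the idea that meta.json is the atomic commit point, that the caller supplies the fingerprint, that names are validated at load, and that `allow_pickle=False` is pinned.

```python
import json
import os
import secrets
import tempfile
from pathlib import Path

import numpy as np


def save_sidecar(directory, matrix, source_fp):
    """Write a uniquely named data file, then commit it by replacing meta.json."""
    directory = Path(directory)
    token = secrets.token_hex(8)
    data_name = f"vectors-{token}.npy"          # hypothetical naming scheme
    np.save(directory / data_name, np.asarray(matrix, dtype=np.float32))
    meta = {"version": 2, "data": data_name, "source_fp": source_fp}
    tmp = directory / f"meta-{token}.tmp"
    tmp.write_text(json.dumps(meta), encoding="utf-8")
    os.replace(tmp, directory / "meta.json")    # atomic rename: the commit point
    return data_name


def load_sidecar(directory):
    """Load the data file that meta.json names, rejecting path traversal."""
    directory = Path(directory)
    meta = json.loads((directory / "meta.json").read_text(encoding="utf-8"))
    name = meta["data"]
    if name in ("", ".", "..") or name != os.path.basename(name) or "\\" in name:
        raise ValueError(f"unsafe sidecar file name: {name!r}")
    return np.load(directory / name, allow_pickle=False), meta["source_fp"]


with tempfile.TemporaryDirectory() as demo_dir:
    save_sidecar(demo_dir, [[1.0, 2.0], [3.0, 4.0]], "fp-1")
    loaded, loaded_fp = load_sidecar(demo_dir)
```

Two savers racing in the same directory each write their own data file, and whichever replaces meta.json last decides which file readers see, so a reader never pairs one saver's meta with another saver's data.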
Problem
The brain's semantic layer ranks with per-entry pure-Python cosine loops over a JSON cache. Fine at today's ~2.6k entities, but the scribes pipeline will push the corpus to chunk-level scale (100k+), where this becomes the bottleneck — and there is no index structure, no filtered search, and no story for memory growth.
Solution
Adds the vector seam the engine carve builds on, dormant until scribes need it:
- entitystore/scripts/cb_vec.py: TurboQuant IdMapIndex (4-bit, turbovec 0.8.0) when importable; numpy brute-force tier with identical API otherwise. CE_DISABLE_TURBOVEC=1 kill-switch. Persisted collision-checked str→uint64 id map, content-hash invalidation (remove+re-add), allowlist (entity-id subset) filtering for provenance-scoped search, sidecar save/load per corpus.
- semantic_rank() wiring: serves from sidecars; JSON embedding cache stays the byte-identical source of truth; sidecar failure never blocks the JSON path; provider swap / >10% drift auto-rebuilds.
- embed_resolve.py: per-entry cosine loop → one numpy float32 matmul top-k (same results, contracts pinned by new rank tests).
- cb_vec_gate.py: recall gate (recall@10 ≥ 0.98 vs exact float baseline, shrink-vector-store exit codes). Measured on the real company-brain/corpora/syroco corpus (2,593×1,024, 50 queries): recall@10 = 1.0000 on all tiers (pool, rescore, allowlist n=648, numpy fallback), 0 id collisions.
- WinError 6 flake in test_lat_mcp.py (stdin=DEVNULL under pytest fd capture).

turbovec stays an optional dependency: 118 entitystore + 329 context-engineering tests green with it installed and with the kill-switch on. Flip-on procedure for scribes: entitystore/docs/turbovec-readiness.md.

🤖 Generated with Claude Code
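Allowlist filtering on the brute-force tier might look like the sketch below. It assumes the stored rows and the query are already L2-normalized, so a dot product equals cosine similarity; the function name and signature are hypothetical and do not describe the cb_vec.py API.

```python
import numpy as np


def search_allowlist(query, matrix, ids, allow, k=10):
    """Exact top-k over only the rows whose id is in `allow`."""
    m = np.asarray(matrix, dtype=np.float32)
    q = np.asarray(query, dtype=np.float32)
    mask = np.fromiter((i in allow for i in ids), dtype=bool, count=len(ids))
    keep = np.flatnonzero(mask)          # row indices that pass the filter
    if keep.size == 0:
        return []
    scores = m[keep] @ q                 # assumes unit-length rows and query
    top = np.argsort(-scores, kind="stable")[:k]
    return [(ids[keep[j]], float(scores[j])) for j in top]


rows = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
row_ids = ["a", "b", "c"]
scoped = search_allowlist([1.0, 0.0], rows, row_ids, allow={"b", "c"}, k=2)
```

The best overall match `a` is excluded because it is not in the allowlist, which is the behaviour a provenance-scoped search needs.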