feat(entitystore): TurboVec-backed vector index, scribe-ready by VictorGjn · Pull Request #76 · VictorGjn/agent-skills

VictorGjn · 2026-06-12T07:36:09Z

Problem

The brain's semantic layer ranks with per-entry pure-Python cosine loops over a JSON cache. Fine at today's ~2.6k entities, but the scribes pipeline will push the corpus to chunk-level scale (100k+), where this becomes the bottleneck — and there is no index structure, no filtered search, and no story for memory growth.

Solution

Adds the vector seam the engine carve builds on, dormant until scribes need it:

- entitystore/scripts/cb_vec.py: TurboQuant IdMapIndex (4-bit, turbovec 0.8.0) when importable; numpy brute-force tier with identical API otherwise. CE_DISABLE_TURBOVEC=1 kill-switch. Persisted collision-checked str→uint64 id map, content-hash invalidation (remove+re-add), allowlist (entity-id subset) filtering for provenance-scoped search, sidecar save/load per corpus.
- semantic_rank() wiring: serves from sidecars; JSON embedding cache stays the byte-identical source of truth; sidecar failure never blocks the JSON path; provider swap / >10% drift auto-rebuilds.
- embed_resolve.py: per-entry cosine loop → one numpy float32 matmul top-k (same results, contracts pinned by new rank tests).
- cb_vec_gate.py: recall gate (recall@10 ≥ 0.98 vs exact float baseline, shrink-vector-store exit codes). Measured on the real company-brain/corpora/syroco corpus (2,593×1,024, 50 queries): recall@10 = 1.0000 on all tiers (pool, rescore, allowlist n=648, numpy fallback), 0 id collisions.
- Drive-by: fixed the pre-existing ~50% WinError 6 flake in test_lat_mcp.py (stdin=DEVNULL under pytest fd capture).

turbovec stays an optional dependency — 118 entitystore + 329 context-engineering tests green with it installed and with the kill-switch on. Flip-on procedure for scribes: entitystore/docs/turbovec-readiness.md.
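For readers unfamiliar with the optional-dependency pattern, the tier selection could look roughly like the sketch below. It is not taken from cb_vec.py: the function name `select_backend` is hypothetical, and treating exactly the value "1" as the kill-switch is an assumption based on the CE_DISABLE_TURBOVEC=1 wording above. It only checks whether turbovec is importable and never calls into it.

```python
import importlib.util
import os


def select_backend(environ=None):
    """Return "turbovec" or "numpy" (hypothetical helper, not the PR's code)."""
    env = os.environ if environ is None else environ
    # Assumption: the kill-switch is active only when the variable equals "1".
    if env.get("CE_DISABLE_TURBOVEC") == "1":
        return "numpy"
    try:
        found = importlib.util.find_spec("turbovec") is not None
    except (ImportError, ValueError):
        found = False
    # Fall back to the brute-force tier whenever turbovec is not importable.
    return "turbovec" if found else "numpy"


if __name__ == "__main__":
    print(select_backend())
```

Resolving the tier in one place keeps callers unaware of which backend answered, which is what lets the same tests run in both modes.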

🤖 Generated with Claude Code

…process test

Under pytest fd capture on Windows the inherited stdin handle is invalid and DuplicateHandle raises WinError 6 (~50% of full-dir runs); pass stdin=DEVNULL to the --list-tools subprocess. 6/6 green runs post-fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
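A minimal sketch of the fix described in this commit, assuming the test launches its child with `subprocess.run`. The helper name `run_without_stdin` and the stand-in child command are hypothetical; the real test invokes the server with --list-tools.

```python
import subprocess
import sys


def run_without_stdin(cmd):
    """Run cmd with stdin tied to the null device instead of the parent's handle."""
    proc = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # never inherit a possibly invalid stdin handle
        capture_output=True,
        text=True,
        timeout=30,
        check=True,
    )
    return proc.stdout.strip()


if __name__ == "__main__":
    # Stand-in child process; the actual test passes --list-tools to the server.
    print(run_without_stdin([sys.executable, "-c", "print('ok')"]))
```

With `stdin=subprocess.DEVNULL` the child sees an immediate end-of-file on stdin, so nothing depends on whatever handle pytest left in place.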

…ul top-k

resolve_semantic and resolve_hybrid ranked the cache with a per-entry pure-Python cosine loop; replace it with a single float32 matmul over a stacked matrix. Same results (RRF fusion, min_score floor, return shapes pinned by the new rank tests), ~100x cheaper at corpus scale.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
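The loop-to-matmul change can be illustrated with a small sketch. This is not the code from embed_resolve.py: the function name `top_k_cosine`, its signature, and the stable-sort tie-break are assumptions made for illustration, and RRF fusion is left out.

```python
import numpy as np


def top_k_cosine(query, matrix, k, min_score=float("-inf")):
    """Rank rows of `matrix` by cosine similarity to `query` with one matmul."""
    m = np.asarray(matrix, dtype=np.float32)
    q = np.asarray(query, dtype=np.float32)
    denom = np.linalg.norm(m, axis=1) * np.linalg.norm(q)
    denom = np.where(denom > 0, denom, 1.0)  # zero vectors score 0, not NaN
    scores = (m @ q) / denom                 # all cosines in a single product
    # A stable sort keeps the lower row index first when scores tie.
    order = np.argsort(-scores, kind="stable")[:k]
    return [(int(i), float(scores[i])) for i in order if scores[i] >= min_score]


if __name__ == "__main__":
    rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k_cosine([1.0, 0.0], rows, k=2))
```

A full sort is used here for brevity; at 100k+ rows `np.argpartition` followed by a sort of the k survivors would avoid sorting the whole score vector.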

cb_vec.py is the vector seam for the engine carve: TurboQuant IdMapIndex (4-bit) when turbovec is importable, numpy brute-force tier with the same API otherwise (CE_DISABLE_TURBOVEC=1 forces it). Persisted collision-checked str->uint64 id mapping, content-hash invalidation via remove+re-add, allowlist (entity-id subset) filtering, sidecar save/load per corpus. semantic_rank() now serves from the sidecars while the JSON embedding cache stays the byte-identical source of truth; sidecar failure never blocks the JSON path; provider/model swap and >10% drift trigger auto-rebuild. turbovec stays an optional dependency: 97 tests run in both modes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
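One way a collision-checked str->uint64 mapping with content-hash invalidation could be structured is sketched below. The class `IdMap`, the use of BLAKE2b for ids and SHA-256 for content, and the `plan_upsert` return shape are all assumptions; the PR does not show how cb_vec.py derives ids or hashes content.

```python
import hashlib


class IdMap:
    """Hypothetical str -> uint64 map that refuses silent id collisions."""

    def __init__(self):
        self.str_to_u64 = {}
        self.u64_to_str = {}
        self.content_hash = {}

    def assign(self, entity_id):
        if entity_id in self.str_to_u64:
            return self.str_to_u64[entity_id]
        # Assumption: an 8-byte BLAKE2b digest serves as the uint64 id.
        digest = hashlib.blake2b(entity_id.encode("utf-8"), digest_size=8).digest()
        u64 = int.from_bytes(digest, "big")
        if u64 in self.u64_to_str:
            raise ValueError(f"id collision: {entity_id!r} vs {self.u64_to_str[u64]!r}")
        self.str_to_u64[entity_id] = u64
        self.u64_to_str[u64] = entity_id
        return u64

    def plan_upsert(self, entity_id, text):
        """Return the index operations needed for this entity's current text."""
        new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        old_hash = self.content_hash.get(entity_id)
        if old_hash == new_hash:
            return []                    # unchanged content: nothing to do
        self.assign(entity_id)
        self.content_hash[entity_id] = new_hash
        # Changed content is invalidated by removing the old vector first.
        return ["add"] if old_hash is None else ["remove", "add"]
```

Keeping the reverse map is what makes the collision check cheap: a second string hashing to an already-owned id fails loudly instead of overwriting another entity's vector.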

cb_vec_gate.py measures recall@10 of the quantized index vs the exact float baseline and gates at >=0.98 (shrink-vector-store exit-code conventions: 0 pass / 2 precondition / 3 provenance / 4 gate fail). Runs on a corpus dir or explicit .npy overrides so it can re-validate real scribe-written vectors later. Measured on company-brain/corpora/syroco (2593x1024, 50 queries): recall@10 = 1.0000 on all tiers (numpy, pool, rescore, allowlist n=648), identical under CE_DISABLE_TURBOVEC=1, 0 id collisions. docs/turbovec-readiness.md records the flip-on procedure for scribes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
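For reference, recall@k against an exact baseline and the mapping onto the exit codes listed above can be sketched as follows. The function names are hypothetical, and returning the precondition code for empty or mismatched inputs is an assumption rather than documented behaviour of cb_vec_gate.py.

```python
EXIT_PASS, EXIT_PRECONDITION, EXIT_PROVENANCE, EXIT_GATE_FAIL = 0, 2, 3, 4


def recall_at_k(exact, approx, k=10):
    """Fraction of the exact top-k ids that the approximate top-k also returns."""
    hits = total = 0
    for truth_ids, approx_ids in zip(exact, approx):
        truth = set(truth_ids[:k])
        hits += len(truth & set(approx_ids[:k]))
        total += len(truth)
    return hits / total if total else 0.0


def gate(exact, approx, k=10, threshold=0.98):
    """Return (exit_code, recall) for per-query result lists."""
    # Assumption: unusable inputs map to the precondition exit code.
    if not exact or len(exact) != len(approx):
        return EXIT_PRECONDITION, 0.0
    recall = recall_at_k(exact, approx, k)
    return (EXIT_PASS if recall >= threshold else EXIT_GATE_FAIL), recall


if __name__ == "__main__":
    print(gate([[1, 2, 3]], [[3, 2, 1]], k=3))
```

Order inside the top-k does not matter for recall, so a quantized index that returns the right ids in a different order still passes.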

…ync, gate hardening

Addresses the 6 major findings from the PR #76 review:
- Sidecar meta v2: data files are token-named and referenced by name from meta.json (the atomic commit point), so concurrent savers can no longer cross-pair npy/tvim files; orphans and legacy fixed-name files are swept after the meta commit. Filename references are validated against path traversal at load.
- Fingerprint as data: save() stamps the source fingerprint captured by the caller when the JSON cache was read/written (VectorStore.source_fp), never a fresh stat, so a concurrent JSON rewrite cannot get our stamp.
- Batched VectorStore.apply(removals, upserts): one masked drop + one vstack instead of per-entity full-matrix copies; _sync_sidecars now syncs in a single batch (was quadratic in churn).
- Gate exit-code contract: malformed .npy / 0-d ids / bad dims now exit 2/3 with clear messages instead of raw-traceback rc 1; vacuous numpy self-comparison check removed.
- Gate reuses cb_vec._assign_u64 and the new cb_vec.over_fetch_k instead of divergent clones, so it now measures the exact production policy.
- Minors: provider passed on the hot-path sidecar load, two-way tvim entry-count check, allow_pickle=False pinned, query shape-mismatch warning, deterministic id tie-break at the top-k boundary.

entitystore: 140 tests, CE: 329+4, both turbovec and CE_DISABLE_TURBOVEC=1 modes; real-corpus gate (syroco, 2593x1024) recall@10=1.0000, rc=0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
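The token-named data file plus meta.json commit point from the first bullet can be illustrated with a stdlib-only sketch. File naming, the meta schema, and both function names are hypothetical; the real sidecars hold npy/tvim files and carry more metadata than shown here.

```python
import json
import os
import secrets
from pathlib import Path

META = "meta.json"


def save_sidecar(directory, payload):
    """Write a token-named data file, then commit it by replacing meta.json."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    token = secrets.token_hex(8)
    data_name = f"vectors.{token}.bin"       # hypothetical naming scheme
    (directory / data_name).write_bytes(payload)
    tmp = directory / f"{META}.{token}.tmp"
    tmp.write_text(json.dumps({"version": 2, "data": data_name}))
    os.replace(tmp, directory / META)        # atomic commit point
    # Sweep orphans only after the commit, keeping whatever meta.json now names.
    live = json.loads((directory / META).read_text())["data"]
    for stale in directory.glob("vectors.*.bin"):
        if stale.name != live:
            stale.unlink(missing_ok=True)
    return data_name


def load_sidecar(directory):
    """Load the data file named by meta.json, rejecting path traversal."""
    directory = Path(directory)
    name = json.loads((directory / META).read_text())["data"]
    if "/" in name or "\\" in name or name in {"", ".", ".."}:
        raise ValueError(f"unsafe sidecar reference: {name!r}")
    return (directory / name).read_bytes()
```

Because readers only ever follow the name stored in meta.json, a saver that loses the race leaves behind an orphan file rather than a mismatched pair, and the sweep cleans it up later.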

VictorGjn · 2026-06-12T15:41:38Z

@codex review

chatgpt-codex-connector · 2026-06-12T15:41:45Z

To use Codex here, create a Codex account and connect to github.

VictorGjn and others added 5 commits June 11, 2026 19:40

VictorGjn wants to merge 5 commits into main from feature/turbovec-semantic-backend
