Extract vector encoding queue #25
Vector embedding queues were only available inside msgvault, which made other Kenn tools copy or reimplement the same crash-safe claim/release mechanics. Moving the SQL-backed task queue into kit gives callers a shared, app-neutral helper while preserving msgvault's existing table shape.

The package keeps schema ownership with callers and validates SQL identifiers before interpolating table or column names, so future consumers can adapt the queue to their own storage without importing msgvault internals.

Validation: go test ./...; go vet ./...

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
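To illustrate the identifier check described in this description, here is a minimal sketch of validating names before they are interpolated into SQL. The allow-list pattern, the function names `validateIdent` and `claimQuery`, and the `claimed_at` and `id` columns are assumptions made for illustration, not the package's actual API or schema.

```go
package main

import (
	"fmt"
	"regexp"
)

// identPattern is an assumed allow-list: an ASCII letter or underscore first,
// then letters, digits, or underscores. The real package may accept more.
var identPattern = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// validateIdent rejects any name that is not a plain SQL identifier.
func validateIdent(name string) error {
	if !identPattern.MatchString(name) {
		return fmt.Errorf("invalid SQL identifier %q", name)
	}
	return nil
}

// claimQuery (hypothetical) builds a statement only after every interpolated
// name has passed validation. Values still travel as bound parameters.
func claimQuery(table, claimedAtCol, idCol string) (string, error) {
	for _, name := range []string{table, claimedAtCol, idCol} {
		if err := validateIdent(name); err != nil {
			return "", err
		}
	}
	return fmt.Sprintf("UPDATE %s SET %s = ? WHERE %s = ? AND %s IS NULL",
		table, claimedAtCol, idCol, claimedAtCol), nil
}

func main() {
	q, err := claimQuery("pending_embeddings", "claimed_at", "id")
	fmt.Println(q, err)

	// A name carrying extra SQL is rejected before any string is built.
	_, err = claimQuery("pending_embeddings; DROP TABLE messages", "claimed_at", "id")
	fmt.Println(err)
}
```

Placeholders cannot stand in for table or column names, so a caller-configurable schema has to interpolate them; validating against a strict pattern first is one way to keep that safe.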
roborev: Combined Review (
how does this relate to kenn-io/msgvault#411?
doesn't yet? i guess we can reframe it based on that?
yeah, let me take a closer look at that and see if it makes sense to merge, and then we can reassess. I agree it doesn't make sense for every project to have its own vector embedding pipeline management system
The initial extraction lifted msgvault's pending_embeddings claim queue, but that is storage and scheduling policy a consumer owns, and msgvault is already moving off it toward scan-and-fill (kenn-io/msgvault#411). Shipping it would have given kit a reusable package msgvault no longer uses.

The surface that is genuinely shared across callers (msgvault and kata both embed for search) is chunking, model-generation identity, batched encoding, and merging results across generations. This adds those as pure transforms, plus a Store[K,G] contract and Fill/Search flows that own the scan-and-fill and query-and-merge orchestration so a backend supplies only SQL. Document and generation identity stay opaque (msgvault int64, kata UUID); storage and query construction live in backend subpackages.

vector/sqlitevec is the reference backend on the same sqlite-vec binding msgvault uses, with per-generation vec0 tables so a model migration across differing dimensions still serves a union of live generations.

This also fixes the go.mod tidy drift that was failing Go hygiene.

Validation: go build/vet/test ./... with CGO; vec0 exercised hermetically via the bundled sqlite-vec extension.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
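The "merging results across generations" step can be pictured with a small pure transform. This is only a sketch: the `Hit` type, the `mergeHits` function, and the keep-the-closest-hit-per-document rule are assumptions for illustration, not the package's actual types or merge policy, and it assumes distances from different generations are comparable.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical search result: an opaque document key plus a distance.
type Hit[K comparable] struct {
	Key      K
	Distance float64
}

// mergeHits unions per-generation results, keeps the closest hit for each
// document key, and returns at most limit hits in ascending distance order.
// A negative limit means "no limit".
func mergeHits[K comparable](limit int, perGeneration ...[]Hit[K]) []Hit[K] {
	best := make(map[K]Hit[K])
	for _, hits := range perGeneration {
		for _, h := range hits {
			if cur, ok := best[h.Key]; !ok || h.Distance < cur.Distance {
				best[h.Key] = h
			}
		}
	}
	merged := make([]Hit[K], 0, len(best))
	for _, h := range best {
		merged = append(merged, h)
	}
	sort.Slice(merged, func(i, j int) bool {
		return merged[i].Distance < merged[j].Distance
	})
	if limit >= 0 && len(merged) > limit {
		merged = merged[:limit]
	}
	return merged
}

func main() {
	// Document 1 appears in both generations, as it might mid-migration.
	oldGen := []Hit[int64]{{Key: 1, Distance: 0.30}, {Key: 2, Distance: 0.10}}
	newGen := []Hit[int64]{{Key: 1, Distance: 0.20}, {Key: 3, Distance: 0.40}}
	fmt.Println(mergeHits(2, oldGen, newGen))
}
```

Because the key type is a type parameter, the same function serves an int64 key and a UUID-like key without the merge code knowing which one it is handling.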
roborev: Combined Review (
Two medium findings from the roborev panel on 061a0a3.

Generation.Fingerprint built its preimage by joining params as key=value lines, so a value containing the separator could hash identically to a different param set: distinct vector spaces could share a fingerprint and silently skip a needed re-embed. It now hashes the JSON encoding, which escapes values and sorts map keys, so the preimage is unambiguous.

SaveVectors stamped the source document without checking the update hit a row, so if a document was deleted between scan and save (or a caller passed a missing key) the transaction committed vector rows with no backing document, which QueryGeneration would later return as orphan hits. It now checks RowsAffected and rolls back when nothing was stamped.

Validation: go vet/test ./vector/... with CGO.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
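The fingerprint ambiguity is easy to reproduce in isolation. The sketch below assumes a newline separator and SHA-256; the function names `lineFingerprint` and `jsonFingerprint` are hypothetical and are not the package's code. It shows two different param sets producing the same key=value preimage, and the JSON encoding telling them apart.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// lineFingerprint mimics the ambiguous scheme: sorted key=value lines joined
// with newlines (separator and hash function are assumptions).
func lineFingerprint(params map[string]string) string {
	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		lines = append(lines, k+"="+params[k])
	}
	sum := sha256.Sum256([]byte(strings.Join(lines, "\n")))
	return hex.EncodeToString(sum[:])
}

// jsonFingerprint hashes the JSON encoding instead. encoding/json escapes
// string values and emits map keys in sorted order.
func jsonFingerprint(params map[string]string) (string, error) {
	b, err := json.Marshal(params)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	// One param whose value smuggles in a separator, versus two params.
	a := map[string]string{"a": "1\nb=2"}
	b := map[string]string{"a": "1", "b": "2"}

	fmt.Println("line scheme collides:", lineFingerprint(a) == lineFingerprint(b))

	ja, _ := jsonFingerprint(a)
	jb, _ := jsonFingerprint(b)
	fmt.Println("json scheme collides:", ja == jb)
}
```

Both maps flatten to the same two lines under the join, while their JSON encodings differ because the embedded newline is escaped inside a quoted string.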
roborev: Combined Review (
The comment by roborev is mostly worthless for the last one since that's more or less just an example
The fingerprint is persisted by callers and compared to decide whether to re-embed, so its stability has to survive future edits to Generation. The prior version hand-listed the fields to hash, which has a dangerous failure mode: a field added later would be silently excluded, letting two distinct vector spaces share a fingerprint and skip a needed re-embed.

Fingerprint now encodes the struct itself, so any field added later participates automatically, then re-encodes through a generic value. encoding/json sorts object keys at every level, so neither struct field order nor map insertion order affects the hash; UseNumber preserves numeric precision; and omitempty keeps an unused new field from shifting existing fingerprints.

A pinned canonical-encoding test and a reflection tripwire on the field set make any change to the type or encoding fail CI rather than silently re-fingerprint every stored vector.

Validation: go vet/test ./vector/... with CGO.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
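The encode, decode-to-generic, re-encode sequence described in this commit can be sketched as follows. The `generation` struct, its fields, and the use of SHA-256 are stand-ins chosen for illustration; the real Generation type and its hash may differ.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// generation is a hypothetical stand-in for the package's Generation type.
// omitempty means a zero-valued field contributes nothing to the encoding.
type generation struct {
	Model      string         `json:"model,omitempty"`
	Dimensions int            `json:"dimensions,omitempty"`
	Params     map[string]any `json:"params,omitempty"`
}

// fingerprint encodes the whole struct, decodes it into a generic value with
// UseNumber, and re-encodes it so object keys are sorted at every level.
func fingerprint(g generation) (string, error) {
	first, err := json.Marshal(g)
	if err != nil {
		return "", err
	}
	dec := json.NewDecoder(bytes.NewReader(first))
	dec.UseNumber() // keep numbers as json.Number rather than float64
	var generic any
	if err := dec.Decode(&generic); err != nil {
		return "", err
	}
	canonical, err := json.Marshal(generic)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a := generation{Model: "m", Dimensions: 768, Params: map[string]any{"x": 1, "y": "z"}}
	b := generation{Model: "m", Dimensions: 768, Params: map[string]any{"y": "z", "x": 1}}
	fa, _ := fingerprint(a)
	fb, _ := fingerprint(b)
	fmt.Println("same fingerprint regardless of map insertion order:", fa == fb)
}
```

Since the struct itself is marshalled, a field added to `generation` later would enter the preimage without anyone updating a hand-written field list, and `omitempty` leaves the encoding unchanged while that field holds its zero value.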
roborev: Combined Review (
Codex posting on behalf of Wes.

I like the direction this PR landed on after the pivot away from extracting the old

The reusable pieces here look like the right layer for kit:
For msgvault specifically, I would not try to consume this immediately before merging kenn-io/msgvault#411. Msgvault still has extra production semantics around mutable message CAS, watermark/backstop recovery, skip/delete handling, lifecycle gates, stats, filters, and hybrid FTS/vector search. But those feel like product/backend-specific layers above this package, not arguments against the extraction.

One caveat for future evolution: the current

Overall: this seems like a good foundation for reusable vector infrastructure, and msgvault#411 is useful context for why scan-and-fill/generation merge is the right shared direction rather than a reusable pending queue.
Vector encoding queues were previously only available inside msgvault, which made the crash-safe claim/release/complete mechanics hard to reuse in other Kenn tools. This PR moves that SQL-backed task queue into kit as an app-neutral helper while keeping schema ownership with callers.
The queue validates configured table and column identifiers before interpolating SQL, supports transactional bulk enqueue for callers that need to select groups and insert work atomically, and preserves msgvault's existing pending_embeddings table shape as the default schema.
Validation: go test ./...; go vet ./...
generated by a clanker