Extract vector encoding queue by mariusvniekerk · Pull Request #25 · kenn-io/kit

mariusvniekerk · 2026-06-24T19:23:35Z

Vector encoding queues were previously only available inside msgvault, which made the crash-safe claim/release/complete mechanics hard to reuse in other Kenn tools. This PR moves that SQL-backed task queue into kit as an app-neutral helper while keeping schema ownership with callers.

The queue validates configured table and column identifiers before interpolating SQL, supports transactional bulk enqueue for callers that need to select groups and insert work atomically, and preserves msgvault's existing pending_embeddings table shape as the default schema.
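For readers unfamiliar with the pattern, here is a minimal sketch of validating identifiers before interpolating them into SQL. The regular expression, the helper names, and the `claimed_at` column are assumptions made for illustration; they are not taken from the kit package.

```go
package main

import (
	"fmt"
	"regexp"
)

// identRe is an assumed, deliberately strict rule: ASCII letters, digits and
// underscores, not starting with a digit. The real package may accept more.
var identRe = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// validateIdent rejects anything that is not a plain identifier.
func validateIdent(name string) error {
	if !identRe.MatchString(name) {
		return fmt.Errorf("invalid SQL identifier %q", name)
	}
	return nil
}

// countUnclaimedSQL is a hypothetical query builder. Table and column names
// cannot be bound as parameters, so they are validated and then interpolated.
func countUnclaimedSQL(table, column string) (string, error) {
	for _, id := range []string{table, column} {
		if err := validateIdent(id); err != nil {
			return "", err
		}
	}
	return fmt.Sprintf(`SELECT COUNT(*) FROM "%s" WHERE "%s" IS NULL`, table, column), nil
}

func main() {
	q, err := countUnclaimedSQL("pending_embeddings", "claimed_at")
	fmt.Println(q, err)

	// An injection attempt is rejected before any SQL is built.
	_, err = countUnclaimedSQL(`x"; DROP TABLE users; --`, "claimed_at")
	fmt.Println(err)
}
```

Values still go through bound parameters as usual; only names that SQL cannot parameterize take this path.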

Validation: go test ./...; go vet ./...

generated by a clanker

Vector embedding queues were only available inside msgvault, which made other Kenn tools copy or reimplement the same crash-safe claim/release mechanics. Moving the SQL-backed task queue into kit gives callers a shared, app-neutral helper while preserving msgvault's existing table shape.

The package keeps schema ownership with callers and validates SQL identifiers before interpolating table or column names, so future consumers can adapt the queue to their own storage without importing msgvault internals.

Validation: go test ./...; go vet ./...

Generated with Codex
Co-authored-by: Codex <codex@openai.com>

roborev-ci · 2026-06-24T19:27:19Z

roborev: Combined Review (`e07897b`)

No issues found.

Panel: ci_default_security | Synthesis: codex | Members: codex_default (codex/default, done, 2m25s), codex_security (codex/security, done, 1m25s) | Total: 3m50s

wesm · 2026-06-24T20:39:25Z

how does this relate to kenn-io/msgvault#411?

mariusvniekerk · 2026-06-24T21:28:08Z

doesn't yet? i guess we can reframe it based on that?

wesm · 2026-06-24T23:07:12Z

yeah, let me take a closer look at that and see if it makes sense to merge and then we can reassess. I agree it doesn't make sense for every project to have its own vector embedding pipeline management system

The initial extraction lifted msgvault's pending_embeddings claim queue, but that is storage and scheduling policy a consumer owns, and msgvault is already moving off it toward scan-and-fill (kenn-io/msgvault#411). Shipping it would have given kit a reusable package msgvault no longer uses.

The surface that is genuinely shared across callers (msgvault and kata both embed for search) is chunking, model-generation identity, batched encoding, and merging results across generations. This adds those as pure transforms, plus a Store[K,G] contract and Fill/Search flows that own the scan-and-fill and query-and-merge orchestration so a backend supplies only SQL. Document and generation identity stay opaque (msgvault int64, kata UUID); storage and query construction live in backend subpackages.

vector/sqlitevec is the reference backend on the same sqlite-vec binding msgvault uses, with per-generation vec0 tables so a model migration across differing dimensions still serves a union of live generations.

This also fixes the go.mod tidy drift that was failing Go hygiene.

Validation: go build/vet/test ./... with CGO; vec0 exercised hermetically via the bundled sqlite-vec extension.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

roborev-ci · 2026-06-25T05:11:12Z

roborev: Combined Review (`061a0a3`)

Medium-risk issues remain in the vector toolkit changes.

Medium

vector/sqlitevec/store.go:93 - SaveVectors ignores RowsAffected when stamping the source document. If the document was deleted between scan and save, or a caller passes a missing key, the transaction can commit orphan vector rows that QueryGeneration may later return. Check the update result and roll back when no source row was stamped.
vector/generation.go:42 - Fingerprint serializes params as key=value\n without escaping or length delimiters, so different configs can hash the same preimage, such as a value containing \nother=value. Use length-prefixed fields or a structured sorted encoding before hashing.
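The fingerprint collision described above can be reproduced in a few lines. This is an illustrative sketch, not the code from vector/generation.go: `naivePreimage` and `jsonFingerprint` are hypothetical names, and SHA-256 is an assumed hash choice.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// naivePreimage joins sorted params as key=value lines with no escaping,
// which is the ambiguous shape the finding describes.
func naivePreimage(params map[string]string) string {
	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + params[k] + "\n")
	}
	return b.String()
}

// jsonFingerprint hashes the JSON encoding instead. encoding/json escapes
// newlines inside string values and emits map keys in sorted order.
func jsonFingerprint(params map[string]string) (string, error) {
	raw, err := json.Marshal(params)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	one := map[string]string{"a": "1\nb=2"}      // a single param whose value embeds the separator
	two := map[string]string{"a": "1", "b": "2"} // two separate params

	fmt.Println("naive preimages equal:", naivePreimage(one) == naivePreimage(two))

	f1, _ := jsonFingerprint(one)
	f2, _ := jsonFingerprint(two)
	fmt.Println("json fingerprints equal:", f1 == f2)
}
```

Length-prefixed fields, as the review suggests, would remove the ambiguity equally well; JSON is shown here only because it needs no custom framing.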

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 4m27s), codex_security (codex/security, done, 2m2s) | Total: 6m35s

Two medium findings from the roborev panel on 061a0a3.

Generation.Fingerprint built its preimage by joining params as key=value lines, so a value containing the separator could hash identically to a different param set: distinct vector spaces could share a fingerprint and silently skip a needed re-embed. It now hashes the JSON encoding, which escapes values and sorts map keys, so the preimage is unambiguous.

SaveVectors stamped the source document without checking the update hit a row, so if a document was deleted between scan and save (or a caller passed a missing key) the transaction committed vector rows with no backing document, which QueryGeneration would later return as orphan hits. It now checks RowsAffected and rolls back when nothing was stamped.

Validation: go vet/test ./vector/... with CGO.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

roborev-ci · 2026-06-25T13:55:42Z

roborev: Combined Review (`792a47d`)

Medium-risk issue found; no Critical or High findings.

Medium

vector/sqlitevec/store.go:180 - QueryGeneration joins KNN results only to the chunk map, so vectors for documents deleted after embedding can still be returned as hits. Filter results against Schema.DocsTable on the configured id column, or add source-row deletion cleanup; add a test that embeds a document, deletes it, and verifies search no longer returns it.

Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 3m15s), codex_security (codex/security, done, 2m16s) | Total: 5m39s

mariusvniekerk · 2026-06-25T18:01:10Z

The comment by roborev is mostly worthless for the last one, since that's more or less just an example

The fingerprint is persisted by callers and compared to decide whether to re-embed, so its stability has to survive future edits to Generation. The prior version hand-listed the fields to hash, which has a dangerous failure mode: a field added later would be silently excluded, letting two distinct vector spaces share a fingerprint and skip a needed re-embed.

Fingerprint now encodes the struct itself, so any field added later participates automatically, then re-encodes through a generic value. encoding/json sorts object keys at every level, so neither struct field order nor map insertion order affects the hash; UseNumber preserves numeric precision; and omitempty keeps an unused new field from shifting existing fingerprints.

A pinned canonical-encoding test and a reflection tripwire on the field set make any change to the type or encoding fail CI rather than silently re-fingerprint every stored vector.

Validation: go vet/test ./vector/... with CGO.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
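A minimal sketch of the encode, decode-to-generic, re-encode approach described in this commit message. The `Generation` fields, the JSON tags, and SHA-256 are assumptions for illustration; the real struct in vector/generation.go may differ.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Generation is a hypothetical shape; field names and tags are assumed.
type Generation struct {
	Model     string         `json:"model"`
	Dimension int            `json:"dimension"`
	Params    map[string]any `json:"params,omitempty"`
}

// canonicalJSON encodes v, decodes it into a generic value with UseNumber so
// large integers are not rounded through float64, and encodes it again.
// The second Marshal walks map[string]any values, whose keys are emitted in
// sorted order, so struct field order does not affect the output.
func canonicalJSON(v any) ([]byte, error) {
	first, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	dec := json.NewDecoder(bytes.NewReader(first))
	dec.UseNumber()
	var generic any
	if err := dec.Decode(&generic); err != nil {
		return nil, err
	}
	return json.Marshal(generic)
}

// fingerprint hashes the canonical encoding.
func fingerprint(g Generation) (string, error) {
	raw, err := canonicalJSON(g)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	g := Generation{Model: "example-model", Dimension: 768}
	raw, err := canonicalJSON(g)
	fmt.Println(string(raw), err) // keys come out sorted; empty Params is omitted

	fp, err := fingerprint(g)
	fmt.Println(fp, err)
}
```

Because the struct is encoded as a whole, a field added later changes the fingerprint only when it is set to a non-empty value, which is the omitempty behavior the message relies on.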

roborev-ci · 2026-06-25T18:15:41Z

roborev: Combined Review (`2a9fde1`)

PR has two medium findings; no high or critical issues were reported.

Medium

vector/encode.go:83: EncodeBatched checks failed() before blocking on the semaphore. If an in-flight batch sets firstErr while the loop is waiting for capacity, the next batch can still launch after the semaphore is released, causing extra encoder/provider calls despite the documented “stops launching work at the first error” behavior.
Fix: Re-check firstErr and ctx.Err() immediately after acquiring the semaphore, release the slot, and stop dispatching if work should stop.
vector/sqlitevec/store.go:180: QueryGeneration joins KNN results only to the chunk map, not back to the caller’s documents table. If a document is deleted after vectors are saved, stale chunk rows can still return hits for a document that no longer exists.
Fix: Join the configured documents table on IDColumn when returning hits, or add cascading cleanup for deleted documents, with a regression test for deleted documents not being returned.

Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 3m15s), codex_security (codex/security, done, 15s) | Total: 3m38s

wesm · 2026-06-25T18:59:35Z

Codex posting on behalf of Wes.

I like the direction this PR landed on after the pivot away from extracting the old pending_embeddings queue. That queue is no longer the durable shared abstraction for msgvault, especially with kenn-io/msgvault#411 moving embedding work to scan-and-fill coverage.

The reusable pieces here look like the right layer for kit:

generation identity/fingerprinting for model + dimension + embedding-affecting params
chunk splitting and batched encode orchestration
a backend-neutral Store[K,G] boundary
scan-and-fill flow over that store
live-generation search fanout plus merge semantics during active/building migrations
sqlitevec as a reference/default backend for new SQLite-backed projects

For msgvault specifically, I would not try to consume this immediately before merging kenn-io/msgvault#411. Msgvault still has extra production semantics around mutable message CAS, watermark/backstop recovery, skip/delete handling, lifecycle gates, stats, filters, and hybrid FTS/vector search. But those feel like product/backend-specific layers above this package, not arguments against the extraction.

One caveat for future evolution: the current Fill contract only passes {Doc, Content} through Pending, so projects with mutable source rows need to hide content-version/CAS checks inside their store implementation. If we want this to become the common production fill loop for msgvault-like systems, we may eventually want an explicit version token or CAS-aware save shape.

Overall: this seems like a good foundation for reusable vector infrastructure, and msgvault#411 is useful context for why scan-and-fill/generation merge is the right shared direction rather than a reusable pending queue.

mariusvniekerk mentioned this pull request Jun 24, 2026

Adopt kit vector encoding queue kenn-io/msgvault#415

Open

mariusvniekerk wants to merge 4 commits into main from codex/extract-vector-encoding-queue
