Skip to content

Extract vector encoding queue#25

Open
mariusvniekerk wants to merge 4 commits into
mainfrom
codex/extract-vector-encoding-queue
Open

Extract vector encoding queue#25
mariusvniekerk wants to merge 4 commits into
mainfrom
codex/extract-vector-encoding-queue

Conversation

@mariusvniekerk

Copy link
Copy Markdown
Contributor

Vector encoding queues were previously only available inside msgvault, which made the crash-safe claim/release/complete mechanics hard to reuse in other Kenn tools. This PR moves that SQL-backed task queue into kit as an app-neutral helper while keeping schema ownership with callers.

The queue validates configured table and column identifiers before interpolating SQL, supports transactional bulk enqueue for callers that need to select groups and insert work atomically, and preserves msgvault's existing pending_embeddings table shape as the default schema.

Validation: go test ./...; go vet ./...

generated by a clanker

Vector embedding queues were only available inside msgvault, which made other Kenn tools copy or reimplement the same crash-safe claim/release mechanics. Moving the SQL-backed task queue into kit gives callers a shared, app-neutral helper while preserving msgvault's existing table shape.

The package keeps schema ownership with callers and validates SQL identifiers before interpolating table or column names, so future consumers can adapt the queue to their own storage without importing msgvault internals.

Validation: go test ./...; go vet ./...

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (e07897b)

No issues found.


Panel: ci_default_security | Synthesis: codex | Members: codex_default (codex/default, done, 2m25s), codex_security (codex/security, done, 1m25s) | Total: 3m50s

@wesm

wesm commented Jun 24, 2026

Copy link
Copy Markdown
Member

how does this relate to kenn-io/msgvault#411?

@mariusvniekerk

Copy link
Copy Markdown
Contributor Author

doesn't yet? i guess we can reframe it based on that?

@wesm

wesm commented Jun 24, 2026

Copy link
Copy Markdown
Member

yeah, let met take a closer look at that and see if it makes sense to merge and then we can reassess. I agree it doesn't make sense for every project to have its own vector embedding pipeline management system

The initial extraction lifted msgvault's pending_embeddings claim queue, but that is storage and scheduling policy a consumer owns, and msgvault is already moving off it toward scan-and-fill (kenn-io/msgvault#411). Shipping it would have given kit a reusable package msgvault no longer uses.

The surface that is genuinely shared across callers — msgvault and kata both embed for search — is chunking, model-generation identity, batched encoding, and merging results across generations. This adds those as pure transforms, plus a Store[K,G] contract and Fill/Search flows that own the scan-and-fill and query-and-merge orchestration so a backend supplies only SQL.

Document and generation identity stay opaque (msgvault int64, kata UUID); storage and query construction live in backend subpackages. vector/sqlitevec is the reference backend on the same sqlite-vec binding msgvault uses, with per-generation vec0 tables so a model migration across differing dimensions still serves a union of live generations. This also fixes the go.mod tidy drift that was failing Go hygiene.

Validation: go build/vet/test ./... with CGO; vec0 exercised hermetically via the bundled sqlite-vec extension.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (061a0a3)

Medium-risk issues remain in the vector toolkit changes.

Medium

  • vector/sqlitevec/store.go:93 - SaveVectors ignores RowsAffected when stamping the source document. If the document was deleted between scan and save, or a caller passes a missing key, the transaction can commit orphan vector rows that QueryGeneration may later return. Check the update result and roll back when no source row was stamped.

  • vector/generation.go:42 - Fingerprint serializes params as key=value\n without escaping or length delimiters, so different configs can hash the same preimage, such as a value containing \nother=value. Use length-prefixed fields or a structured sorted encoding before hashing.


Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 4m27s), codex_security (codex/security, done, 2m2s) | Total: 6m35s

Two medium findings from the roborev panel on 061a0a3. Generation.Fingerprint built its preimage by joining params as key=value lines, so a value containing the separator could hash identically to a different param set — distinct vector spaces could share a fingerprint and silently skip a needed re-embed. It now hashes the JSON encoding, which escapes values and sorts map keys, so the preimage is unambiguous.

SaveVectors stamped the source document without checking the update hit a row, so if a document was deleted between scan and save (or a caller passed a missing key) the transaction committed vector rows with no backing document, which QueryGeneration would later return as orphan hits. It now checks RowsAffected and rolls back when nothing was stamped.

Validation: go vet/test ./vector/... with CGO.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (792a47d)

Medium-risk issue found; no Critical or High findings.

Medium

  • vector/sqlitevec/store.go:180 - QueryGeneration joins KNN results only to the chunk map, so vectors for documents deleted after embedding can still be returned as hits. Filter results against Schema.DocsTable on the configured id column, or add source-row deletion cleanup; add a test that embeds a document, deletes it, and verifies search no longer returns it.

Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 3m15s), codex_security (codex/security, done, 2m16s) | Total: 5m39s

@mariusvniekerk

Copy link
Copy Markdown
Contributor Author

The comment by roborev is mostly worthless for the last one since thats more or less just an exampe

The fingerprint is persisted by callers and compared to decide whether to re-embed, so its stability has to survive future edits to Generation. The prior version hand-listed the fields to hash, which has a dangerous failure mode: a field added later would be silently excluded, letting two distinct vector spaces share a fingerprint and skip a needed re-embed.

Fingerprint now encodes the struct itself, so any field added later participates automatically, then re-encodes through a generic value. encoding/json sorts object keys at every level, so neither struct field order nor map insertion order affects the hash; UseNumber preserves numeric precision; and omitempty keeps an unused new field from shifting existing fingerprints. A pinned canonical-encoding test and a reflection tripwire on the field set make any change to the type or encoding fail CI rather than silently re-fingerprint every stored vector.

Validation: go vet/test ./vector/... with CGO.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (2a9fde1)

PR has two medium findings; no high or critical issues were reported.

Medium

  • vector/encode.go:83: EncodeBatched checks failed() before blocking on the semaphore. If an in-flight batch sets firstErr while the loop is waiting for capacity, the next batch can still launch after the semaphore is released, causing extra encoder/provider calls despite the documented “stops launching work at the first error” behavior.
    Fix: Re-check firstErr and ctx.Err() immediately after acquiring the semaphore, release the slot, and stop dispatching if work should stop.

  • vector/sqlitevec/store.go:180: QueryGeneration joins KNN results only to the chunk map, not back to the caller’s documents table. If a document is deleted after vectors are saved, stale chunk rows can still return hits for a document that no longer exists.
    Fix: Join the configured documents table on IDColumn when returning hits, or add cascading cleanup for deleted documents, with a regression test for deleted documents not being returned.


Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 3m15s), codex_security (codex/security, done, 15s) | Total: 3m38s

@wesm

wesm commented Jun 25, 2026

Copy link
Copy Markdown
Member

Codex posting on behalf of Wes.

I like the direction this PR landed on after the pivot away from extracting the old pending_embeddings queue. That queue is no longer the durable shared abstraction for msgvault, especially with kenn-io/msgvault#411 moving embedding work to scan-and-fill coverage.

The reusable pieces here look like the right layer for kit:

  • generation identity/fingerprinting for model + dimension + embedding-affecting params
  • chunk splitting and batched encode orchestration
  • a backend-neutral Store[K,G] boundary
  • scan-and-fill flow over that store
  • live-generation search fanout plus merge semantics during active/building migrations
  • sqlitevec as a reference/default backend for new SQLite-backed projects

For msgvault specifically, I would not try to consume this immediately before merging kenn-io/msgvault#411. Msgvault still has extra production semantics around mutable message CAS, watermark/backstop recovery, skip/delete handling, lifecycle gates, stats, filters, and hybrid FTS/vector search. But those feel like product/backend-specific layers above this package, not arguments against the extraction.

One caveat for future evolution: the current Fill contract only passes {Doc, Content} through Pending, so projects with mutable source rows need to hide content-version/CAS checks inside their store implementation. If we want this to become the common production fill loop for msgvault-like systems, we may eventually want an explicit version token or CAS-aware save shape.

Overall: this seems like a good foundation for reusable vector infrastructure, and msgvault#411 is useful context for why scan-and-fill/generation merge is the right shared direction rather than a reusable pending queue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants