Skip to content

feat(vector): support scoped embedding builds#413

Open
danshapiro wants to merge 4 commits into
kenn-io:mainfrom
danshapiro:codex/scoped-embedding-refresh-pr
Open

feat(vector): support scoped embedding builds#413
danshapiro wants to merge 4 commits into
kenn-io:mainfrom
danshapiro:codex/scoped-embedding-refresh-pr

Conversation

@danshapiro

Copy link
Copy Markdown
Contributor

Depends on #412. This branch is stacked on the message_type filter PR until that lands.

  • Adds vector embedding build scope configuration for selected message types.
  • Applies the scope when seeding pending work for SQLite and PostgreSQL vector backends.
  • Enqueues embeddings for messages imported by Synctech SMS local/Drive syncs.

@roborev-ci

roborev-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown

roborev: Combined Review (f3c8d00)

Medium confidence concerns remain; one security review found no exploit issues, but default review identified functional regressions.

Medium

  • internal/query/sqlite.go:1395
    The new message_type operator is parsed, but the CLI/MCP FTS search path uses query.Engine.Search, whose buildSearchQueryParts never applies q.MessageTypes. A filter-only query like message_type:sms can become a non-empty query that returns all live messages, and message_type:sms lunch searches across all message types.
    Fix: Add a m.message_type IN (...) predicate to buildSearchQueryParts and cover query.Engine.Search with a message_type regression test.

  • internal/vector/pgvector/filter.go:31
    vector.Filter.MessageTypes is populated for vector/hybrid searches, but the pgvector filter builder ignores it. PostgreSQL vector/hybrid searches with message_type can return unscoped results, and scoped hybrid searches can include out-of-scope FTS hits after scope validation passes.
    Fix: Add a PostgreSQL message type predicate to buildPGFilterClauses, such as m.message_type = ANY($N::text[]), and add pgvector Search/FusedSearch tests.

  • internal/synctechsms/importer.go:269
    Synctech imports now fail when UpsertFTS returns an error, even though FTS indexing was previously best-effort and the message data has already been written. This can mark a Drive item/sync failed and skip embedding enqueueing for already-upserted messages.
    Fix: Restore best-effort FTS handling: log the error and still return the message ID successfully.


Panel: ci_default_security | Synthesis: codex, 11s | Members: codex_default (codex/default, done, 5m45s), codex_security (codex/security, done, 2m2s) | Total: 7m58s

@wesm

wesm commented Jun 25, 2026

Copy link
Copy Markdown
Member

looking at this

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (7fce07b)

Summary verdict: Medium-risk correctness gaps remain around message-type scoping and vector enqueue behavior; no security findings were reported.

Medium

  • Location: internal/query/duckdb.go:2445
    Problem: DuckDB fast search receives search.Query.MessageTypes, but still forces emailOnlyFilterMsg and does not apply q.MessageTypes. With Parquet/DuckDB enabled, queries like message_type:sms lunch or message_type:sms can return email results.
    Fix: Add MessageTypes handling to DuckDB search condition builders and suppress or replace emailOnlyFilterMsg when an explicit message-type filter is present.

  • Location: internal/vector/embed/enqueue.go:261
    Problem: Scoped enqueue filtering builds one IN placeholder per message ID before existing insert chunking. Large Synctech imports or repair batches can exceed SQLite/PostgreSQL bind limits and fail to enqueue.
    Fix: Chunk filterMessageIDs with a safe maximum, or use a temp/json-table approach, then pass filtered chunks into the existing enqueue chunking.

  • Location: cmd/msgvault/cmd/add_synctech_sms_drive.go:144
    Problem: A vector enqueue failure after a successful Synctech import makes the whole scheduled job return an error and skips cache rebuild, even though import/checkpoint/source-item updates have already committed.
    Fix: Treat Synctech vector enqueue failures like Gmail/IMAP enqueue failures: log a warning and continue to cache rebuild.

  • Location: internal/mcp/handlers.go:532
    Problem: find_similar_messages bypasses the new scoped-index guard by calling ActiveGeneration and backend.Search directly without a possible message_type filter, so a scoped vector index can still be used as if it covered the whole archive.
    Fix: Resolve the active generation against the configured fingerprint and either reject non-empty build scopes here or derive/apply an allowed seed message_type filter before searching.


Panel: ci_default_security | Synthesis: codex, 12s | Members: codex_default (codex/default, done, 5m57s), codex_security (codex/security, done, 3m6s) | Total: 9m15s

@wesm wesm marked this pull request as ready for review June 25, 2026 03:52
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (d1efa2d)

Medium-risk issues remain before merge.

Medium

  • internal/remote/engine.go:805buildSearchQueryString omits q.MessageTypes, so remote Search, SearchFast, and count paths silently drop message_type: filters reconstructed from search.Query. A message-type-only query can become empty or unscoped.
    Fix: Append each q.MessageTypes value as message_type:<value> and add remote coverage for message-type filters.

  • cmd/msgvault/cmd/add_synctech_sms_drive.go:113 — The manual sync-synctech-sms path still calls the wrapper that passes a nil enqueuer, so imported Synctech messages are only queued for embeddings from the daemon schedule path, not from the documented one-shot sync command.
    Fix: Set up vector features in the command path, pass vf.Enqueuer to runConfiguredSynctechSMSSourceWithStoreAndEnqueuer, and close the vector features like the Gmail sync commands do.


Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 7m28s), codex_security (codex/security, done, 2m13s) | Total: 9m48s

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (a432f4e)

Summary verdict: Two medium correctness issues remain; no security findings were reported.

Medium

  • Location: internal/vector/embed/enqueue.go:129
    Problem: The scoped enqueuer filters message IDs using the current config scope before selecting non-retired generations, so every non-retired generation is maintained with the current scope. If an existing unscoped generation is stale while [vector.embed.scope] is enabled, out-of-scope imports are never enqueued into it; if the user later removes the scope, that generation’s fingerprint matches again but its index is incomplete.
    Fix: Make enqueue filtering generation-specific, or prevent generations built under a different scope from being reused without a rebuild.

  • Location: internal/query/sqlite.go:1041
    Problem: SQLite fast-search stats still force emailOnlyFilterM even when SearchQuery contains message_type:sms/mms. The messages and count can be scoped correctly, but the returned stats are email-only or zero for non-email searches.
    Fix: Parse the stats search query once and skip the email-only predicate when MessageTypes is present, matching the search result predicate.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 8m58s), codex_security (codex/security, done, 3m57s) | Total: 13m4s

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (9aea0b6)

Medium findings:

  • cmd/msgvault/cmd/add_synctech_sms_drive.go:113 — Manual sync-synctech-sms still routes through runConfiguredSynctechSMSSourceWithStore, which passes a nil enqueuer. As a result, vector-enabled manual Synctech imports are not queued for embedding; only the scheduled serve path wires vf.Enqueuer. Fix by setting up vector features in the manual command path, mirroring Gmail sync, and pass vf.Enqueuer into runConfiguredSynctechSMSSourceWithStoreAndEnqueuer.

  • cmd/msgvault/cmd/add_synctech_sms_drive.go:252 — When importOneDriveBackup returns both a partial ImportSummary and an error, the caller returns before merging fileSummary. Messages already persisted from the failed file are not included in summary.MessageIDs and are not enqueued. Fix by accumulating fileSummary into summary before checking err, then return the merged summary with the error.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 10m1s), codex_security (codex/security, done, 2m51s) | Total: 13m1s

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (d08f444)

Medium issue remains: Synctech imports can enqueue already-embedded historical messages for re-embedding.

Medium

  • internal/synctechsms/importer.go:105
    ImportSummary.MessageIDs is populated for every successful UpsertMessage, including records that already existed. The new enqueue path consumes that list unconditionally, so retrying a failed Drive backup, re-running a local import, or importing a new full backup can requeue already-embedded historical messages and force expensive re-embedding.
    Fix: Only append IDs that were newly inserted or whose embeddable content changed, or have the enqueue path skip messages already embedded for the target generation.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 13m11s), codex_security (codex/security, done, 3m24s) | Total: 16m41s

@wesm

wesm commented Jun 25, 2026

Copy link
Copy Markdown
Member

I think I'll have to wait to rebase this after #411 lands

Add parsed message_type filters through the search, vector, remote, and MCP paths so scoped SMS/MMS/email queries do not silently widen after parsing. This keeps result rows, stats, DuckDB fast paths, SQLite FTS, pgvector filters, hybrid search, and remote query reconstruction aligned around the same user-visible operator.

The embedding scope work also needs durable enqueue behavior across Synctech imports and generation transitions. Preserve best-effort import semantics, enqueue already-persisted Synctech messages even after partial failures, wire the manual Synctech sync path into vector enqueueing, and filter pending embeddings per generation scope so full-corpus and scoped generations remain independently complete.

Included follow-up fixes:
- feat(vector): support scoped embedding builds
- fix(vector): enqueue synctech sms imports
- fix(search): honor message_type scopes
- fix(vector): close scoped search gaps
- fix(mcp): validate similar index before seed load
- fix(mcp): preserve no-active similar error
- fix(vector): align pg parity sqlite fixture
- fix(search): keep scoped stats consistent
- fix(search): complete scoped stats coverage
- fix(remote): preserve message type searches
- fix(vector): preserve scoped error contracts
- fix(vector): enqueue manual synctech syncs
- fix(vector): enqueue per generation scope

Generated with Codex
Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
@wesm wesm force-pushed the codex/scoped-embedding-refresh-pr branch from d08f444 to f1983c5 Compare June 25, 2026 21:55
The shared pgvector test schema already includes messages.message_type. Re-adding that column in individual tests makes the setup non-idempotent on PostgreSQL and can fail before the filters are exercised.

Remove the redundant ALTER TABLE statements so the tests rely on the common fixture schema and only seed message_type values needed by each case.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (dda21d4)

Summary verdict: one medium issue remains; no critical or high findings were reported.

Medium

  • cmd/msgvault/cmd/embeddings_manage.go:67: Scoped embedding builds are not honored by the management coverage path. embeddings activate still uses unscoped CoverageCounts, and openEmbeddingsBackend opens pgvector/sqlitevec without BuildScope, so a completed scoped generation can be reported as incomplete or fail activation because out-of-scope live messages are unstamped.
    • Fix: Use cfg.Vector.Embed.Scope.BuildScope() in fillCoverage/fillFullCoverage via CoverageCountsScoped, and pass the same BuildScope into both backend opens in openEmbeddingsBackend.

Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 6m35s), codex_security (codex/security, done, 2m39s) | Total: 9m22s

wesm and others added 2 commits June 25, 2026 18:50
Scoped embedding builds only stamp messages that match the configured build scope. Management commands were still evaluating activation coverage and backend coverage gates against the full live corpus, so a valid scoped generation could look incomplete whenever out-of-scope messages were unstamped.

Thread the configured build scope through management backend opens and coverage reads so list/activate/retire evaluate the same message universe as the build worker. The regression seeds an out-of-scope missing email beside an in-scope stamped SMS to lock the activation preflight to scoped coverage.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Scoped embedding coverage must keep all four displayed legs in the same message universe. After management switched live/stamped/missing to the configured build scope, the backend embedded count could still include out-of-scope stamped vectors, producing impossible list output such as embedded > live.

Apply the backend build scope inside EmbeddedMessageCount for sqlitevec and pgvector so the embedded leg matches scoped coverage reads. Add a command-level fillFullCoverage regression with an out-of-scope stamped vector, plus backend invariant tests for both storage engines.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

roborev: Combined Review (4715597)

No Medium, High, or Critical findings were reported.

All review outputs are clean at the requested severity threshold.


Panel: ci_default_security | Synthesis: codex, 5s | Members: codex_default (codex/default, done, 8m23s), codex_security (codex/security, done, 3m53s) | Total: 12m21s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants