fix: re-embed deletes the vector it just inserted (single-chunk entries vanish from recall) #136
Merged
rahilp merged 1 commit on Jun 8, 2026
For single-chunk entries the Vectorize id equals the entry id (storeEntry keys a single chunk by `id`). The "insert new -> delete old" re-embed pattern then deleted the full previous `vector_ids` set (including the id the new embedding had just reused), so the entry was left in D1 but absent from Vectorize, and therefore invisible to recall (semantic search, and even exact-term queries).

This affected POST /update, the MCP `update` tool, the large-append re-embed path, and the smart-merge / replace capture paths.

Add `deleteStaleVectors(old, new)`, which deletes only ids not reused by the new embedding, and route the four re-embed sites through it. The genuine full-deletes (forget, conflicting-entry removal) are unchanged.

Update the four tests that asserted the old full-delete behavior, and add a single-chunk id-reuse regression test. typecheck + 271 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Owner
Thanks @mikestanley00 -- great find and perfect fix! Merging this in to the branch now and will be available in main. I'll be cutting a new release later this week and this will be included in that release! Thank you so much!
Summary
For single-chunk entries, re-embedding could delete the entry's vector from Vectorize, leaving the row in D1 but unsearchable. This deletes only genuinely-stale vectors instead.
Root cause
`storeEntry` keys a single chunk by the entry id.

The re-embed paths use "insert new → delete old", deleting the full previous `vector_ids` set. For a single-chunk entry the new vector reuses the old id, so the cleanup step deletes the vector that was just inserted. The entry then exists in D1 with `vector_ids` pointing at a vector that's no longer in the index, so `recall` never returns it: semantic search, and even exact-term queries, miss it.

Affected: `POST /update`, the MCP `update` tool, the large-append re-embed, and the smart-merge / replace capture paths.

Fix
Add `deleteStaleVectors(old, new)`, which deletes only ids not reused by the new embedding, and route the four re-embed sites through it. The genuine full-deletes (forget, conflicting-entry removal) are unchanged.

Reproduce (before this PR)
POST /update), or trigger a smart-merge / replace.
`recall` no longer returns it; `wrangler vectorize get-vectors <index> --ids <entryId>` returns nothing, though D1 still lists it in `vector_ids`.

Testing
`npm run typecheck` clean; `npm run test:coverage` → 271 tests pass.
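
Illustration
The id-reuse problem and the stale-only delete can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `VectorIndexLike` interface, the `staleVectorIds` helper, the extra `index` parameter, the `reembed` simulation and the `entry-1` id are all hypothetical names chosen for the example, and an in-memory `Set` stands in for the Vectorize index.

```typescript
// Hypothetical minimal shape of a vector index; not the project's actual types.
interface VectorIndexLike {
  deleteByIds(ids: string[]): Promise<unknown>;
}

// Ids that were in the previous embedding but are not reused by the new one.
function staleVectorIds(oldIds: string[], newIds: string[]): string[] {
  const reused = new Set(newIds);
  return oldIds.filter((id) => !reused.has(id));
}

// Sketch of the fix: delete only the stale ids, and skip the call if none remain.
async function deleteStaleVectors(
  index: VectorIndexLike,
  oldIds: string[],
  newIds: string[],
): Promise<string[]> {
  const stale = staleVectorIds(oldIds, newIds);
  if (stale.length > 0) {
    await index.deleteByIds(stale);
  }
  return stale;
}

// In-memory simulation of "insert new -> delete old" (assumption: a Set of ids
// is enough to model which vectors are present in the index).
function reembed(
  store: Set<string>,
  oldIds: string[],
  newIds: string[],
  fullDelete: boolean,
): void {
  for (const id of newIds) store.add(id); // insert new
  const toDelete = fullDelete ? oldIds : staleVectorIds(oldIds, newIds);
  for (const id of toDelete) store.delete(id); // delete old
}

// Single-chunk entry: the new vector reuses the entry id.
const buggy = new Set<string>(["entry-1"]);
reembed(buggy, ["entry-1"], ["entry-1"], true);
console.log(buggy.has("entry-1")); // false: the full delete removed the vector just inserted

const fixed = new Set<string>(["entry-1"]);
reembed(fixed, ["entry-1"], ["entry-1"], false);
console.log(fixed.has("entry-1")); // true: the reused id is not treated as stale
```

When the new embedding reuses none of the old ids, `staleVectorIds` returns the whole old set, so in this sketch the stale-only delete degrades to the previous full delete for that case.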