fix: re-embed deletes the vector it just inserted (single-chunk entries vanish from recall)#136

Merged
rahilp merged 1 commit into
rahilp:main from
mikestanley00:fix/reembed-preserve-reused-vector
Jun 8, 2026

@mikestanley00

Contributor

Summary

For single-chunk entries, re-embedding could delete the entry's vector from Vectorize, leaving the row in D1 but unsearchable. This PR changes the re-embed paths to delete only genuinely stale vectors.

Root cause

storeEntry keys a single chunk by the entry id:

id: chunks.length === 1 ? id : `${id}-chunk-${i}`

The re-embed paths use "insert new → delete old", deleting the full previous vector_ids set. For a single-chunk entry the new vector reuses the old id, so the cleanup step deletes the vector that was just inserted. The entry then exists in D1 with vector_ids pointing at a vector that is no longer in the index, so recall never returns it: both semantic search and exact-term queries miss it.
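The failure can be reproduced without any Cloudflare services. The sketch below is illustrative only: `fakeIndex`, `vectorIdsFor`, and `reembedBuggy` are hypothetical names, and a plain `Map` stands in for the Vectorize index. Only the id-keying rule is taken from the snippet above.

```typescript
// Hypothetical in-memory stand-in for the vector index (not the Vectorize API).
const fakeIndex = new Map<string, number[]>();

// Mirrors the keying rule quoted above: a single chunk reuses the entry id.
function vectorIdsFor(id: string, chunkCount: number): string[] {
  return Array.from({ length: chunkCount }, (_, i) =>
    chunkCount === 1 ? id : `${id}-chunk-${i}`,
  );
}

// The buggy pattern: insert the new vectors, then delete every old id.
function reembedBuggy(id: string, oldIds: string[], chunkCount: number): string[] {
  const newIds = vectorIdsFor(id, chunkCount);
  for (const vid of newIds) fakeIndex.set(vid, [0.1, 0.2]); // insert new
  for (const vid of oldIds) fakeIndex.delete(vid); // delete old, including any reused id
  return newIds;
}

// A single-chunk entry is stored, then re-embedded as a single chunk again.
fakeIndex.set("entry-1", [0.9, 0.8]);
const idsAfterReembed = reembedBuggy("entry-1", ["entry-1"], 1);

// The entry still records "entry-1" as its vector id, but the index no longer holds it.
console.log(idsAfterReembed, fakeIndex.has("entry-1"));
```

The insert and the delete target the same key, so the order of operations guarantees the vector is gone once the re-embed finishes.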

Affected: POST /update, the MCP update tool, the large-append re-embed, and the smart-merge / replace capture paths.

Fix

Add deleteStaleVectors(old, new) which deletes only ids not reused by the new embedding, and route the four re-embed sites through it. The genuine full-deletes (forget, conflicting-entry removal) are unchanged.
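A minimal sketch of what such a helper might look like, not the code merged in this PR. The `VectorIndexLike` interface, the extra `index` parameter, the parameter names, and the `staleVectorIds` helper are assumptions made for illustration; the PR text only gives the name `deleteStaleVectors(old, new)` and its behavior.

```typescript
// Assumed minimal shape of the index client; only a delete-by-ids call is needed here.
interface VectorIndexLike {
  deleteByIds(ids: string[]): Promise<unknown>;
}

// Ids that existed before the re-embed and were not reused by the new embedding.
function staleVectorIds(oldIds: string[], newIds: string[]): string[] {
  const reused = new Set(newIds);
  return oldIds.filter((id) => !reused.has(id));
}

// Deletes only the stale vectors and skips the call when nothing is stale.
async function deleteStaleVectors(
  index: VectorIndexLike,
  oldIds: string[],
  newIds: string[],
): Promise<string[]> {
  const stale = staleVectorIds(oldIds, newIds);
  if (stale.length > 0) {
    await index.deleteByIds(stale);
  }
  return stale;
}
```

For a single-chunk entry re-embedded as a single chunk, `staleVectorIds(["e"], ["e"])` is empty, so nothing is deleted. When an entry shrinks from several chunks to one, the old `e-chunk-N` ids are not reused and are all removed.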

Reproduce (before this PR)

  1. Capture a short (single-chunk) memory.
  2. Update it (POST /update), or trigger a smart-merge / replace.
  3. recall no longer returns it; wrangler vectorize get-vectors <index> --ids <entryId> returns nothing, though D1 still lists it in vector_ids.
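The check in step 3 amounts to comparing the ids recorded in `vector_ids` with the ids the index actually returns. A small sketch of that comparison follows; `findOrphanedIds` is a hypothetical helper, and fetching the two id lists from D1 and Vectorize is left out.

```typescript
// Returns the ids recorded for an entry that the index no longer holds.
function findOrphanedIds(recordedIds: string[], presentIds: Iterable<string>): string[] {
  const present = new Set(presentIds);
  return recordedIds.filter((id) => !present.has(id));
}

// Before this PR: D1 still lists the entry id, but the index returns nothing for it.
const recordedInD1 = ["entry-1"];
const returnedByIndex: string[] = [];
const orphaned = findOrphanedIds(recordedInD1, returnedByIndex);
console.log(orphaned); // lists "entry-1" as recorded but missing from the index
```

Any non-empty result means the entry has a D1 row whose vectors cannot be found by recall.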

Testing

  • Updated the four tests that asserted the old full-delete behavior.
  • Added a single-chunk id-reuse regression test.
  • npm run typecheck clean; npm run test:coverage → 271 tests pass.

For single-chunk entries the Vectorize id equals the entry id (storeEntry
keys a single chunk by `id`). The "insert new -> delete old" re-embed
pattern then deleted the full previous `vector_ids` set — including the id
the new embedding had just reused — so the entry was left in D1 but absent
from Vectorize, and therefore invisible to recall (semantic search, and even
exact-term queries).

This affected POST /update, the MCP `update` tool, the large-append re-embed
path, and the smart-merge / replace capture paths.

Add `deleteStaleVectors(old, new)` which deletes only ids not reused by the
new embedding, and route the four re-embed sites through it. The genuine
full-deletes (forget, conflicting-entry removal) are unchanged.

Update the four tests that asserted the old full-delete behavior, and add a
single-chunk id-reuse regression test. typecheck + 271 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rahilp

rahilp commented Jun 8, 2026

Owner

Thanks @mikestanley00 -- great find and perfect fix! Merging this into the branch now, and it will be available in main. I'll be cutting a new release later this week, and this fix will be included in it. Thank you so much!

@rahilp rahilp merged commit e7eefc6 into rahilp:main Jun 8, 2026
1 check passed
@rahilp rahilp linked an issue Jun 8, 2026 that may be closed by this pull request
@mikestanley00 mikestanley00 deleted the fix/reembed-preserve-reused-vector branch June 8, 2026 17:40
Linked issue: Single-chunk memories disappear from recall after update/merge