perf(index): reduce incremental delete from O(index_size) to O(IVF_size) (rebase of #135) #138

Closed
raphaelsty wants to merge 2 commits into main from perf/incremental-delete-ivf

@raphaelsty

Collaborator

Summary

Rebase of #135 (by @vlasky) onto current main (post #137). No conflicts — the FTS5-in-delete code #135 builds on came from #132 (already in 1.5.5), so it merged cleanly.

The change reduces incremental-delete cost from O(index_size) to ~O(IVF_size):

  • In-place IVF patch (filter deleted IDs, renumber survivors via binary search) instead of a full rebuild that re-read every chunk's codes.
  • Range-based SQLite re-sequencing (UPDATE ... SET _subset_ = _subset_ - shift) instead of copying the whole METADATA table through a temp table.
  • MmapIndex::update_append to skip merged-file regeneration during incremental updates (deferred to the next search, which regenerates lazily).
  • colgrep skips the intermediate FTS5 rebuild on deletes that are immediately followed by re-encoding (delete_files_from_index_no_fts_rebuild).
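The in-place IVF patch from the first bullet can be sketched as follows. This is a minimal illustration, not the code in delete.rs: the flattened `ivf` array sliced by `ivf_lengths`, the `i64` ID type, and the function name `patch_ivf` are assumptions made for the example. It relies on one fact: for a surviving ID, the insertion position returned by a failed binary search over the sorted deleted IDs equals the number of deleted IDs below it, which is exactly how far that ID must shift down.

```rust
/// Patch a flattened IVF without re-reading any chunk codes.
///
/// Assumed layout (hypothetical): `ivf` stores doc IDs bucket after bucket,
/// and `ivf_lengths[c]` is the number of entries in centroid `c`'s bucket.
/// `deleted` must be sorted ascending and deduplicated.
fn patch_ivf(ivf: &[i64], ivf_lengths: &[usize], deleted: &[i64]) -> (Vec<i64>, Vec<usize>) {
    let mut new_ivf = Vec::with_capacity(ivf.len());
    let mut new_lengths = Vec::with_capacity(ivf_lengths.len());
    let mut offset = 0;

    for &len in ivf_lengths {
        let bucket = &ivf[offset..offset + len];
        let before = new_ivf.len();

        for &doc in bucket {
            match deleted.binary_search(&doc) {
                // The doc was deleted: drop it from this bucket.
                Ok(_) => {}
                // `pos` counts the deleted IDs smaller than `doc`,
                // so the survivor moves down by exactly that amount.
                Err(pos) => new_ivf.push(doc - pos as i64),
            }
        }

        new_lengths.push(new_ivf.len() - before);
        offset += len;
    }

    (new_ivf, new_lengths)
}
```

With buckets `[0, 2, 5]`, `[1, 2]`, `[3, 4, 5]` and deleted IDs `[2, 4]`, the patched buckets are `[0, 3]`, `[1]`, `[2, 3]`. The work is proportional to the IVF size times the log of the number of deleted IDs, and because the renumbering is monotone, a bucket that was sorted and free of duplicates stays that way.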

Status / review notes (see PR discussion)

  • Validated end-to-end: keyword (-e) search returns correct results after incremental edits/deletes (deleted content not surfaced, survivors resolve correctly).
  • Open considerations being measured/assessed: raw-FTS rows are left stale until a delete-only update rebuilds them; range re-sequencing assumes in-range delete IDs; IVF patch preserves (does not dedup) existing per-centroid entries.

Co-authored with @vlasky (#135).

Vlad Lasky and others added 2 commits June 17, 2026 16:11
Three optimizations cut incremental update time from 60s to ~3s on large
indices (~90K docs, 12GB vector data, 5GB metadata):

1. IVF patch instead of full rebuild (delete.rs): Load existing IVF, filter
   deleted doc IDs and renumber survivors via binary search, write back.
   Avoids re-reading all chunk code files and BTreeMap construction.

2. In-place PK re-sequencing (filtering.rs): Replace the temp-table-copy
   approach (which copied all TEXT/BLOB data 3x) with targeted UPDATE
   statements that only modify the integer _subset_ column for affected
   ranges.

3. Skip merged file regeneration (index.rs + colgrep): Add
   MmapIndex::update_append that writes new chunks and merges the IVF
   without loading the full MmapIndex (which triggers merged mmap file
   generation). Merged files regenerate lazily on next search.

Also skips FTS5 rebuild on the delete hot path when encoding follows
immediately (changed files), deferring to the post-encoding rebuild.
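To illustrate the range-based re-sequencing in item 2, here is a hedged sketch of how the shift for each surviving range could be derived from the deleted IDs. The `ShiftRange` type and the `shift_ranges` and `to_sql` helpers are hypothetical names for this example, and the sketch assumes IDs are dense in `0..total`, the deleted rows have already been removed, and the deleted IDs are sorted, deduplicated, and in range.

```rust
/// One targeted UPDATE: rows with `lo <= _subset_ <= hi` move down by `shift`.
#[derive(Debug, PartialEq)]
struct ShiftRange {
    lo: i64,
    hi: i64,
    shift: i64,
}

/// Compute the surviving ranges between consecutive deleted IDs.
/// `deleted` must be sorted ascending, deduplicated, and within `0..total`.
fn shift_ranges(deleted: &[i64], total: i64) -> Vec<ShiftRange> {
    let mut out = Vec::new();
    for (i, &d) in deleted.iter().enumerate() {
        let lo = d + 1;
        // The range ends just before the next deleted ID, or at the last row.
        let hi = deleted.get(i + 1).map_or(total - 1, |&next| next - 1);
        if lo <= hi {
            // Every row in this range has exactly `i + 1` deleted IDs below it.
            out.push(ShiftRange { lo, hi, shift: i as i64 + 1 });
        }
    }
    out
}

/// Render a range as SQL text for illustration only; real code would use
/// bound parameters through its SQLite driver.
fn to_sql(r: &ShiftRange) -> String {
    format!(
        "UPDATE METADATA SET _subset_ = _subset_ - {} WHERE _subset_ BETWEEN {} AND {}",
        r.shift, r.lo, r.hi
    )
}
```

For deleted IDs `[2, 3, 7]` in a 10-row table this yields two statements: rows 4 to 6 shift down by 2 and rows 8 to 9 shift down by 3. Only the integer column is rewritten, so no TEXT/BLOB payload is copied. Applying the ranges in ascending order means each range moves into slots that are already vacated; whether a uniqueness constraint needs further care within a single statement depends on the schema and is not addressed here.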
…ze) (#135)

Rebase of #135 onto main (post #137).

Co-authored-by: Vlad Lasky <12727610+vlasky@users.noreply.github.com>
raphaelsty added a commit that referenced this pull request Jun 17, 2026
…supersedes #135/#138) (#139)

* perf(index): reduce incremental delete from O(index_size) to O(IVF_size)

Three optimizations cut incremental update time from 60s to ~3s on large
indices (~90K docs, 12GB vector data, 5GB metadata):

1. IVF patch instead of full rebuild (delete.rs): Load existing IVF, filter
   deleted doc IDs and renumber survivors via binary search, write back.
   Avoids re-reading all chunk code files and BTreeMap construction.

2. In-place PK re-sequencing (filtering.rs): Replace the temp-table-copy
   approach (which copied all TEXT/BLOB data 3x) with targeted UPDATE
   statements that only modify the integer _subset_ column for affected
   ranges.

3. Skip merged file regeneration (index.rs + colgrep): Add
   MmapIndex::update_append that writes new chunks and merges the IVF
   without loading the full MmapIndex (which triggers merged mmap file
   generation). Merged files regenerate lazily on next search.

Also skips FTS5 rebuild on the delete hot path when encoding follows
immediately (changed files), deferring to the post-encoding rebuild.

* fix(index): harden #135 (re-sequencing robustness, FTS/merge clarity, tests)

Review fixes on top of the #135 rebase:

- filtering::delete range re-sequencing now clamps the deleted-id set to ids that
  were actually present and in range before computing shifts. A stray negative or
  out-of-range id previously corrupted every survivor's _subset_ id (and thus the
  metadata/FTS/IVF alignment); the old ROW_NUMBER path was robust to any input.
- delete.rs: document that the in-place IVF patch relies on (and preserves) create's
  sorted+deduped per-centroid doc-id layout — the same layout deployed/next-plaid-api
  indexes use, so the patch is format-safe.
- update_append doc: clarify it DEFERS merged-file regeneration to the next search
  (lazy), not eliminates it.
- colgrep: correct the inaccurate 'final FTS rebuild after encoding' comment — there
  is none; keyword search stays correct because it re-verifies FTS hits against each
  doc's current content; stale survivor rows realign on the next delete-only rebuild.
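The clamp described in the first fix can be pictured with a small sketch. The function name `clamp_deleted_ids` is hypothetical, and the example assumes IDs are dense in `0..total`, so that "present" and "in range" coincide; the actual check in filtering.rs may instead consult the table.

```rust
/// Keep only IDs that can exist in a table of `total` rows, then sort and
/// deduplicate them so later shift computations see a clean input.
/// Assumes row IDs are dense in `0..total` (an assumption of this sketch).
fn clamp_deleted_ids(requested: &[i64], total: i64) -> Vec<i64> {
    let mut ids: Vec<i64> = requested
        .iter()
        .copied()
        // Drop stray negative or out-of-range IDs before they affect shifts.
        .filter(|&id| id >= 0 && id < total)
        .collect();
    ids.sort_unstable();
    ids.dedup();
    ids
}
```

Given `[7, -1, 3, 42, 3]` and a 10-row table, only `[3, 7]` survives. Without this step, a stray ID would still be counted when computing how far each survivor shifts, which is the corruption the fix guards against.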

Tests:
- filtering: test_delete_resequence_ignores_out_of_range_and_negative_ids (fails
  without the clamp).
- delete: assert the in-place IVF patch keeps each centroid bucket deduped and
  ivf_lengths consistent.
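The invariants these tests assert can be expressed as a small checker. This is a sketch under the assumption of a flattened `ivf` array sliced by `ivf_lengths`; the function name `ivf_is_consistent` is hypothetical and is not taken from the test suite.

```rust
/// Return true when `ivf_lengths` accounts for every entry of `ivf` and each
/// centroid bucket is strictly increasing (sorted with no duplicates).
fn ivf_is_consistent(ivf: &[i64], ivf_lengths: &[usize]) -> bool {
    // The lengths must cover the flattened array exactly; this also
    // guarantees the slicing below stays in bounds.
    if ivf_lengths.iter().sum::<usize>() != ivf.len() {
        return false;
    }
    let mut offset = 0;
    for &len in ivf_lengths {
        let bucket = &ivf[offset..offset + len];
        // Strictly increasing neighbours imply sorted and deduplicated.
        if !bucket.windows(2).all(|w| w[0] < w[1]) {
            return false;
        }
        offset += len;
    }
    true
}
```

A test in this spirit would build an index, delete a few documents, and assert that the patched IVF still passes the check.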

Validated: clippy clean; next-plaid 122, colgrep 557 tests pass.

Co-authored-by: Vlad Lasky <12727610+vlasky@users.noreply.github.com>

---------

Co-authored-by: Vlad Lasky <vlad.lasky@energyone.com>
Co-authored-by: Vlad Lasky <12727610+vlasky@users.noreply.github.com>
@raphaelsty

Collaborator Author

Superseded by #139 (merged into main as 8630af3), which includes this rebase plus the hardening fixes (id clamping, FTS5 pruning, lazy merged-file regen, sync/dedup tests, backward-compat). Closing as redundant.

@raphaelsty raphaelsty closed this Jun 17, 2026
@raphaelsty raphaelsty deleted the perf/incremental-delete-ivf branch June 17, 2026 11:54