
perf(filtering): fat-row-independent metadata delete re-sequencing (#136) #141

Merged: raphaelsty merged 1 commit into main from perf/metadata-fast-delete on Jun 17, 2026

@raphaelsty

Collaborator

Addresses #136 while preserving backward and forward compatibility with indexes produced by a deployed next-plaid-api.

Problem

Deleting a single file from a large fat metadata DB took seconds. _subset_ is the INTEGER PRIMARY KEY (rowid), so re-sequencing it on delete relocates every shifted row in the table b-tree — rewriting overflow pages for large TEXT columns (code, signature, …). A COUNT(*) guard added a full-table scan on top.
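The rowid aliasing behind this cost can be seen in a minimal Python `sqlite3` sketch. The table below is a toy stand-in for the v0 layout: only the names `METADATA`, `_subset_`, and `code` come from this PR, and the data is made up. It shows that an `INTEGER PRIMARY KEY` column is the rowid itself, so closing a gap after a delete rewrites the b-tree key of every later row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy v0-style layout (assumption: the real table has more columns).
conn.execute('CREATE TABLE METADATA ("_subset_" INTEGER PRIMARY KEY, code TEXT)')
conn.executemany(
    'INSERT INTO METADATA ("_subset_", code) VALUES (?, ?)',
    [(i, f"body{i}") for i in range(5)],
)

# INTEGER PRIMARY KEY makes _subset_ an alias for the rowid.
rows = conn.execute(
    'SELECT rowid, "_subset_" FROM METADATA ORDER BY rowid'
).fetchall()

# Delete id 1, then close the gap. Each shifted row changes its rowid,
# which is the key the table b-tree is organized by.
conn.execute('DELETE FROM METADATA WHERE "_subset_" = 1')
conn.execute(
    'UPDATE METADATA SET "_subset_" = "_subset_" - 1 WHERE "_subset_" > 1'
)
rows_after = conn.execute(
    'SELECT rowid, "_subset_" FROM METADATA ORDER BY rowid'
).fetchall()
```

In this toy case the rows are tiny; the seconds-long deletes reported here come from the same key rewrite applied to rows whose TEXT payload spills into overflow pages.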

Measured on current main (single-file delete, ~24 KB/row):

| Rows / DB size | front | mid | end |
|---|---|---|---|
| 50K / 1.33 GB | 10.8 s | 4.2 s | 1.4 s |
| 20K / 533 MB | 3.0 s | 1.8 s | 0.5 s |

Approach (compatible)

#136's thin/fat split would be fastest but breaks forward compatibility — an old binary requires METADATA._subset_ to physically hold the dense id. Instead this keeps every column in METADATA (identical SELECT * output) and removes the hotspot:

  • v1 layout: _subset_ becomes a regular INTEGER NOT NULL column with a plain index instead of the rowid PK. Re-sequencing UPDATEs no longer relocate rows. All joins/FTS already key on the _subset_ value, not the physical rowid, so behavior is unchanged.
  • Lazy, gated migration: legacy v0 indexes (where _subset_ is the PK) rebuild to v1 once on the first delete; PRAGMA user_version makes the check O(1) thereafter. Pure row shuffle, no re-encoding.
  • Floor win: replace the COUNT(*) re-sequence guard with an O(1) MAX(_subset_).
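As an illustration of the v1 layout and the cheaper guard, here is a minimal Python `sqlite3` sketch. It is not the project's Rust implementation: the index name, the helper `delete_and_resequence`, the 0-based dense ids, and the exact form of the guard (skip re-sequencing when nothing remains above the smallest deleted id) are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# v1-style layout: _subset_ is an ordinary indexed column, not the rowid PK.
conn.executescript("""
    CREATE TABLE METADATA ("_subset_" INTEGER NOT NULL, code TEXT);
    CREATE INDEX idx_metadata_subset ON METADATA ("_subset_");
""")
conn.executemany(
    "INSERT INTO METADATA VALUES (?, ?)", [(i, f"doc{i}") for i in range(6)]
)

def delete_and_resequence(conn, ids):
    """Delete ids, then close gaps so ids stay dense. Returns rows shifted."""
    ids = sorted(set(ids))
    if not ids:
        return 0
    with conn:  # commit on success, roll back on error
        conn.executemany(
            'DELETE FROM METADATA WHERE "_subset_" = ?', [(i,) for i in ids]
        )
        # Guard (assumed form): MAX on an indexed column is an index lookup,
        # unlike COUNT(*), which walks the table.
        max_left = conn.execute(
            'SELECT MAX("_subset_") FROM METADATA'
        ).fetchone()[0]
        if max_left is None or max_left < ids[0]:
            return 0  # only tail rows were deleted, nothing to shift
        shifted = 0
        # Highest gap first, so earlier shifts do not move later targets.
        for i in reversed(ids):
            cur = conn.execute(
                'UPDATE METADATA SET "_subset_" = "_subset_" - 1 '
                'WHERE "_subset_" > ?',
                (i,),
            )
            shifted += cur.rowcount
        return shifted

tail = delete_and_resequence(conn, [5])       # recent content: no shift
middle = delete_and_resequence(conn, [1, 3])  # older content: shifts rows
remaining = conn.execute(
    'SELECT "_subset_", code FROM METADATA ORDER BY "_subset_"'
).fetchall()
```

Deleting the tail id shifts nothing, which corresponds to the fast "recent" case in the measurements; deleting ids 1 and 3 still has to update every later row, but the update only rewrites the column value and its index entry rather than relocating the row.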

Results (main → this branch)

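| Rows / DB | position | main | new | speedup |
|---|---|---|---|---|
| 50K / 1.33 GB | recent (end) | 1379 ms | 8 ms | 178× |
| 50K / 1.33 GB | middle | 4189 ms | 2403 ms | 1.7× |
| 50K / 1.33 GB | oldest (front) | 10840 ms | 4825 ms | 2.2× |
| 20K / 533 MB | recent | 504 ms | 5 ms | 108× |
| 50K / 256 MB (4 KB rows) | recent | 452 ms | 4 ms | 102× |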

The common incremental-update case (deleting recently indexed content) now takes single-digit milliseconds, roughly 100–180× faster. The worst case (re-indexing the oldest file, which shifts every row) improves by only about 2× (10.8 s → 4.8 s); that is the ceiling for staying forward-compatible, since SQLite re-reads each shifted row's payload on UPDATE.

Compatibility

  • Forward: SELECT * columns are byte-identical to v0, so a deployed next-plaid-api reads/searches/mutates a v1 index unchanged.
  • Backward: v0 indexes are read directly and migrated to v1 on first delete.
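The lazy, version-gated rebuild can be sketched in Python with the standard `sqlite3` module. This is an assumption-laden illustration rather than the actual Rust code: the constant `V1`, the helper names `ensure_v1` and `snapshot`, the temporary table name, the index name, and the three-column toy schema are hypothetical, and FTS tables are left out.

```python
import sqlite3

V1 = 1  # hypothetical user_version value marking the v1 layout

def ensure_v1(conn):
    """Rebuild a v0 METADATA table into the v1 layout once. True if migrated."""
    if conn.execute("PRAGMA user_version").fetchone()[0] >= V1:
        return False  # cheap header read: already migrated
    # table_info rows are (cid, name, type, notnull, dflt_value, pk).
    cols = conn.execute("PRAGMA table_info(METADATA)").fetchall()
    defs = ", ".join(
        f'"{name}" INTEGER NOT NULL' if name == "_subset_" else f'"{name}" {ctype}'
        for _, name, ctype, *_ in cols
    )
    conn.execute("BEGIN")
    try:
        # Same columns in the same order, so SELECT * keeps its shape.
        conn.execute(f"CREATE TABLE METADATA_v1 ({defs})")
        conn.execute(
            'INSERT INTO METADATA_v1 SELECT * FROM METADATA ORDER BY "_subset_"'
        )
        conn.execute("DROP TABLE METADATA")
        conn.execute("ALTER TABLE METADATA_v1 RENAME TO METADATA")
        conn.execute('CREATE INDEX idx_metadata_subset ON METADATA ("_subset_")')
        conn.execute(f"PRAGMA user_version = {V1}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return True

def snapshot(conn):
    """Column names and rows as SELECT * returns them."""
    cur = conn.execute('SELECT * FROM METADATA ORDER BY "_subset_"')
    return [d[0] for d in cur.description], cur.fetchall()

# Toy v0 index: _subset_ is the rowid primary key.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    'CREATE TABLE METADATA ("_subset_" INTEGER PRIMARY KEY, code TEXT, signature TEXT)'
)
conn.executemany(
    "INSERT INTO METADATA VALUES (?, ?, ?)",
    [(i, f"body{i}", f"sig{i}") for i in range(4)],
)

before = snapshot(conn)
first = ensure_v1(conn)   # migrates
second = ensure_v1(conn)  # no-op, gated by user_version
after = snapshot(conn)
pk_flags = {row[1]: row[5] for row in conn.execute("PRAGMA table_info(METADATA)")}
```

Comparing `before` and `after` is a small-scale version of the column-identity check: a reader that only issues `SELECT *` sees the same column names and rows in either layout, while `pk_flags` shows that `_subset_` is no longer the primary key.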

Tests

  • Unit: dense re-sequencing, v0→v1 migration, forward-compat column identity; full next-plaid suite green (557 tests via ci-quick).
  • Reproducible benchmark added as an #[ignore]d test (metadata_delete_bench).
  • Cross-version + colgrep/api stress validation in the PR thread.

Refs #136

…nt (#136)

Deleting a single file from a large fat metadata DB took seconds because
`_subset_` is the `INTEGER PRIMARY KEY` (rowid): re-sequencing it on delete
relocates every shifted row in the table b-tree, rewriting overflow pages for
large TEXT columns. A `COUNT(*)` guard added a full-table scan on top.

This keeps every column physically in METADATA (so a deployed next-plaid-api
reads the index unchanged — `SELECT *` output is byte-identical) while removing
the hotspot:

- v1 layout: `_subset_` becomes a regular `INTEGER NOT NULL` column with a plain
  index instead of the rowid PK. Re-sequencing `UPDATE`s no longer relocate rows
  in the b-tree. All joins/FTS already key on the `_subset_` value, not the
  physical rowid, so behavior is unchanged.
- Lazy, gated migration: legacy v0 indexes (where `_subset_` is the PK) rebuild
  to v1 once on the first delete; `PRAGMA user_version` makes the check O(1)
  thereafter. Pure row shuffle, no re-encoding.
- Replace the `COUNT(*)` re-sequence guard with an O(1) `MAX(_subset_)`.

Measured (single-file delete, `code` ~24 KB/row):
- localized / recent deletes (high `_subset_`): ~100-180x faster (e.g. 50K rows
  1.33GB: 1379ms -> 8ms) -- the common incremental-update case is now instant.
- worst case (delete oldest content, shifts ~all rows): ~2-3x (10.8s -> 4.8s).
  Bounded by SQLite re-reading each shifted row's payload on UPDATE; going
  further needs a thin/fat split, which would break forward compatibility.

Tests: dense re-sequencing, v0->v1 migration, and forward-compat column identity;
full next-plaid suite green. Heavy benchmark added as an #[ignore]d test.

Refs #136
@raphaelsty

Collaborator Author

Stress test results — deletion & incremental updates, both surfaces, both versions

Validated that the v1 metadata layout works correctly under heavy add/delete/update churn and stays forward/backward compatible with indexes from a deployed next-plaid-api. Model: lightonai/answerai-colbert-small-v1-onnx.

1. colgrep end-to-end — 19/19 ✅

Real repo: index → search → delete files → re-index → edit → add → second delete wave. Keyword-search correctness asserted at every step (+ semantic-search smoke check). Deletion, _subset_ re-sequencing, FTS5 pruning, and incremental updates all stay correct across multiple churn waves.

2. next-plaid-api — single-version heavy churn (v1 image) — 7/7 ops ✅

SciFact add/delete/re-add cycle, counts exact at every step:

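| step | op | docs | expected total | got |
|---|---|---|---|---|
| 1 | ADD | 1000 | 1000 | 1000 |
| 2 | DELETE | 200 | 800 | 800 |
| 3 | ADD | 200 | 1000 | 1000 |
| 4 | ADD | 1000 | 2000 | 2000 |
| 5 | DELETE | 500 | 1500 | 1500 |
| 6 | ADD | 2983 | 4483 | 4483 |
| 7 | ADD | 700 | 5183 | 5183 |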

Final 5183 docs. Scores MAP 0.698 / NDCG@10 0.737 / Recall@100 0.948 — identical to main, no regression.

3. Cross-version — backward compat (OLD v0 → NEW) — 9/9 ✅

OLD (origin/main, v0 layout) builds + populates a full SciFact index → NEW reads it → NEW deletes half, triggering the v0→v1 migration → NEW re-adds. Counts exact, metadata-filtered search correct (0 violations), retrieval scores within 0.0016 of the OLD baseline.

4. Cross-version — forward compat (NEW v1 → OLD) ✅

NEW builds a v1 index → OLD (deployed image) then reads, metadata-filters, deletes from, and adds to it — all correct. Decisive apples-to-apples check: OLD and NEW evaluating the same v1 index produce byte-identical scores (max drift 0.00000, identical counts). A deployed next-plaid-api searches/mutates a v1 index exactly like the new code.

Verdict

Deletion and incremental updates are correct under churn on both colgrep and next-plaid-api. The new layout is fully forward- and backward-compatible: old binaries read/search/mutate v1 indexes unchanged, and legacy v0 indexes migrate transparently on first delete. Nothing breaks.

@raphaelsty raphaelsty merged commit 8889923 into main Jun 17, 2026
20 checks passed
@vlasky

vlasky commented Jun 17, 2026

Contributor

Nice work on the forward-compat approach. I left some follow-up benchmarks on #136 showing a remaining 26x gap between demoted-PK and thin-table for worst-case deletes (files indexed early in the project). Might be worth a look if you're considering a v2 layout down the line.
