perf(filtering): fat-row-independent metadata delete re-sequencing (#136) #141
…nt (#136)

Deleting a single file from a large fat metadata DB took seconds because `_subset_` is the `INTEGER PRIMARY KEY` (rowid): re-sequencing it on delete relocates every shifted row in the table b-tree, rewriting overflow pages for large TEXT columns. A `COUNT(*)` guard added a full-table scan on top.

This keeps every column physically in METADATA (so a deployed next-plaid-api reads the index unchanged; `SELECT *` output is byte-identical) while removing the hotspot:

- v1 layout: `_subset_` becomes a regular `INTEGER NOT NULL` column with a plain index instead of the rowid PK. Re-sequencing `UPDATE`s no longer relocate rows in the b-tree. All joins/FTS already key on the `_subset_` value, not the physical rowid, so behavior is unchanged.
- Lazy, gated migration: legacy v0 indexes (where `_subset_` is the PK) rebuild to v1 once on the first delete; `PRAGMA user_version` makes the check O(1) thereafter. Pure row shuffle, no re-encoding.
- Replace the `COUNT(*)` re-sequence guard with an O(1) `MAX(_subset_)`.

Measured (single-file delete, code ~24KB/row):

- localized / recent deletes (high `_subset_`): ~100-180x faster (e.g. 50K rows 1.33GB: 1379ms -> 8ms). The common incremental-update case is now instant.
- worst case (delete oldest content, shifts ~all rows): ~2-3x faster (10.8s -> 4.8s). Bounded by SQLite re-reading each shifted row's payload on UPDATE; going further needs a thin/fat split, which would break forward compatibility.

Tests: dense re-sequencing, v0->v1 migration, and forward-compat column identity; full next-plaid suite green. Heavy benchmark added as an #[ignore]d test.

Refs #136
Stress test results: deletion and incremental updates, both surfaces, both versions

Validated that the v1 metadata layout works correctly under heavy add/delete/update churn and stays forward/backward compatible with indexes from a deployed next-plaid-api. Model: …

1. colgrep end-to-end: 19/19 ✅
Real repo: index → search → delete files → re-index → edit → add → second delete wave. Keyword-search correctness asserted at every step (+ semantic-search smoke check). Deletion, …

2. next-plaid-api, single-version heavy churn (v1 image): 7/7 ops ✅
SciFact add/delete/re-add cycle, counts exact at every step:
Final 5183 docs. Scores MAP 0.698 / NDCG@10 0.737 / Recall@100 0.948, identical to …

3. Cross-version, backward compat (OLD v0 → NEW): 9/9 ✅
OLD ( …

4. Cross-version, forward compat (NEW v1 → OLD) ✅
NEW builds a v1 index → OLD (deployed image) then reads, metadata-filters, deletes from, and adds to it, all correct. Decisive apples-to-apples check: OLD and NEW evaluating the same v1 index produce byte-identical scores (max drift …

Verdict
Deletion and incremental updates are correct under churn on both colgrep and next-plaid-api. The new layout is fully forward- and backward-compatible: old binaries read/search/mutate v1 indexes unchanged, and legacy v0 indexes migrate transparently on first delete. Nothing breaks.
Nice work on the forward-compat approach. I left some follow-up benchmarks on #136 showing a remaining 26x gap between demoted-PK and thin-table for worst-case deletes (files indexed early in the project). Might be worth a look if you're considering a v2 layout down the line.
Addresses #136 while preserving backward and forward compatibility with indexes produced by a deployed next-plaid-api.
Problem
Deleting a single file from a large fat metadata DB took seconds. `_subset_` is the `INTEGER PRIMARY KEY` (rowid), so re-sequencing it on delete relocates every shifted row in the table b-tree, rewriting overflow pages for large TEXT columns (`code`, `signature`, …). A `COUNT(*)` guard added a full-table scan on top.

Measured on current `main` (single-file delete, ~24 KB/row):

Approach (compatible)
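As background for the layout change, the rowid aliasing behind the slowdown can be observed directly. The following is a minimal sketch using Python's standard `sqlite3` module, not the project's code: the table names `metadata_v0` / `metadata_v1`, the two-column schema, the index name, and the 0-based ids are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed v0-style layout: INTEGER PRIMARY KEY makes _subset_ an alias of the rowid,
# so changing _subset_ changes the row's key in the table b-tree.
conn.execute("CREATE TABLE metadata_v0 (_subset_ INTEGER PRIMARY KEY, code TEXT)")

# Assumed v1-style layout: _subset_ is an ordinary column with a plain index;
# the hidden rowid stays whatever SQLite assigned at insert time.
conn.execute("CREATE TABLE metadata_v1 (_subset_ INTEGER NOT NULL, code TEXT)")
conn.execute("CREATE INDEX idx_v1_subset ON metadata_v1(_subset_)")

rows = [(i, f"body {i}") for i in range(5)]
conn.executemany("INSERT INTO metadata_v0 VALUES (?, ?)", rows)
conn.executemany("INSERT INTO metadata_v1 VALUES (?, ?)", rows)


def delete_and_resequence(table, subset_id):
    """Delete one id, then shift every higher id down by one to stay dense."""
    conn.execute(f"DELETE FROM {table} WHERE _subset_ = ?", (subset_id,))
    conn.execute(
        f"UPDATE {table} SET _subset_ = _subset_ - 1 WHERE _subset_ > ?",
        (subset_id,),
    )


delete_and_resequence("metadata_v0", 1)
delete_and_resequence("metadata_v1", 1)

# (rowid, _subset_) pairs after the delete.
v0 = conn.execute("SELECT rowid, _subset_ FROM metadata_v0 ORDER BY _subset_").fetchall()
v1 = conn.execute("SELECT rowid, _subset_ FROM metadata_v1 ORDER BY _subset_").fetchall()
print("v0:", v0)
print("v1:", v1)
```

In the v0 table the rowid tracks `_subset_`, so every shifted row gets a new b-tree key. In the v1 table the rowids of the surviving rows are untouched and only the `_subset_` values (and their index entries) change.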
#136's thin/fat split would be fastest but breaks forward compatibility: an old binary requires `METADATA._subset_` to physically hold the dense id. Instead this keeps every column in `METADATA` (identical `SELECT *` output) and removes the hotspot:

- v1 layout: `_subset_` becomes a regular `INTEGER NOT NULL` column with a plain index instead of the rowid PK. Re-sequencing `UPDATE`s no longer relocate rows. All joins/FTS already key on the `_subset_` value, not the physical rowid, so behavior is unchanged.
- Lazy, gated migration: legacy v0 indexes (where `_subset_` is the PK) rebuild to v1 once on the first delete; `PRAGMA user_version` makes the check O(1) thereafter. Pure row shuffle, no re-encoding.
- Replace the `COUNT(*)` re-sequence guard with an O(1) `MAX(_subset_)`.

Results (main → this branch)
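The spread between the best and worst cases follows from how many rows a single delete has to shift. Here is a minimal sketch of that relationship in Python's `sqlite3`, assuming a simplified two-column `METADATA` schema; `make_db` and `delete_one` are hypothetical helpers, and the guard shown is only one plausible reading of the `MAX(_subset_)` check, not the project's implementation.

```python
import sqlite3


def make_db(n):
    """Build an in-memory v1-style table with dense ids 0..n-1 (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE METADATA (_subset_ INTEGER NOT NULL, code TEXT)")
    conn.execute("CREATE INDEX idx_metadata_subset ON METADATA(_subset_)")
    conn.executemany(
        "INSERT INTO METADATA VALUES (?, ?)",
        [(i, "x" * 100) for i in range(n)],
    )
    return conn


def delete_one(conn, subset_id):
    """Delete one id, close the gap, and return how many rows were shifted."""
    conn.execute("DELETE FROM METADATA WHERE _subset_ = ?", (subset_id,))
    # Guard: MAX() on an indexed column is a single index lookup, not a table scan.
    (max_id,) = conn.execute("SELECT MAX(_subset_) FROM METADATA").fetchone()
    if max_id is None or max_id < subset_id:
        return 0  # nothing lies above the gap, so no re-sequencing is needed
    cur = conn.execute(
        "UPDATE METADATA SET _subset_ = _subset_ - 1 WHERE _subset_ > ?",
        (subset_id,),
    )
    return cur.rowcount


last = delete_one(make_db(1000), 999)    # newest row: nothing to shift
recent = delete_one(make_db(1000), 990)  # recent row: only the few rows above it
oldest = delete_one(make_db(1000), 0)    # oldest row: every remaining row shifts
print(last, recent, oldest)
```

Deleting recently indexed content touches only the handful of rows above it, while deleting the oldest content rewrites `_subset_` on every remaining row, which is why that case stays bounded by per-row `UPDATE` cost.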
The common incremental-update case (deleting recently-indexed content) is now single-digit ms (~100–180× faster). Worst case (re-indexing the oldest file, shifting every row) is ~2–3× faster; that's the ceiling for staying forward-compatible, since SQLite re-reads each shifted row's payload on `UPDATE`.

Compatibility
`SELECT *` columns are byte-identical to v0, so a deployed next-plaid-api reads/searches/mutates a v1 index unchanged.

Tests
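The migration and column-identity checks can be sketched as follows. This is an illustration under assumptions, not the project's test code: `migrate_to_v1` and `snapshot` are hypothetical helpers, the schema is reduced to two columns, and the rebuild-via-temporary-table strategy is one common way to change a primary key in SQLite rather than a confirmed detail of this PR.

```python
import sqlite3


def migrate_to_v1(conn):
    """Rebuild a v0 table into the v1 layout once; return True if a rebuild ran."""
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    if version >= 1:
        return False  # already migrated: a single PRAGMA read, no table access
    conn.execute("BEGIN")
    try:
        conn.execute("CREATE TABLE METADATA_new (_subset_ INTEGER NOT NULL, code TEXT)")
        # Copy rows as-is: same columns, same order, no re-encoding.
        conn.execute("INSERT INTO METADATA_new SELECT * FROM METADATA ORDER BY _subset_")
        conn.execute("DROP TABLE METADATA")
        conn.execute("ALTER TABLE METADATA_new RENAME TO METADATA")
        conn.execute("CREATE INDEX idx_metadata_subset ON METADATA(_subset_)")
        conn.execute("PRAGMA user_version = 1")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return True


def snapshot(conn):
    """Return (column names, rows) as an old reader would see them via SELECT *."""
    cur = conn.execute("SELECT * FROM METADATA ORDER BY _subset_")
    return [d[0] for d in cur.description], cur.fetchall()


# isolation_level=None puts the connection in autocommit mode so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE METADATA (_subset_ INTEGER PRIMARY KEY, code TEXT)")  # assumed v0 schema
conn.executemany("INSERT INTO METADATA VALUES (?, ?)", [(i, f"fn_{i}") for i in range(4)])

before = snapshot(conn)
first = migrate_to_v1(conn)   # performs the rebuild
second = migrate_to_v1(conn)  # gated off by user_version
after = snapshot(conn)
print(first, second, before == after)
```

Comparing `snapshot` output before and after the migration is the kind of check that guards forward compatibility: a reader that only issues `SELECT *` cannot tell the two layouts apart.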
- … `ci-quick`).
- … `#[ignore]`d test (`metadata_delete_bench`).

Refs #136