perf(filtering): fat-row-independent metadata delete re-sequencing (#136) by raphaelsty · Pull Request #141 · lightonai/next-plaid

raphaelsty · 2026-06-17T12:54:51Z

Addresses #136 while preserving backward and forward compatibility with indexes produced by a deployed next-plaid-api.

Problem

Deleting a single file from a large fat metadata DB took seconds. _subset_ is the INTEGER PRIMARY KEY (rowid), so re-sequencing it on delete relocates every shifted row in the table b-tree — rewriting overflow pages for large TEXT columns (code, signature, …). A COUNT(*) guard added a full-table scan on top.

Measured on current main (single-file delete, ~24 KB/row):

Rows / DB size	front (oldest)	mid	end (recent)
50K / 1.33 GB	10.8 s	4.2 s	1.4 s
20K / 533 MB	3.0 s	1.8 s	0.5 s

Approach (compatible)

#136's thin/fat split would be fastest but breaks forward compatibility — an old binary requires METADATA._subset_ to physically hold the dense id. Instead this keeps every column in METADATA (identical SELECT * output) and removes the hotspot:

v1 layout: _subset_ becomes a regular INTEGER NOT NULL column with a plain index instead of the rowid PK. Re-sequencing UPDATEs no longer relocate rows. All joins/FTS already key on the _subset_ value, not the physical rowid, so behavior is unchanged.
Lazy, gated migration: legacy v0 indexes (where _subset_ is the PK) rebuild to v1 once on the first delete; PRAGMA user_version makes the check O(1) thereafter. Pure row shuffle, no re-encoding.
Floor win: replace the COUNT(*) re-sequence guard with an O(1) MAX(_subset_).
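The v1 layout and the delete path described above can be sketched with Python's built-in `sqlite3` module (the project itself is Rust; Python is used here only to keep the sketch self-contained). The `code` column, the index name `idx_metadata_subset`, the helper `delete_and_resequence`, and the exact guard condition are assumptions for illustration, not the PR's actual implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed v1 layout: _subset_ is an ordinary indexed column, not the rowid alias,
# so updating it rewrites the small index entry instead of relocating the fat row.
conn.executescript("""
    CREATE TABLE METADATA (_subset_ INTEGER NOT NULL, code TEXT);
    CREATE INDEX idx_metadata_subset ON METADATA(_subset_);
""")
conn.executemany(
    "INSERT INTO METADATA (_subset_, code) VALUES (?, ?)",
    [(i, f"body-{i}") for i in range(6)],
)


def delete_and_resequence(conn, doomed):
    """Delete the given ids, then close the gaps so ids stay dense (0..n-1)."""
    doomed = sorted(set(doomed), reverse=True)  # largest id first
    if not doomed:
        return
    with conn:  # one transaction for the delete and the shifts
        conn.executemany(
            "DELETE FROM METADATA WHERE _subset_ = ?", [(d,) for d in doomed]
        )
        # Hypothetical guard: MAX() is a single seek on the index, unlike COUNT(*).
        max_id = conn.execute("SELECT MAX(_subset_) FROM METADATA").fetchone()[0]
        if max_id is None or max_id < doomed[-1]:
            return  # nothing sits above the smallest deleted id, so nothing shifts
        # Processing the largest deleted id first keeps the later comparisons valid.
        for d in doomed:
            conn.execute(
                "UPDATE METADATA SET _subset_ = _subset_ - 1 WHERE _subset_ > ?",
                (d,),
            )


delete_and_resequence(conn, [1, 3])
rows = conn.execute(
    "SELECT _subset_, code FROM METADATA ORDER BY _subset_"
).fetchall()
```

Deleting the highest ids takes the early-return path, which corresponds to the "recent" rows in the results; deleting id 0 shifts every surviving row, which corresponds to the "oldest (front)" case.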

Results (main → this branch)

Rows / DB size	delete position	main	this branch	speedup
50K / 1.33 GB	recent (end)	1379 ms	8 ms	178×
50K / 1.33 GB	middle	4189 ms	2403 ms	1.7×
50K / 1.33 GB	oldest (front)	10840 ms	4825 ms	2.2×
20K / 533 MB	recent (end)	504 ms	5 ms	108×
50K / 256 MB (4 KB rows)	recent (end)	452 ms	4 ms	102×

The common incremental-update case (deleting recently indexed content) now takes single-digit milliseconds, roughly 100–180× faster than main. The worst case (re-indexing the oldest file, which shifts every row) improves by only about 2× (10.8 s → 4.8 s); that is the ceiling for staying forward-compatible, since SQLite re-reads each shifted row's payload on UPDATE.

Compatibility

Forward: SELECT * columns are byte-identical to v0, so a deployed next-plaid-api reads/searches/mutates a v1 index unchanged.
Backward: v0 indexes are read directly and migrated to v1 on first delete.
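A minimal sketch of how a gated v0→v1 rebuild and the column-identity check could look, again using Python's `sqlite3`. The function name `migrate_v0_to_v1`, the temporary table name `METADATA_v1`, the index name, and the two-column demo schema are hypothetical; the real schema has more columns and FTS tables that are not modeled here.

```python
import sqlite3


def migrate_v0_to_v1(conn):
    """Rebuild METADATA once so _subset_ stops being the rowid alias."""
    # Gate: user_version lives in the database header, so this check is O(1).
    if conn.execute("PRAGMA user_version").fetchone()[0] >= 1:
        return False
    info = conn.execute("PRAGMA table_info(METADATA)").fetchall()
    # Keep column names, declared types, and order; only _subset_ loses PRIMARY KEY.
    defs = ", ".join(
        '"_subset_" INTEGER NOT NULL' if name == "_subset_" else f'"{name}" {ctype}'
        for _cid, name, ctype, *_rest in info
    )
    names = ", ".join(f'"{name}"' for _cid, name, *_rest in info)
    conn.executescript(f"""
        BEGIN;
        CREATE TABLE METADATA_v1 ({defs});
        INSERT INTO METADATA_v1 ({names}) SELECT {names} FROM METADATA;
        DROP TABLE METADATA;
        ALTER TABLE METADATA_v1 RENAME TO METADATA;
        CREATE INDEX idx_metadata_subset ON METADATA(_subset_);
        PRAGMA user_version = 1;
        COMMIT;
    """)
    return True


# Demo on an assumed legacy (v0) table where _subset_ is the INTEGER PRIMARY KEY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE METADATA (_subset_ INTEGER PRIMARY KEY, code TEXT)")
conn.executemany("INSERT INTO METADATA VALUES (?, ?)", [(0, "a"), (1, "b")])

before = conn.execute("SELECT * FROM METADATA ORDER BY _subset_").fetchall()
first = migrate_v0_to_v1(conn)   # performs the rebuild
second = migrate_v0_to_v1(conn)  # skipped by the user_version gate
after = conn.execute("SELECT * FROM METADATA ORDER BY _subset_").fetchall()
```

Comparing `before` and `after` is the sketch's stand-in for the forward-compat claim: a reader that only issues `SELECT *` sees the same columns and values under either layout.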

Tests

Unit: dense re-sequencing, v0→v1 migration, forward-compat column identity; full next-plaid suite green (557 tests via ci-quick).
Reproducible benchmark added as an #[ignore]d test (metadata_delete_bench).
Cross-version + colgrep/api stress validation in the PR thread.

Refs #136

…nt (#136)

Deleting a single file from a large fat metadata DB took seconds because `_subset_` is the `INTEGER PRIMARY KEY` (rowid): re-sequencing it on delete relocates every shifted row in the table b-tree, rewriting overflow pages for large TEXT columns. A `COUNT(*)` guard added a full-table scan on top.

This keeps every column physically in METADATA (so a deployed next-plaid-api reads the index unchanged; `SELECT *` output is byte-identical) while removing the hotspot:

- v1 layout: `_subset_` becomes a regular `INTEGER NOT NULL` column with a plain index instead of the rowid PK. Re-sequencing `UPDATE`s no longer relocate rows in the b-tree. All joins/FTS already key on the `_subset_` value, not the physical rowid, so behavior is unchanged.
- Lazy, gated migration: legacy v0 indexes (where `_subset_` is the PK) rebuild to v1 once on the first delete; `PRAGMA user_version` makes the check O(1) thereafter. Pure row shuffle, no re-encoding.
- Replace the `COUNT(*)` re-sequence guard with an O(1) `MAX(_subset_)`.

Measured (single-file delete, code ~24KB/row):

- localized / recent deletes (high `_subset_`): ~100-180x faster (e.g. 50K rows 1.33GB: 1379ms -> 8ms) -- the common incremental-update case is now instant.
- worst case (delete oldest content, shifts ~all rows): ~2-3x (10.8s -> 4.8s). Bounded by SQLite re-reading each shifted row's payload on UPDATE; going further needs a thin/fat split, which would break forward compatibility.

Tests: dense re-sequencing, v0->v1 migration, and forward-compat column identity; full next-plaid suite green. Heavy benchmark added as an #[ignore]d test.

Refs #136

raphaelsty · 2026-06-17T13:26:15Z

Stress test results — deletion & incremental updates, both surfaces, both versions

Validated that the v1 metadata layout works correctly under heavy add/delete/update churn and stays forward/backward compatible with indexes from a deployed next-plaid-api. Model: lightonai/answerai-colbert-small-v1-onnx.

1. colgrep end-to-end — 19/19 ✅

Real repo: index → search → delete files → re-index → edit → add → second delete wave. Keyword-search correctness asserted at every step (+ semantic-search smoke check). Deletion, _subset_ re-sequencing, FTS5 pruning, and incremental updates all stay correct across multiple churn waves.

2. next-plaid-api — single-version heavy churn (v1 image) — 7/7 ops ✅

SciFact add/delete/re-add cycle, counts exact at every step:

step	op	expected	got
1	ADD 1000	1000	1000
2	DELETE 200	800	800
3	ADD 200	1000	1000
4	ADD 1000	2000	2000
5	DELETE 500	1500	1500
6	ADD 2983	4483	4483
7	ADD 700	5183	5183

Final 5183 docs. Scores MAP 0.698 / NDCG@10 0.737 / Recall@100 0.948 — identical to main, no regression.
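The expected counts in the table follow from a running total of the add and delete sizes. As a quick sanity check, the ledger can be replayed in a few lines of Python (the helper `running_counts` is illustrative and not part of the test harness):

```python
# (operation, number of documents) copied from the table above
ops = [
    ("ADD", 1000), ("DELETE", 200), ("ADD", 200), ("ADD", 1000),
    ("DELETE", 500), ("ADD", 2983), ("ADD", 700),
]
expected = [1000, 800, 1000, 2000, 1500, 4483, 5183]


def running_counts(ops):
    """Return the document count after each operation."""
    total, out = 0, []
    for op, n in ops:
        total += n if op == "ADD" else -n
        out.append(total)
    return out


counts = running_counts(ops)
```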

3. Cross-version — backward compat (OLD v0 → NEW) — 9/9 ✅

OLD (origin/main, v0 layout) builds + populates a full SciFact index → NEW reads it → NEW deletes half, triggering the v0→v1 migration → NEW re-adds. Counts exact, metadata-filtered search correct (0 violations), retrieval scores within 0.0016 of the OLD baseline.

4. Cross-version — forward compat (NEW v1 → OLD) ✅

NEW builds a v1 index → OLD (deployed image) then reads, metadata-filters, deletes from, and adds to it — all correct. Decisive apples-to-apples check: OLD and NEW evaluating the same v1 index produce byte-identical scores (max drift 0.00000, identical counts). A deployed next-plaid-api searches/mutates a v1 index exactly like the new code.

Verdict

Deletion and incremental updates are correct under churn on both colgrep and next-plaid-api. The new layout is fully forward- and backward-compatible: old binaries read/search/mutate v1 indexes unchanged, and legacy v0 indexes migrate transparently on first delete. Nothing breaks.

vlasky · 2026-06-17T23:37:28Z

Nice work on the forward-compat approach. I left some follow-up benchmarks on #136 showing a remaining 26x gap between demoted-PK and thin-table for worst-case deletes (files indexed early in the project). Might be worth a look if you're considering a v2 layout down the line.

raphaelsty merged commit 8889923 into main Jun 17, 2026
20 checks passed

raphaelsty mentioned this pull request Jun 17, 2026

perf(filtering): split METADATA into thin index + fat content table #136

Open

vlasky mentioned this pull request Jun 18, 2026

perf(filtering): split METADATA into thin + fat tables (v2 schema) #144

Open

5 tasks

raphaelsty merged 1 commit into main from perf/metadata-fast-delete
