Problem
The filtering::delete function re-sequences _subset_ primary keys after removing rows to maintain the dense 0-based ID invariant required by the PLAID vector index. On large indices (~90K rows, ~5GB database), this takes ~1s even for single-file updates.
The root cause is that _subset_ is the INTEGER PRIMARY KEY (i.e. the rowid). Changing it relocates the row in the B-tree, which requires reading and rewriting all ~27KB per row (including overflow pages holding large TEXT columns like code, signature, etc).
Benchmarks on a 20K-row table with 27KB rows, shifting ~19K rows:
| Approach |
Time |
_subset_ as PK (current) |
0.94s |
Autoincrement PK + indexed _subset_ |
0.21s |
| Thin table (no fat columns) |
0.008s |
The thin table is ~115x faster than the current approach because SQLite never touches overflow pages.
Proposal
Split METADATA into two tables:
METADATA (thin, ~5-10MB for 90K rows):
_subset_ INTEGER PRIMARY KEY, -- dense 0-based, re-sequenced on delete
_content_id_ INTEGER NOT NULL, -- FK to METADATA_CONTENT (stable, monotonic)
file TEXT,
name TEXT,
qualified_name TEXT,
line INTEGER,
end_line INTEGER,
language TEXT,
unit_type TEXT,
complexity INTEGER,
has_loops INTEGER,
has_branches INTEGER,
has_error_handling INTEGER
METADATA_CONTENT (fat, ~2.5GB for 90K rows):
_content_id_ INTEGER PRIMARY KEY, -- stable monotonic, never re-sequenced
code TEXT,
signature TEXT,
docstring TEXT,
parameters TEXT,
calls TEXT,
called_by TEXT,
variables TEXT,
imports TEXT,
return_type TEXT,
extends TEXT,
parent_class TEXT
Impact on public API
The public interface (create, update, delete, get, where_condition) stays the same. Internal changes only:
| Function |
Change |
create / update |
INSERT into both tables |
delete |
Re-sequence thin table only (~8ms); cascade-delete orphaned content rows |
get |
JOIN on _content_id_ |
where_condition |
No change (filters on thin columns only) |
where_condition_regexp |
JOIN when pattern targets a content column |
count / exists |
No change |
Migration
Use PRAGMA user_version to track schema version:
- v0: current single-table layout
- v1: split layout
On first load with new code, detect v0 and run a one-time migration:
- CREATE METADATA_CONTENT, populate from existing fat columns
- Rebuild METADATA as thin table (CREATE new, INSERT, DROP old, RENAME)
- Set
PRAGMA user_version = 1
- VACUUM
Pure row shuffling, no re-encoding needed. ~30-60s on a 5GB database. Transaction ensures no partial state on interruption.
Context
Complements PR #135 which addresses the other half of incremental delete performance (IVF patching). Together these would bring a small incremental update from ~60s to well under 1s on large indices.
Problem
The
filtering::deletefunction re-sequences_subset_primary keys after removing rows to maintain the dense 0-based ID invariant required by the PLAID vector index. On large indices (~90K rows, ~5GB database), this takes ~1s even for single-file updates.The root cause is that
_subset_is theINTEGER PRIMARY KEY(i.e. the rowid). Changing it relocates the row in the B-tree, which requires reading and rewriting all ~27KB per row (including overflow pages holding large TEXT columns likecode,signature, etc).Benchmarks on a 20K-row table with 27KB rows, shifting ~19K rows:
_subset_as PK (current)_subset_The thin table is ~115x faster than the current approach because SQLite never touches overflow pages.
Proposal
Split METADATA into two tables:
METADATA (thin, ~5-10MB for 90K rows):
METADATA_CONTENT (fat, ~2.5GB for 90K rows):
Impact on public API
The public interface (
create,update,delete,get,where_condition) stays the same. Internal changes only:create/updatedeleteget_content_id_where_conditionwhere_condition_regexpcount/existsMigration
Use
PRAGMA user_versionto track schema version:On first load with new code, detect v0 and run a one-time migration:
PRAGMA user_version = 1Pure row shuffling, no re-encoding needed. ~30-60s on a 5GB database. Transaction ensures no partial state on interruption.
Context
Complements PR #135 which addresses the other half of incremental delete performance (IVF patching). Together these would bring a small incremental update from ~60s to well under 1s on large indices.