perf(filtering): split METADATA into thin + fat tables (v2 schema) by vlasky · Pull Request #144 · lightonai/next-plaid

vlasky · 2026-06-18T11:12:33Z

Summary

Splits the single METADATA table into a thin table (~50 bytes/row) and a fat METADATA_CONTENT table (large TEXT columns), linked by a stable monotonic _content_id_. Re-sequencing after deletes now touches only the thin table, making performance position-independent.

Benchmarks (20K rows, 27KB each, delete near position 0):

v0 (INTEGER PRIMARY KEY): 940ms
v1 (#141, fat-row-independent delete re-sequencing for #136, demoted PK): 211ms
v2 (this PR, thin table): 8ms - 26x faster than v1, 117x faster than v0
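The speedup figures follow directly from the three timings. The sketch below recomputes them and adds a back-of-envelope comparison of how much data each table holds. It assumes 1 KB = 1024 bytes and treats the ~50 bytes/row thin-table size as exact; the byte totals are estimates, not measurements.

```python
# Timings reported in the benchmark above, in milliseconds.
timings_ms = {"v0": 940, "v1": 211, "v2": 8}

speedup_vs_v1 = timings_ms["v1"] / timings_ms["v2"]  # 26.375, reported as 26x
speedup_vs_v0 = timings_ms["v0"] / timings_ms["v2"]  # 117.5, reported as 117x

# Rough data volume per table for the benchmark (assumption: 1 KB = 1024 bytes).
rows = 20_000
fat_bytes = rows * 27 * 1024  # full rows at ~27KB each
thin_bytes = rows * 50        # thin rows at ~50 bytes each
volume_ratio = fat_bytes / thin_bytes

print(f"{speedup_vs_v1:.1f}x vs v1, {speedup_vs_v0:.1f}x vs v0")
print(f"fat: {fat_bytes / 1e6:.0f} MB, thin: {thin_bytes / 1e6:.0f} MB")
```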

Design

METADATA retains _subset_, _content_id_ FK, and small filterable columns (file, name, line, unit_type, language, complexity, booleans)
METADATA_CONTENT holds large TEXT columns (code, signature, docstring, parameters, calls, etc.) keyed by _content_id_ INTEGER PRIMARY KEY
_content_id_ is stable and monotonic - never re-sequenced on delete
PRAGMA user_version = 2 marks the new layout
Orphaned content rows are cleaned up on delete
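
For illustration, the layout above can be sketched with Python's built-in sqlite3 module. This is a minimal sketch, not the Rust code in filtering.rs: the column subset, the index name, the helper names `create_v2`, `add_rows`, and `delete_v2`, the 0-based `_subset_` numbering, and the use of AUTOINCREMENT to keep ids from being reused are all assumptions made for the example.

```python
import sqlite3

def create_v2(conn):
    # Fat table first so the thin table can reference it.
    # AUTOINCREMENT is an assumption of this sketch (prevents id reuse);
    # the PR only states INTEGER PRIMARY KEY.
    conn.executescript("""
        CREATE TABLE METADATA_CONTENT (
            _content_id_ INTEGER PRIMARY KEY AUTOINCREMENT,
            code TEXT
        );
        CREATE TABLE METADATA (
            _subset_ INTEGER NOT NULL,
            _content_id_ INTEGER NOT NULL
                REFERENCES METADATA_CONTENT(_content_id_),
            file TEXT,
            name TEXT
        );
        CREATE INDEX idx_metadata_subset ON METADATA(_subset_);
        PRAGMA user_version = 2;
    """)

def add_rows(conn, rows):
    """Append (file, name, code) tuples; _subset_ continues from the row count."""
    start = conn.execute("SELECT COUNT(*) FROM METADATA").fetchone()[0]
    for offset, (file, name, code) in enumerate(rows):
        cur = conn.execute(
            "INSERT INTO METADATA_CONTENT (code) VALUES (?)", (code,))
        conn.execute(
            "INSERT INTO METADATA (_subset_, _content_id_, file, name) "
            "VALUES (?, ?, ?, ?)",
            (start + offset, cur.lastrowid, file, name))

def delete_v2(conn, subset_ids):
    """Delete by _subset_, drop orphaned content, then close the gaps."""
    marks = ",".join("?" * len(subset_ids))
    conn.execute(
        f"DELETE FROM METADATA WHERE _subset_ IN ({marks})", subset_ids)
    # Orphan cleanup: content rows no thin row points at any more.
    conn.execute(
        "DELETE FROM METADATA_CONTENT WHERE _content_id_ NOT IN "
        "(SELECT _content_id_ FROM METADATA)")
    # Re-sequence _subset_ in the thin table only; _content_id_ is untouched.
    remaining = conn.execute(
        "SELECT rowid FROM METADATA ORDER BY _subset_").fetchall()
    conn.executemany(
        "UPDATE METADATA SET _subset_ = ? WHERE rowid = ?",
        [(new, rowid) for new, (rowid,) in enumerate(remaining)])

conn = sqlite3.connect(":memory:")
create_v2(conn)
add_rows(conn, [("a.rs", "f", "fn f() {}"),
                ("b.rs", "g", "fn g() {}"),
                ("c.rs", "h", "fn h() {}")])
delete_v2(conn, [0])
rows = conn.execute(
    "SELECT _subset_, _content_id_, name FROM METADATA ORDER BY _subset_"
).fetchall()
print(rows)  # [(0, 2, 'g'), (1, 3, 'h')]
```

After the delete, `_subset_` is contiguous again while the surviving `_content_id_` values keep their original numbers, which is the property the thin/fat split relies on.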

Backward compatibility

In next-plaid, v0/v1 read and write paths are preserved unchanged - existing indices work with no migration
INDEX_FORMAT_VERSION bumped to 2 in colgrep, triggering a full rebuild on upgrade (no in-place migration needed)
Removed the v0->v1 in-place migration path; format mismatches now trigger full rebuild
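
The two version checks described above can be sketched as follows. This is a hedged illustration in Python rather than the Rust in the PR: `pick_code_path` and `needs_full_rebuild` are hypothetical names, and treating every `user_version` other than 2 as the legacy v0/v1 layout is an assumption.

```python
import sqlite3

INDEX_FORMAT_VERSION = 2  # value stated in this PR

def pick_code_path(conn):
    """Library side: choose the table layout from PRAGMA user_version."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    # Assumption: anything other than 2 is handled by the legacy paths.
    return "v2" if version == 2 else "v0/v1"

def needs_full_rebuild(stored_format_version):
    """Indexer side: any format mismatch means rebuild, never migrate in place."""
    return stored_format_version != INDEX_FORMAT_VERSION

legacy = sqlite3.connect(":memory:")   # a fresh database reports user_version 0
current = sqlite3.connect(":memory:")
current.execute("PRAGMA user_version = 2")

print(pick_code_path(legacy), pick_code_path(current))
print(needs_full_rebuild(1), needs_full_rebuild(2))
```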

Changes

next-plaid/src/filtering.rs: v2 create, update_v2, delete_v2, get_v2, update_where_v2, JOIN-based where_condition/where_condition_regexp for fat column queries
next-plaid/src/text_search.rs: FTS rebuild()/update_rows() adapted to JOIN for v2
colgrep/src/index/state.rs: INDEX_FORMAT_VERSION: 1 -> 2
colgrep/src/index/mod.rs: simplified format mismatch handling to full rebuild
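
A JOIN-based condition query of the kind listed for filtering.rs might look like the sketch below. It is written in Python with sqlite3 for brevity; the function signature, the table aliases `m` and `c`, and the reduced column set are assumptions, not the PR's actual API. The condition string is treated as trusted SQL, with values bound through placeholders.

```python
import sqlite3

def where_condition_v2(conn, condition, params=()):
    """Return _subset_ ids matching a condition over thin and/or fat columns."""
    sql = (
        "SELECT m._subset_ FROM METADATA AS m "
        "JOIN METADATA_CONTENT AS c ON c._content_id_ = m._content_id_ "
        f"WHERE {condition} ORDER BY m._subset_"
    )
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE METADATA_CONTENT (_content_id_ INTEGER PRIMARY KEY, code TEXT);
    CREATE TABLE METADATA (_subset_ INTEGER, _content_id_ INTEGER, file TEXT);
    INSERT INTO METADATA_CONTENT VALUES (10, 'fn a() {}'), (11, 'unsafe fn b() {}');
    INSERT INTO METADATA VALUES (0, 10, 'a.rs'), (1, 11, 'b.rs');
""")

# Filter on a fat column (code) and a thin column (file) in one query.
hits = where_condition_v2(
    conn, "c.code LIKE ? AND m.file = ?", ("%unsafe%", "b.rs"))
print(hits)  # [1]
```

Queries that touch only thin columns would not need the JOIN at all, which is presumably why the JOIN is limited to fat column queries.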

Test plan

130 unit tests in next-plaid pass (including 6 new v2-specific tests)
619 tests in colgrep pass
54 tests in next-plaid-api pass
Integration tested on 3,722-file real-world index (search, incremental update, format-version-triggered rebuild)
Stress tested: delete at all positions, concurrent readers+writer, adversarial inputs (1MB blobs, Unicode, NUL bytes, SQL injection attempts, out-of-range IDs)

Re-sequencing after deletes now touches only a ~50 byte/row thin table instead of rewriting full ~27KB rows. Benchmarked at 8ms for 20K rows vs 211ms (v1) / 940ms (v0).

- METADATA retains small filterable columns (_subset_, file, name, etc.)
- METADATA_CONTENT holds large TEXT columns (code, signature, imports)
- Linked by stable monotonic _content_id_ (never re-sequenced)
- v0/v1 code paths preserved for backward compat reads
- FTS rebuild/update_rows adapted to JOIN for v2 layout
- INDEX_FORMAT_VERSION bumped to 2 (triggers full rebuild on upgrade)
- Removed v0->v1 in-place migration; format mismatches now full-rebuild

vlasky mentioned this pull request Jun 18, 2026

perf(filtering): split METADATA into thin index + fat content table #136 (Open)

vlasky wants to merge 1 commit into lightonai:main from vlasky:perf/metadata-thin-fat-split
