perf(filtering): split METADATA into thin + fat tables (v2 schema) #144

Open
vlasky wants to merge 1 commit into lightonai:main from vlasky:perf/metadata-thin-fat-split

vlasky (Contributor) commented Jun 18, 2026

Summary

Splits the single METADATA table into a thin table (~50 bytes/row) and a fat METADATA_CONTENT table (large TEXT columns), linked by a stable monotonic _content_id_. Re-sequencing after deletes now touches only the thin table, making performance position-independent.

Benchmarks (20K rows, 27KB each, delete near position 0): 8ms with the v2 layout, versus 210ms (v1) and 940ms (v0).

Design

  • METADATA retains _subset_, _content_id_ FK, and small filterable columns (file, name, line, unit_type, language, complexity, booleans)
  • METADATA_CONTENT holds large TEXT columns (code, signature, docstring, parameters, calls, etc.) keyed by _content_id_ INTEGER PRIMARY KEY
  • _content_id_ is stable and monotonic - never re-sequenced on delete
  • PRAGMA user_version = 2 marks the new layout
  • Orphaned content rows are cleaned up on delete
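
The layout described in the list above can be sketched in a few lines. This is a minimal illustration using Python's standard `sqlite3` module, not the Rust code in `filtering.rs`. The reduced column set, the helper names (`create_v2`, `insert_row`, `delete_row`) and the use of `AUTOINCREMENT` to keep `_content_id_` monotonic are assumptions made for the example.

```python
import sqlite3


def create_v2(conn):
    # Hypothetical, reduced v2 layout: the fat table holds large TEXT columns,
    # the thin table holds the position (_subset_) and small filterable columns.
    # AUTOINCREMENT is an assumption here; it stops SQLite from reusing ids.
    conn.executescript(
        """
        CREATE TABLE METADATA_CONTENT (
            _content_id_ INTEGER PRIMARY KEY AUTOINCREMENT,
            code TEXT,
            docstring TEXT
        );
        CREATE TABLE METADATA (
            _subset_ INTEGER NOT NULL,
            _content_id_ INTEGER NOT NULL
                REFERENCES METADATA_CONTENT(_content_id_),
            file TEXT,
            name TEXT
        );
        PRAGMA user_version = 2;
        """
    )


def insert_row(conn, file, name, code, docstring):
    # Append one document: content first, then the thin row pointing at it.
    with conn:
        content_id = conn.execute(
            "INSERT INTO METADATA_CONTENT (code, docstring) VALUES (?, ?)",
            (code, docstring),
        ).lastrowid
        position = conn.execute(
            "SELECT COALESCE(MAX(_subset_) + 1, 0) FROM METADATA"
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO METADATA (_subset_, _content_id_, file, name) "
            "VALUES (?, ?, ?, ?)",
            (position, content_id, file, name),
        )
    return position


def delete_row(conn, position):
    with conn:
        conn.execute("DELETE FROM METADATA WHERE _subset_ = ?", (position,))
        # Re-sequencing rewrites only thin rows; _content_id_ is left alone.
        conn.execute(
            "UPDATE METADATA SET _subset_ = _subset_ - 1 WHERE _subset_ > ?",
            (position,),
        )
        # Remove content rows that no thin row references any more.
        conn.execute(
            "DELETE FROM METADATA_CONTENT WHERE _content_id_ NOT IN "
            "(SELECT _content_id_ FROM METADATA)"
        )
```

In this sketch the `UPDATE` that closes the gap never reads or writes the large TEXT columns, which is the property the split is meant to provide.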

Backward compatibility

  • v0/v1 read and write paths preserved unchanged - existing indices work with no migration
  • INDEX_FORMAT_VERSION bumped to 2 in colgrep, triggering a full rebuild on upgrade (no in-place migration needed)
  • Removed the v0->v1 in-place migration path; format mismatches now trigger full rebuild
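
As a rough illustration of the two version checks described above, the sketch below reads `PRAGMA user_version` to pick a code path and compares a stored format version against `INDEX_FORMAT_VERSION` to decide on a rebuild. It is written in Python for brevity. The function names and the "legacy" label are hypothetical, and how the real code tells v0 from v1 is not shown in this PR description, so it is not modelled here.

```python
import sqlite3

INDEX_FORMAT_VERSION = 2  # value this PR sets in colgrep/src/index/state.rs


def metadata_layout(conn):
    # Hypothetical dispatch: user_version 2 marks the thin + fat layout;
    # anything else is routed to the preserved v0/v1 paths.
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    return "v2" if version == 2 else "legacy"


def needs_full_rebuild(stored_format_version):
    # With the in-place migration removed, any mismatch means a full rebuild.
    return stored_format_version != INDEX_FORMAT_VERSION
```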

Changes

  • next-plaid/src/filtering.rs: v2 create, update_v2, delete_v2, get_v2, update_where_v2, JOIN-based where_condition/where_condition_regexp for fat column queries
  • next-plaid/src/text_search.rs: FTS rebuild()/update_rows() adapted to JOIN for v2
  • colgrep/src/index/state.rs: INDEX_FORMAT_VERSION: 1 -> 2
  • colgrep/src/index/mod.rs: simplified format mismatch handling to full rebuild
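
To show why queries on fat columns need a JOIN in the v2 layout, here is a small sketch in Python's `sqlite3`. The signature of `where_condition` below is an assumption for illustration and may not match the Rust function of the same name. `build_demo_index` and its three rows are invented test data.

```python
import sqlite3


def build_demo_index():
    # Tiny in-memory index using a reduced, hypothetical v2 layout.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE METADATA_CONTENT (
            _content_id_ INTEGER PRIMARY KEY,
            code TEXT
        );
        CREATE TABLE METADATA (
            _subset_ INTEGER NOT NULL,
            _content_id_ INTEGER NOT NULL,
            language TEXT
        );
        """
    )
    rows = [
        ("rust", "fn main() {}"),
        ("rust", "unsafe fn poke() {}"),
        ("python", "def main(): pass"),
    ]
    for position, (language, code) in enumerate(rows):
        content_id = conn.execute(
            "INSERT INTO METADATA_CONTENT (code) VALUES (?)", (code,)
        ).lastrowid
        conn.execute(
            "INSERT INTO METADATA VALUES (?, ?, ?)",
            (position, content_id, language),
        )
    conn.commit()
    return conn


def where_condition(conn, condition, params=()):
    # The condition is a trusted SQL fragment; user values go through params.
    # Joining on _content_id_ lets the fragment mention thin and fat columns.
    sql = (
        "SELECT m._subset_ FROM METADATA AS m "
        "JOIN METADATA_CONTENT AS c ON c._content_id_ = m._content_id_ "
        f"WHERE {condition} ORDER BY m._subset_"
    )
    return [row[0] for row in conn.execute(sql, params)]
```

A condition can then mix both tables, for example `where_condition(conn, "language = ? AND code LIKE ?", ("rust", "%main%"))`, and the result is still a list of `_subset_` positions from the thin table.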

Related

Test plan

  • 130 unit tests in next-plaid pass (including 6 new v2-specific tests)
  • 619 tests in colgrep pass
  • 54 tests in next-plaid-api pass
  • Integration tested on 3,722-file real-world index (search, incremental update, format-version-triggered rebuild)
  • Stress tested: delete at all positions, concurrent readers+writer, adversarial inputs (1MB blobs, Unicode, NUL bytes, SQL injection attempts, out-of-range IDs)

Re-sequencing after deletes now touches only a ~50 byte/row thin table
instead of rewriting full ~27KB rows. Benchmarked at 8ms for 20K rows
vs 210ms (v1) / 940ms (v0).

- METADATA retains small filterable columns (_subset_, file, name, etc.)
- METADATA_CONTENT holds large TEXT columns (code, signature, imports)
- Linked by stable monotonic _content_id_ (never re-sequenced)
- v0/v1 code paths preserved for backward compat reads
- FTS rebuild/update_rows adapted to JOIN for v2 layout
- INDEX_FORMAT_VERSION bumped to 2 (triggers full rebuild on upgrade)
- Removed v0->v1 in-place migration; format mismatches now full-rebuild

Development

Successfully merging this pull request may close these issues.

perf(filtering): split METADATA into thin index + fat content table
