Skip to content

perf(filtering): split METADATA into thin index + fat content table #136

@vlasky

Description

@vlasky

Problem

The filtering::delete function re-sequences _subset_ primary keys after removing rows to maintain the dense 0-based ID invariant required by the PLAID vector index. On large indices (~90K rows, ~5GB database), this takes ~1s even for single-file updates.

The root cause is that _subset_ is the INTEGER PRIMARY KEY (i.e. the rowid). Changing it relocates the row in the B-tree, which requires reading and rewriting all ~27KB per row (including overflow pages holding large TEXT columns like code, signature, etc).

Benchmarks on a 20K-row table with 27KB rows, shifting ~19K rows:

Approach Time
_subset_ as PK (current) 0.94s
Autoincrement PK + indexed _subset_ 0.21s
Thin table (no fat columns) 0.008s

The thin table is ~115x faster than the current approach because SQLite never touches overflow pages.

Proposal

Split METADATA into two tables:

METADATA (thin, ~5-10MB for 90K rows):

_subset_ INTEGER PRIMARY KEY,      -- dense 0-based, re-sequenced on delete
_content_id_ INTEGER NOT NULL,     -- FK to METADATA_CONTENT (stable, monotonic)
file TEXT,
name TEXT,
qualified_name TEXT,
line INTEGER,
end_line INTEGER,
language TEXT,
unit_type TEXT,
complexity INTEGER,
has_loops INTEGER,
has_branches INTEGER,
has_error_handling INTEGER

METADATA_CONTENT (fat, ~2.5GB for 90K rows):

_content_id_ INTEGER PRIMARY KEY,  -- stable monotonic, never re-sequenced
code TEXT,
signature TEXT,
docstring TEXT,
parameters TEXT,
calls TEXT,
called_by TEXT,
variables TEXT,
imports TEXT,
return_type TEXT,
extends TEXT,
parent_class TEXT

Impact on public API

The public interface (create, update, delete, get, where_condition) stays the same. Internal changes only:

Function Change
create / update INSERT into both tables
delete Re-sequence thin table only (~8ms); cascade-delete orphaned content rows
get JOIN on _content_id_
where_condition No change (filters on thin columns only)
where_condition_regexp JOIN when pattern targets a content column
count / exists No change

Migration

Use PRAGMA user_version to track schema version:

  • v0: current single-table layout
  • v1: split layout

On first load with new code, detect v0 and run a one-time migration:

  1. CREATE METADATA_CONTENT, populate from existing fat columns
  2. Rebuild METADATA as thin table (CREATE new, INSERT, DROP old, RENAME)
  3. Set PRAGMA user_version = 1
  4. VACUUM

Pure row shuffling, no re-encoding needed. ~30-60s on a 5GB database. Transaction ensures no partial state on interruption.

Context

Complements PR #135 which addresses the other half of incremental delete performance (IVF patching). Together these would bring a small incremental update from ~60s to well under 1s on large indices.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions