fix(colgrep): eliminate unnecessary re-indexing on every search by vlasky · Pull Request #134 · lightonai/next-plaid

vlasky · 2026-06-17T00:24:43Z

Summary

ColGREP's incremental indexing has several compounding issues that cause every search to re-hash every tracked file, even when nothing has changed. In our case (a Bitbucket directory with ~8,400 files across many repos and git worktrees), every search was reading and hashing all 8,400 files before returning results.

These fixes are small and focused (one file changed) but have a significant impact on usability. We'd appreciate a timely review as the current behavior makes colgrep impractical for large project directories.

The problems

Legacy states defeat the mtime fast path: States written by <= 1.5.4 have size: 0 (field didn't exist). The fast path requires info.size == current_size, which always fails when info.size == 0, causing every file to be content-hashed on every search.
Format version 0 → 1 triggers a full rebuild: Upgrading to 1.5.5 nukes the entire index and re-embeds all files from scratch, even though the on-disk format didn't actually change. For large projects this takes many minutes.
Legacy mtime precision mismatch: States written by <= 1.5.4 store mtimes in seconds, but the current fast path compares at nanosecond precision. The comparison always fails, defeating the fast path even after the format migration.
Worktree seeding copies stale mtimes: When seeding a new worktree from a sibling, the source's mtimes are persisted verbatim. Since git checkout stamps files with "now", every mtime mismatches on first search, triggering a full content-hash pass.
Redundant canonicalize() syscalls: is_within_project_root calls realpath(3) on every file during scan, despite follow_links(false) already preventing symlink escape. This adds ~2N unnecessary syscalls.

The fixes (one per commit)

perf(index): remove redundant canonicalize in scan_files: Skip canonicalize() when the relative path has no ..; fall back to full validation only for the rare case where it does.
fix(index): tolerate size == 0 in mtime fast path for legacy states: Treat size == 0 as "unknown (legacy)" and use mtime-only comparison. This immediately restores the fast path for all pre-1.5.5 states.
fix(index): migrate format version 0 → 1 in place instead of full rebuild: Bump the version in state.json without discarding the index. Future genuinely-incompatible format changes still trigger a full rebuild.
fix(index): refresh file stats after worktree seeding: After copying the source state, stat each tracked file in the current worktree to stamp correct local mtimes/sizes before persisting. Also enhances the format 0 → 1 migration to stat all entries: fills in missing sizes, upgrades second-precision mtimes to nanoseconds (when the file hasn't been modified since last indexing), and purges entries for files that no longer exist on disk.

Testing

All 612 existing unit tests pass.
Stress-tested against a 100-file git repo with worktrees: verified the fast path hits correctly, format migration upgrades mtimes/sizes and purges stale entries without rebuild, worktree seeding produces correct local mtimes, and modifications are still detected.
Applied to our production index (~8,400 files): search now returns results immediately instead of re-hashing everything.

Impact without these fixes

On a project with N tracked files, every single search performs N file reads + N hash computations before returning any results. With worktrees, this penalty is paid independently for each worktree on first use. On upgrade to 1.5.5, the entire index is additionally discarded and rebuilt from scratch.

The walker uses follow_links(false), so any path it produces without ".." is guaranteed to be within the project root. The per-file canonicalize (two realpath syscalls per file) was added alongside follow_links(false) in 6f94b42 as belt-and-suspenders, but provides no safety benefit when symlinks are not followed. On large projects this eliminates thousands of unnecessary syscalls per search.

States written before the size field was added deserialize with size: 0 (via #[serde(default)]). The strict equality check `info.size == current_size` always fails for these entries, defeating the fast path and causing every file to be content-hashed on every search. Treat size == 0 as "unknown" and fall back to mtime-only comparison. While real files can have zero size (e.g. semaphore files), such files have no content to index so a false-positive skip is harmless to colgrep. This restores the fast path for any index state written by <= 1.5.4, eliminating O(n) hash_file calls per search when no files have changed.

…uild The 0 → 1 format version transition added version tracking and the FileInfo.size field but did not change the on-disk index layout (chunk format, embedding pipeline, metadata schema). A full rebuild discards all embeddings and re-processes every file, which is very expensive for large projects. Instead, bump the version in state.json and save. The incremental update that follows will naturally fill in correct sizes via the self-healing touched-file path. Future format versions that change the actual index layout will still trigger a full rebuild.

When seeding a new worktree's index from a sibling, the copied state carries the source worktree's mtimes. Since git checkout stamps files with the current time, every mtime in the copied state mismatches the local files, causing the fast path to miss and triggering a full content-hash pass on the first search. After copying the source state, stat each tracked file in the current worktree and update its mtime and size while preserving the content hash (which remains valid as both worktrees share the same git objects). This makes the fast path effective immediately after seeding. Also enhances the format version 0 → 1 migration to stat all entries, filling in missing sizes and purging entries for files that no longer exist on disk.

@vlasky

Integrates #134 (by @vlasky) — which fixes several issues that made every search re-hash every tracked file on large projects — with corrections. From #134 (kept): - Skip redundant canonicalize() in scan_files for paths without '..' (the walker's follow_links(false) + is_file() filter already block symlink escape). - Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy second-precision mtimes to nanoseconds) instead of discarding and re-embedding the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5. - Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't defeat the fast path on first search. Corrections: - Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second edit blind spot. Require a strict mtime+size match; legacy size==0 entries are hashed once (correct) and then backfilled by the migration / touched-persist. - Stop purging stat-missing entries from state during migration; leave them so the normal incremental-delete path removes them from every store (vector + metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup. Tests: - test_index_stays_synced_with_disk: deleting files on disk is detected, purges them from every store (not keyword-retrievable), keeps survivors, and a content change is detected as changed. - test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses the legacy index (no model => a rebuild would fail) and ends in sync with disk. Validated: clippy clean; cargo test -p colgrep (557 passed). Co-authored-by: vlasky <vlasky@users.noreply.github.com> Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai> Co-authored-by: vlasky <vlad.lasky@energyone.com>

@vlasky

Integrates #134 (by @vlasky) — which fixes several issues that made every search re-hash every tracked file on large projects — with corrections. From #134 (kept): - Skip redundant canonicalize() in scan_files for paths without '..' (the walker's follow_links(false) + is_file() filter already block symlink escape). - Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy second-precision mtimes to nanoseconds) instead of discarding and re-embedding the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5. - Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't defeat the fast path on first search. Corrections: - Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second edit blind spot. Require a strict mtime+size match; legacy size==0 entries are hashed once (correct) and then backfilled by the migration / touched-persist. - Stop purging stat-missing entries from state during migration; leave them so the normal incremental-delete path removes them from every store (vector + metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup. Tests: - test_index_stays_synced_with_disk: deleting files on disk is detected, purges them from every store (not keyword-retrievable), keeps survivors, and a content change is detected as changed. - test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses the legacy index (no model => a rebuild would fail) and ends in sync with disk. Validated: clippy clean; cargo test -p colgrep (557 passed). Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai> Co-authored-by: vlasky <vlad.lasky@energyone.com>

@vlasky

Integrates #134 (by @vlasky) — which fixes several issues that made every search re-hash every tracked file on large projects — with corrections. From #134 (kept): - Skip redundant canonicalize() in scan_files for paths without '..' (the walker's follow_links(false) + is_file() filter already block symlink escape). - Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy second-precision mtimes to nanoseconds) instead of discarding and re-embedding the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5. - Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't defeat the fast path on first search. Corrections: - Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second edit blind spot. Require a strict mtime+size match; legacy size==0 entries are hashed once (correct) and then backfilled by the migration / touched-persist. - Stop purging stat-missing entries from state during migration; leave them so the normal incremental-delete path removes them from every store (vector + metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup. Tests: - test_index_stays_synced_with_disk: deleting files on disk is detected, purges them from every store (not keyword-retrievable), keeps survivors, and a content change is detected as changed. - test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses the legacy index (no model => a rebuild would fail) and ends in sync with disk. Validated: clippy clean; cargo test -p colgrep (557 passed). Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai> Co-authored-by: vlasky <vlad.lasky@energyone.com>

raphaelsty · 2026-06-17T09:50:40Z

Hi @vlasky, thank you for this MR, I merged it within #137 and apply few updates

@vlasky

Integrates #134 (by @vlasky) — which fixes several issues that made every search re-hash every tracked file on large projects — with corrections. From #134 (kept): - Skip redundant canonicalize() in scan_files for paths without '..' (the walker's follow_links(false) + is_file() filter already block symlink escape). - Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy second-precision mtimes to nanoseconds) instead of discarding and re-embedding the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5. - Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't defeat the fast path on first search. Corrections: - Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second edit blind spot. Require a strict mtime+size match; legacy size==0 entries are hashed once (correct) and then backfilled by the migration / touched-persist. - Stop purging stat-missing entries from state during migration; leave them so the normal incremental-delete path removes them from every store (vector + metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup. Tests: - test_index_stays_synced_with_disk: deleting files on disk is detected, purges them from every store (not keyword-retrievable), keeps survivors, and a content change is detected as changed. - test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses the legacy index (no model => a rebuild would fail) and ends in sync with disk. Validated: clippy clean; cargo test -p colgrep (557 passed). Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai> Co-authored-by: Vlad Lasky <12727610+vlasky@users.noreply.github.com>

Vlad Lasky added 3 commits June 17, 2026 09:57

vlasky force-pushed the fix/incremental-indexing-perf branch from 32fe8a7 to 11c7ca4 Compare June 17, 2026 00:46

vlasky force-pushed the fix/incremental-indexing-perf branch from 11c7ca4 to 422ef1e Compare June 17, 2026 00:57

raphaelsty mentioned this pull request Jun 17, 2026

fix(colgrep): eliminate unnecessary re-indexing on every search (integrates #134) #137

Merged

raphaelsty closed this Jun 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(colgrep): eliminate unnecessary re-indexing on every search#134

fix(colgrep): eliminate unnecessary re-indexing on every search#134
vlasky wants to merge 4 commits into
lightonai:mainfrom
vlasky:fix/incremental-indexing-perf

vlasky commented Jun 17, 2026 •

edited

Loading

Uh oh!

raphaelsty commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vlasky commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The problems

The fixes (one per commit)

Testing

Impact without these fixes

Uh oh!

raphaelsty commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vlasky commented Jun 17, 2026 •

edited

Loading