Skip to content

fix(colgrep): eliminate unnecessary re-indexing on every search#134

Closed
vlasky wants to merge 4 commits into
lightonai:mainfrom
vlasky:fix/incremental-indexing-perf
Closed

fix(colgrep): eliminate unnecessary re-indexing on every search#134
vlasky wants to merge 4 commits into
lightonai:mainfrom
vlasky:fix/incremental-indexing-perf

Conversation

@vlasky

@vlasky vlasky commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

ColGREP's incremental indexing has several compounding issues that cause every search to re-hash every tracked file, even when nothing has changed. In our case (a Bitbucket directory with ~8,400 files across many repos and git worktrees), every search was reading and hashing all 8,400 files before returning results.

These fixes are small and focused (one file changed) but have a significant impact on usability. We'd appreciate a timely review as the current behavior makes colgrep impractical for large project directories.

The problems

  1. Legacy states defeat the mtime fast path: States written by <= 1.5.4 have size: 0 (field didn't exist). The fast path requires info.size == current_size, which always fails when info.size == 0, causing every file to be content-hashed on every search.

  2. Format version 0 → 1 triggers a full rebuild: Upgrading to 1.5.5 nukes the entire index and re-embeds all files from scratch, even though the on-disk format didn't actually change. For large projects this takes many minutes.

  3. Legacy mtime precision mismatch: States written by <= 1.5.4 store mtimes in seconds, but the current fast path compares at nanosecond precision. The comparison always fails, defeating the fast path even after the format migration.

  4. Worktree seeding copies stale mtimes: When seeding a new worktree from a sibling, the source's mtimes are persisted verbatim. Since git checkout stamps files with "now", every mtime mismatches on first search, triggering a full content-hash pass.

  5. Redundant canonicalize() syscalls: is_within_project_root calls realpath(3) on every file during scan, despite follow_links(false) already preventing symlink escape. This adds ~2N unnecessary syscalls.

The fixes (one per commit)

  • perf(index): remove redundant canonicalize in scan_files: Skip canonicalize() when the relative path has no ..; fall back to full validation only for the rare case where it does.

  • fix(index): tolerate size == 0 in mtime fast path for legacy states: Treat size == 0 as "unknown (legacy)" and use mtime-only comparison. This immediately restores the fast path for all pre-1.5.5 states.

  • fix(index): migrate format version 0 → 1 in place instead of full rebuild: Bump the version in state.json without discarding the index. Future genuinely-incompatible format changes still trigger a full rebuild.

  • fix(index): refresh file stats after worktree seeding: After copying the source state, stat each tracked file in the current worktree to stamp correct local mtimes/sizes before persisting. Also enhances the format 0 → 1 migration to stat all entries: fills in missing sizes, upgrades second-precision mtimes to nanoseconds (when the file hasn't been modified since last indexing), and purges entries for files that no longer exist on disk.

Testing

  • All 612 existing unit tests pass.
  • Stress-tested against a 100-file git repo with worktrees: verified the fast path hits correctly, format migration upgrades mtimes/sizes and purges stale entries without rebuild, worktree seeding produces correct local mtimes, and modifications are still detected.
  • Applied to our production index (~8,400 files): search now returns results immediately instead of re-hashing everything.

Impact without these fixes

On a project with N tracked files, every single search performs N file reads + N hash computations before returning any results. With worktrees, this penalty is paid independently for each worktree on first use. On upgrade to 1.5.5, the entire index is additionally discarded and rebuilt from scratch.

Vlad Lasky added 3 commits June 17, 2026 09:57
The walker uses follow_links(false), so any path it produces without
".." is guaranteed to be within the project root. The per-file
canonicalize (two realpath syscalls per file) was added alongside
follow_links(false) in 6f94b42 as belt-and-suspenders, but provides
no safety benefit when symlinks are not followed. On large projects
this eliminates thousands of unnecessary syscalls per search.
States written before the size field was added deserialize with size: 0
(via #[serde(default)]). The strict equality check `info.size ==
current_size` always fails for these entries, defeating the fast path
and causing every file to be content-hashed on every search.

Treat size == 0 as "unknown" and fall back to mtime-only comparison.
While real files can have zero size (e.g. semaphore files), such files
have no content to index so a false-positive skip is harmless to
colgrep.

This restores the fast path for any index state written by <= 1.5.4,
eliminating O(n) hash_file calls per search when no files have changed.
…uild

The 0 → 1 format version transition added version tracking and the
FileInfo.size field but did not change the on-disk index layout (chunk
format, embedding pipeline, metadata schema). A full rebuild discards
all embeddings and re-processes every file, which is very expensive for
large projects.

Instead, bump the version in state.json and save. The incremental
update that follows will naturally fill in correct sizes via the
self-healing touched-file path. Future format versions that change the
actual index layout will still trigger a full rebuild.
@vlasky vlasky force-pushed the fix/incremental-indexing-perf branch from 32fe8a7 to 11c7ca4 Compare June 17, 2026 00:46
When seeding a new worktree's index from a sibling, the copied state
carries the source worktree's mtimes. Since git checkout stamps files
with the current time, every mtime in the copied state mismatches the
local files, causing the fast path to miss and triggering a full
content-hash pass on the first search.

After copying the source state, stat each tracked file in the current
worktree and update its mtime and size while preserving the content
hash (which remains valid as both worktrees share the same git
objects). This makes the fast path effective immediately after seeding.

Also enhances the format version 0 → 1 migration to stat all entries,
filling in missing sizes and purging entries for files that no longer
exist on disk.
@vlasky vlasky force-pushed the fix/incremental-indexing-perf branch from 11c7ca4 to 422ef1e Compare June 17, 2026 00:57
raphaelsty pushed a commit that referenced this pull request Jun 17, 2026
Integrates #134 (by @vlasky) — which fixes several issues that made every
search re-hash every tracked file on large projects — with corrections.

From #134 (kept):
- Skip redundant canonicalize() in scan_files for paths without '..' (the
  walker's follow_links(false) + is_file() filter already block symlink escape).
- Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy
  second-precision mtimes to nanoseconds) instead of discarding and re-embedding
  the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5.
- Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't
  defeat the fast path on first search.

Corrections:
- Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second
  edit blind spot. Require a strict mtime+size match; legacy size==0 entries are
  hashed once (correct) and then backfilled by the migration / touched-persist.
- Stop purging stat-missing entries from state during migration; leave them so
  the normal incremental-delete path removes them from every store (vector +
  metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup.

Tests:
- test_index_stays_synced_with_disk: deleting files on disk is detected, purges
  them from every store (not keyword-retrievable), keeps survivors, and a content
  change is detected as changed.
- test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses
  the legacy index (no model => a rebuild would fail) and ends in sync with disk.

Validated: clippy clean; cargo test -p colgrep (557 passed).

Co-authored-by: vlasky <vlasky@users.noreply.github.com>
Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai>
Co-authored-by: vlasky <vlad.lasky@energyone.com>
raphaelsty pushed a commit that referenced this pull request Jun 17, 2026
Integrates #134 (by @vlasky) — which fixes several issues that made every
search re-hash every tracked file on large projects — with corrections.

From #134 (kept):
- Skip redundant canonicalize() in scan_files for paths without '..' (the
  walker's follow_links(false) + is_file() filter already block symlink escape).
- Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy
  second-precision mtimes to nanoseconds) instead of discarding and re-embedding
  the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5.
- Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't
  defeat the fast path on first search.

Corrections:
- Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second
  edit blind spot. Require a strict mtime+size match; legacy size==0 entries are
  hashed once (correct) and then backfilled by the migration / touched-persist.
- Stop purging stat-missing entries from state during migration; leave them so
  the normal incremental-delete path removes them from every store (vector +
  metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup.

Tests:
- test_index_stays_synced_with_disk: deleting files on disk is detected, purges
  them from every store (not keyword-retrievable), keeps survivors, and a content
  change is detected as changed.
- test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses
  the legacy index (no model => a rebuild would fail) and ends in sync with disk.

Validated: clippy clean; cargo test -p colgrep (557 passed).

Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai>
Co-authored-by: vlasky <vlad.lasky@energyone.com>
raphaelsty added a commit that referenced this pull request Jun 17, 2026
Integrates #134 (by @vlasky) — which fixes several issues that made every
search re-hash every tracked file on large projects — with corrections.

From #134 (kept):
- Skip redundant canonicalize() in scan_files for paths without '..' (the
  walker's follow_links(false) + is_file() filter already block symlink escape).
- Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy
  second-precision mtimes to nanoseconds) instead of discarding and re-embedding
  the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5.
- Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't
  defeat the fast path on first search.

Corrections:
- Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second
  edit blind spot. Require a strict mtime+size match; legacy size==0 entries are
  hashed once (correct) and then backfilled by the migration / touched-persist.
- Stop purging stat-missing entries from state during migration; leave them so
  the normal incremental-delete path removes them from every store (vector +
  metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup.

Tests:
- test_index_stays_synced_with_disk: deleting files on disk is detected, purges
  them from every store (not keyword-retrievable), keeps survivors, and a content
  change is detected as changed.
- test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses
  the legacy index (no model => a rebuild would fail) and ends in sync with disk.

Validated: clippy clean; cargo test -p colgrep (557 passed).

Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai>
Co-authored-by: vlasky <vlad.lasky@energyone.com>
@raphaelsty

Copy link
Copy Markdown
Collaborator

Hi @vlasky, thank you for this MR, I merged it within #137 and apply few updates

@raphaelsty raphaelsty closed this Jun 17, 2026
raphaelsty added a commit that referenced this pull request Jun 17, 2026
Integrates #134 (by @vlasky) — which fixes several issues that made every
search re-hash every tracked file on large projects — with corrections.

From #134 (kept):
- Skip redundant canonicalize() in scan_files for paths without '..' (the
  walker's follow_links(false) + is_file() filter already block symlink escape).
- Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy
  second-precision mtimes to nanoseconds) instead of discarding and re-embedding
  the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5.
- Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't
  defeat the fast path on first search.

Corrections:
- Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second
  edit blind spot. Require a strict mtime+size match; legacy size==0 entries are
  hashed once (correct) and then backfilled by the migration / touched-persist.
- Stop purging stat-missing entries from state during migration; leave them so
  the normal incremental-delete path removes them from every store (vector +
  metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup.

Tests:
- test_index_stays_synced_with_disk: deleting files on disk is detected, purges
  them from every store (not keyword-retrievable), keeps survivors, and a content
  change is detected as changed.
- test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses
  the legacy index (no model => a rebuild would fail) and ends in sync with disk.

Validated: clippy clean; cargo test -p colgrep (557 passed).

Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai>
Co-authored-by: Vlad Lasky <12727610+vlasky@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants