fix(colgrep): eliminate unnecessary re-indexing on every search#134
Closed
vlasky wants to merge 4 commits into
Closed
fix(colgrep): eliminate unnecessary re-indexing on every search#134vlasky wants to merge 4 commits into
vlasky wants to merge 4 commits into
Conversation
added 3 commits
June 17, 2026 09:57
The walker uses follow_links(false), so any path it produces without ".." is guaranteed to be within the project root. The per-file canonicalize (two realpath syscalls per file) was added alongside follow_links(false) in 6f94b42 as belt-and-suspenders, but provides no safety benefit when symlinks are not followed. On large projects this eliminates thousands of unnecessary syscalls per search.
States written before the size field was added deserialize with size: 0 (via #[serde(default)]). The strict equality check `info.size == current_size` always fails for these entries, defeating the fast path and causing every file to be content-hashed on every search. Treat size == 0 as "unknown" and fall back to mtime-only comparison. While real files can have zero size (e.g. semaphore files), such files have no content to index so a false-positive skip is harmless to colgrep. This restores the fast path for any index state written by <= 1.5.4, eliminating O(n) hash_file calls per search when no files have changed.
…uild The 0 → 1 format version transition added version tracking and the FileInfo.size field but did not change the on-disk index layout (chunk format, embedding pipeline, metadata schema). A full rebuild discards all embeddings and re-processes every file, which is very expensive for large projects. Instead, bump the version in state.json and save. The incremental update that follows will naturally fill in correct sizes via the self-healing touched-file path. Future format versions that change the actual index layout will still trigger a full rebuild.
32fe8a7 to
11c7ca4
Compare
When seeding a new worktree's index from a sibling, the copied state carries the source worktree's mtimes. Since git checkout stamps files with the current time, every mtime in the copied state mismatches the local files, causing the fast path to miss and triggering a full content-hash pass on the first search. After copying the source state, stat each tracked file in the current worktree and update its mtime and size while preserving the content hash (which remains valid as both worktrees share the same git objects). This makes the fast path effective immediately after seeding. Also enhances the format version 0 → 1 migration to stat all entries, filling in missing sizes and purging entries for files that no longer exist on disk.
11c7ca4 to
422ef1e
Compare
raphaelsty
pushed a commit
that referenced
this pull request
Jun 17, 2026
Integrates #134 (by @vlasky) — which fixes several issues that made every search re-hash every tracked file on large projects — with corrections. From #134 (kept): - Skip redundant canonicalize() in scan_files for paths without '..' (the walker's follow_links(false) + is_file() filter already block symlink escape). - Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy second-precision mtimes to nanoseconds) instead of discarding and re-embedding the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5. - Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't defeat the fast path on first search. Corrections: - Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second edit blind spot. Require a strict mtime+size match; legacy size==0 entries are hashed once (correct) and then backfilled by the migration / touched-persist. - Stop purging stat-missing entries from state during migration; leave them so the normal incremental-delete path removes them from every store (vector + metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup. Tests: - test_index_stays_synced_with_disk: deleting files on disk is detected, purges them from every store (not keyword-retrievable), keeps survivors, and a content change is detected as changed. - test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses the legacy index (no model => a rebuild would fail) and ends in sync with disk. Validated: clippy clean; cargo test -p colgrep (557 passed). Co-authored-by: vlasky <vlasky@users.noreply.github.com> Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai> Co-authored-by: vlasky <vlad.lasky@energyone.com>
raphaelsty
pushed a commit
that referenced
this pull request
Jun 17, 2026
Integrates #134 (by @vlasky) — which fixes several issues that made every search re-hash every tracked file on large projects — with corrections. From #134 (kept): - Skip redundant canonicalize() in scan_files for paths without '..' (the walker's follow_links(false) + is_file() filter already block symlink escape). - Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy second-precision mtimes to nanoseconds) instead of discarding and re-embedding the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5. - Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't defeat the fast path on first search. Corrections: - Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second edit blind spot. Require a strict mtime+size match; legacy size==0 entries are hashed once (correct) and then backfilled by the migration / touched-persist. - Stop purging stat-missing entries from state during migration; leave them so the normal incremental-delete path removes them from every store (vector + metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup. Tests: - test_index_stays_synced_with_disk: deleting files on disk is detected, purges them from every store (not keyword-retrievable), keeps survivors, and a content change is detected as changed. - test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses the legacy index (no model => a rebuild would fail) and ends in sync with disk. Validated: clippy clean; cargo test -p colgrep (557 passed). Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai> Co-authored-by: vlasky <vlad.lasky@energyone.com>
raphaelsty
added a commit
that referenced
this pull request
Jun 17, 2026
Integrates #134 (by @vlasky) — which fixes several issues that made every search re-hash every tracked file on large projects — with corrections. From #134 (kept): - Skip redundant canonicalize() in scan_files for paths without '..' (the walker's follow_links(false) + is_file() filter already block symlink escape). - Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy second-precision mtimes to nanoseconds) instead of discarding and re-embedding the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5. - Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't defeat the fast path on first search. Corrections: - Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second edit blind spot. Require a strict mtime+size match; legacy size==0 entries are hashed once (correct) and then backfilled by the migration / touched-persist. - Stop purging stat-missing entries from state during migration; leave them so the normal incremental-delete path removes them from every store (vector + metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup. Tests: - test_index_stays_synced_with_disk: deleting files on disk is detected, purges them from every store (not keyword-retrievable), keeps survivors, and a content change is detected as changed. - test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses the legacy index (no model => a rebuild would fail) and ends in sync with disk. Validated: clippy clean; cargo test -p colgrep (557 passed). Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai> Co-authored-by: vlasky <vlad.lasky@energyone.com>
Collaborator
raphaelsty
added a commit
that referenced
this pull request
Jun 17, 2026
Integrates #134 (by @vlasky) — which fixes several issues that made every search re-hash every tracked file on large projects — with corrections. From #134 (kept): - Skip redundant canonicalize() in scan_files for paths without '..' (the walker's follow_links(false) + is_file() filter already block symlink escape). - Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy second-precision mtimes to nanoseconds) instead of discarding and re-embedding the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5. - Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't defeat the fast path on first search. Corrections: - Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second edit blind spot. Require a strict mtime+size match; legacy size==0 entries are hashed once (correct) and then backfilled by the migration / touched-persist. - Stop purging stat-missing entries from state during migration; leave them so the normal incremental-delete path removes them from every store (vector + metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup. Tests: - test_index_stays_synced_with_disk: deleting files on disk is detected, purges them from every store (not keyword-retrievable), keeps survivors, and a content change is detected as changed. - test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses the legacy index (no model => a rebuild would fail) and ends in sync with disk. Validated: clippy clean; cargo test -p colgrep (557 passed). Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai> Co-authored-by: Vlad Lasky <12727610+vlasky@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ColGREP's incremental indexing has several compounding issues that cause every search to re-hash every tracked file, even when nothing has changed. In our case (a Bitbucket directory with ~8,400 files across many repos and git worktrees), every search was reading and hashing all 8,400 files before returning results.
These fixes are small and focused (one file changed) but have a significant impact on usability. We'd appreciate a timely review as the current behavior makes colgrep impractical for large project directories.
The problems
Legacy states defeat the mtime fast path: States written by <= 1.5.4 have
size: 0(field didn't exist). The fast path requiresinfo.size == current_size, which always fails wheninfo.size == 0, causing every file to be content-hashed on every search.Format version 0 → 1 triggers a full rebuild: Upgrading to 1.5.5 nukes the entire index and re-embeds all files from scratch, even though the on-disk format didn't actually change. For large projects this takes many minutes.
Legacy mtime precision mismatch: States written by <= 1.5.4 store mtimes in seconds, but the current fast path compares at nanosecond precision. The comparison always fails, defeating the fast path even after the format migration.
Worktree seeding copies stale mtimes: When seeding a new worktree from a sibling, the source's mtimes are persisted verbatim. Since git checkout stamps files with "now", every mtime mismatches on first search, triggering a full content-hash pass.
Redundant
canonicalize()syscalls:is_within_project_rootcallsrealpath(3)on every file during scan, despitefollow_links(false)already preventing symlink escape. This adds ~2N unnecessary syscalls.The fixes (one per commit)
perf(index): remove redundant canonicalize in scan_files: Skipcanonicalize()when the relative path has no..; fall back to full validation only for the rare case where it does.fix(index): tolerate size == 0 in mtime fast path for legacy states: Treatsize == 0as "unknown (legacy)" and use mtime-only comparison. This immediately restores the fast path for all pre-1.5.5 states.fix(index): migrate format version 0 → 1 in place instead of full rebuild: Bump the version in state.json without discarding the index. Future genuinely-incompatible format changes still trigger a full rebuild.fix(index): refresh file stats after worktree seeding: After copying the source state, stat each tracked file in the current worktree to stamp correct local mtimes/sizes before persisting. Also enhances the format 0 → 1 migration to stat all entries: fills in missing sizes, upgrades second-precision mtimes to nanoseconds (when the file hasn't been modified since last indexing), and purges entries for files that no longer exist on disk.Testing
Impact without these fixes
On a project with N tracked files, every single search performs N file reads + N hash computations before returning any results. With worktrees, this penalty is paid independently for each worktree on first use. On upgrade to 1.5.5, the entire index is additionally discarded and rebuilt from scratch.