fix(colgrep): eliminate unnecessary re-indexing on every search (integrates #134) #137

Merged
raphaelsty merged 1 commit into main from fix/incremental-indexing-sync on Jun 17, 2026

Conversation

@raphaelsty

Collaborator

Summary

Integrates #134 (by @vlasky) — which eliminates the unnecessary re-indexing that made every search re-hash every tracked file on large projects — with two correctness fixes and dedicated tests. Targets main.

From #134 (kept)

  • Skip redundant canonicalize() in scan_files for paths without '..'. The walker uses follow_links(false) and filters on entry.file_type().is_file(), so a symlink can't escape the root or even reach this check; the per-file realpath was dead weight.
  • Migrate index format 0 → 1 in place (stat to backfill size, upgrade legacy second-precision mtimes to nanoseconds) instead of discarding the index and re-embedding everything. The on-disk layout did not change across 1.5.4→1.5.5, so the rebuild was pure waste (minutes on large repos).
  • Refresh file stats after worktree seeding so git checkout's "now" mtimes don't defeat the fast path on a new worktree's first search.
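
As a rough illustration of the in-place migration idea, here is a minimal sketch. The `IndexState`, `FileEntry`, `stat_ns`, and `migrate_v0_to_v1` names and the field layout are hypothetical and not taken from the colgrep source. The sketch also assumes that a format-0 entry stores its mtime as whole seconds, and that an entry is only upgraded when the file on disk still falls in that same second.

```rust
use std::collections::HashMap;
use std::fs;
use std::path::{Path, PathBuf};
use std::time::UNIX_EPOCH;

/// Hypothetical per-file record (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct FileEntry {
    /// Whole seconds in format 0, nanoseconds in format 1.
    pub mtime_ns: u128,
    /// 0 in format 0, because the size was not recorded yet.
    pub size: u64,
    pub content_hash: u64,
}

/// Hypothetical tracked state for the whole index.
#[derive(Debug)]
pub struct IndexState {
    pub format: u32,
    pub files: HashMap<PathBuf, FileEntry>,
}

/// Returns (mtime in nanoseconds since the epoch, size in bytes),
/// or None when the file cannot be stat'ed.
pub fn stat_ns(path: &Path) -> Option<(u128, u64)> {
    let meta = fs::metadata(path).ok()?;
    let mtime = meta.modified().ok()?.duration_since(UNIX_EPOCH).ok()?;
    Some((mtime.as_nanos(), meta.len()))
}

/// Upgrades a format-0 state to format 1 without touching embeddings.
pub fn migrate_v0_to_v1(state: &mut IndexState) {
    if state.format != 0 {
        return;
    }
    for (path, entry) in state.files.iter_mut() {
        let legacy_secs = entry.mtime_ns;
        match stat_ns(path) {
            // Same second on disk: adopt the precise mtime and backfill the size.
            Some((ns, size)) if ns / 1_000_000_000 == legacy_secs => {
                entry.mtime_ns = ns;
                entry.size = size;
            }
            // Changed or missing: only convert the unit. The entry stays in
            // the state so the normal update path can rehash or delete it.
            _ => entry.mtime_ns = legacy_secs * 1_000_000_000,
        }
    }
    state.format = 1;
}
```

Entries whose file can no longer be stat'ed are deliberately left in place here, which matches the second correction described below.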

Corrections to #134

  • Dropped the size == 0 "mtime-only" fast-path tolerance. It reopened the same-second-edit blind spot: an edit that lands in the same second as the recorded mtime, which is exactly the case the size guard was added to catch. The fast path now requires a strict mtime and size match; legacy size==0 entries are hashed once (correct) and then backfilled by the migration / touched-persist.
  • Stopped purging stat-missing entries from state during migration. Leaving them lets the normal incremental-delete path remove them from every store (vector + metadata + FTS5) deterministically, rather than dropping them from tracking and relying on periodic orphan cleanup.
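
The strict fast-path rule from the first correction can be sketched as follows. `Stat`, `Check`, and `fast_path` are hypothetical names chosen for illustration, not the identifiers used in colgrep.

```rust
/// Hypothetical stat snapshot for one file (names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Stat {
    pub mtime_ns: u128,
    pub size: u64,
}

/// Outcome of the cheap stat comparison.
#[derive(Debug, PartialEq, Eq)]
pub enum Check {
    /// Both mtime and size match: skip hashing.
    Unchanged,
    /// Anything else: fall back to hashing the file content.
    NeedsHash,
}

/// Strict fast path: both fields must match exactly. There is no special
/// case for a recorded size of 0, so a legacy entry for a non-empty file
/// is hashed once instead of being trusted on mtime alone.
pub fn fast_path(recorded: Stat, on_disk: Stat) -> Check {
    if recorded.mtime_ns == on_disk.mtime_ns && recorded.size == on_disk.size {
        Check::Unchanged
    } else {
        Check::NeedsHash
    }
}
```

With this rule, an edit that keeps the recorded mtime but changes the length is sent to hashing rather than skipped. An edit that preserves both mtime and size would still slip through any stat-only check.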

Tests (new)

  • test_index_stays_synced_with_disk — the core invariant: a file deleted on disk is detected by the update plan, purged from every store (no longer keyword-retrievable), survivors remain, and a content change is detected as changed.
  • test_format_migration_in_place_cleans_deleted_without_rebuild — proves the migration reuses the legacy index (no model in the test ⇒ a full rebuild would fail to embed) and ends in sync with disk.
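
The invariant these tests exercise can be pictured as a diff between the tracked state and what is on disk. The sketch below is a simplified model under assumed names (`UpdatePlan`, `build_plan`), with paths as strings and content reduced to a hash. It is not the colgrep update plan itself.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical update plan: which paths to embed, re-embed, or purge.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct UpdatePlan {
    pub added: BTreeSet<String>,
    pub changed: BTreeSet<String>,
    pub deleted: BTreeSet<String>,
}

/// Both maps go from path to content hash. `tracked` is what the index
/// believes, `on_disk` is what a fresh scan found.
pub fn build_plan(
    tracked: &BTreeMap<String, u64>,
    on_disk: &BTreeMap<String, u64>,
) -> UpdatePlan {
    let mut plan = UpdatePlan::default();
    for (path, hash) in on_disk {
        match tracked.get(path) {
            // Present on disk but never indexed.
            None => {
                plan.added.insert(path.clone());
            }
            // Indexed, but the content hash differs.
            Some(old) if old != hash => {
                plan.changed.insert(path.clone());
            }
            // Indexed and identical: nothing to do.
            Some(_) => {}
        }
    }
    // Tracked but gone from disk: must be removed from every store.
    for path in tracked.keys() {
        if !on_disk.contains_key(path) {
            plan.deleted.insert(path.clone());
        }
    }
    plan
}
```

In this model a deleted file only shows up in the plan if it is still tracked, which is why keeping stat-missing entries in the state during migration matters.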

Validation

  • cargo clippy -p colgrep --all-targets — clean
  • cargo test -p colgrep — 557 passed

Co-authored with @vlasky (#134).

Integrates #134 (by @vlasky) — which fixes several issues that made every
search re-hash every tracked file on large projects — with corrections.

From #134 (kept):
- Skip redundant canonicalize() in scan_files for paths without '..' (the
  walker's follow_links(false) + is_file() filter already block symlink escape).
- Migrate index format 0 -> 1 in place (stat to backfill sizes, upgrade legacy
  second-precision mtimes to nanoseconds) instead of discarding and re-embedding
  the whole index — the on-disk layout did not change across 1.5.4 -> 1.5.5.
- Refresh file stats after worktree seeding so git-checkout's 'now' mtimes don't
  defeat the fast path on first search.

Corrections:
- Drop the size==0 'mtime-only' fast-path tolerance: it reopened the same-second
  edit blind spot. Require a strict mtime+size match; legacy size==0 entries are
  hashed once (correct) and then backfilled by the migration / touched-persist.
- Stop purging stat-missing entries from state during migration; leave them so
  the normal incremental-delete path removes them from every store (vector +
  metadata + FTS5) deterministically, instead of relying on periodic orphan cleanup.

Tests:
- test_index_stays_synced_with_disk: deleting files on disk is detected, purges
  them from every store (not keyword-retrievable), keeps survivors, and a content
  change is detected as changed.
- test_format_migration_in_place_cleans_deleted_without_rebuild: migration reuses
  the legacy index (no model => a rebuild would fail) and ends in sync with disk.

Validated: clippy clean; cargo test -p colgrep (557 passed).

Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai>
Co-authored-by: vlasky <vlad.lasky@energyone.com>
@raphaelsty raphaelsty merged commit e2906d9 into main Jun 17, 2026
30 of 31 checks passed
raphaelsty added a commit that referenced this pull request Jun 17, 2026
Co-authored-by: Raphael Sourty <raphael.sourty@lighton.ai>
Co-authored-by: Vlad Lasky <12727610+vlasky@users.noreply.github.com>