Summary
Several scanners produce non-deterministic chunk_ids. Identical input produces a different chunk_id on every scan, which bypasses the content-sha dedup in IngestDb::content_already_ingested (pipeline/src/lib.rs:132) and forces re-embedding on every run.
Affected scanners in my corpus (reported counts are stable across back-to-back scans of unchanged input):
| Source | chunks | upserts per scan |
|---|---|---|
| ostk-site (markdown) | 124 | 59 |
| fcp (markdown) | 601 | 8 |
| molva | 790 | 9 |
| needle-bench | 259 | 17 |
| samizdat-mesh | 2301 | 22 |
| mish (docs) | 219 | 27 |
Truly idempotent scanners (0 upserts on second scan): osteak, ostk.ai.discord, reference, taynik, fcp (code), mish (code). So the pattern isn't universal; something specific to the affected emitters is at fault.
Repro
Back-to-back ostk-recall scan on the same corpus, nothing touched in between. Compare scan summary per-source upserted counts. Deterministic scanners go to 0; broken ones report an identical non-zero count every time.
Evidence
Run 1 (cold):
total: items=6556 chunks=466170 upserted=45510 dup=420552
Run 2 (immediately after, no content change):
total: items=6558 chunks=466450 upserted=14533 dup=451873
The residual 14533 upserts on run 2 decompose to:
- ~14,300 legitimate new content (haystack audit rows landing live + 2 new claude-code-history session files)
- ~230 from the broken scanners listed above, which re-upsert the same chunks
Why it matters
- Every scan re-embeds content that hasn't changed → wasted CPU/GPU time
- Scan wall time: ~60 min cold → ~48 min warm (20% faster, not the ~5% you'd expect from pure dedup)
- Blocks any sensible scheduling (launchd / watcher / post-commit hook) because "idempotent re-run" isn't.
Likely cause
The affected scanners' chunk_id computation probably mixes in something ephemeral (wall-clock timestamp, scan-session uuid, monotonic counter, or a path component that includes a run-scoped dir). The content-hash sha256 is stable, but the chunk_id changes → the AND in content_already_ingested(chunk_id, sha256) fails → cache miss → re-embed.
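To make the suspected failure mode concrete, here is a minimal sketch of how a run-scoped value in the chunk_id defeats a lookup keyed on both chunk_id and sha256. Everything in it is an assumption for illustration: DedupIndex, ephemeral_chunk_id, scan and the run_id parameter are hypothetical names, and the real IngestDb is not reproduced here.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for the dedup table (not the real IngestDb).
#[derive(Default)]
struct DedupIndex {
    seen: HashSet<(String, String)>, // (chunk_id, sha256)
}

impl DedupIndex {
    /// Models the AND described above: chunk_id and sha256 must both match.
    fn content_already_ingested(&self, chunk_id: &str, sha256: &str) -> bool {
        self.seen.contains(&(chunk_id.to_owned(), sha256.to_owned()))
    }

    fn record(&mut self, chunk_id: &str, sha256: &str) {
        self.seen.insert((chunk_id.to_owned(), sha256.to_owned()));
    }
}

/// Assumed bug: the id mixes in a value that changes on every scan.
fn ephemeral_chunk_id(source: &str, chunk_index: usize, run_id: u64) -> String {
    format!("{source}:{chunk_index}:{run_id}")
}

/// Returns how many chunks miss the cache (and would be re-embedded) in this run.
fn scan(index: &mut DedupIndex, source: &str, shas: &[&str], run_id: u64) -> usize {
    let mut upserted = 0;
    for (i, sha) in shas.iter().enumerate() {
        let id = ephemeral_chunk_id(source, i, run_id);
        if !index.content_already_ingested(&id, sha) {
            index.record(&id, sha);
            upserted += 1;
        }
    }
    upserted
}
```

With identical sha256 values, scanning under run_id 1 and then run_id 2 upserts every chunk both times; only a repeat under the same run_id hits the cache.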
Fix direction
chunk_id should be a deterministic function of (source, source_id, chunk_index, content_sha256) or similar purely-content fields. Audit each scanner's emit path for any non-determinism.
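One possible shape for such a function, as a sketch only: stable_chunk_id is a hypothetical helper, and FNV-1a is used purely as a dependency-free stand-in for whatever digest the pipeline actually uses. The point is that the id depends on nothing but the four fields named above.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a over `bytes`, continuing from `state`.
/// Stand-in only; a real implementation would likely use a cryptographic digest.
fn fnv1a64(bytes: &[u8], mut state: u64) -> u64 {
    for b in bytes {
        state ^= u64::from(*b);
        state = state.wrapping_mul(FNV_PRIME);
    }
    state
}

/// Hypothetical deterministic id: a pure function of its four arguments,
/// with no clock, uuid, counter, or run-scoped path involved.
fn stable_chunk_id(
    source: &str,
    source_id: &str,
    chunk_index: u32,
    content_sha256: &str,
) -> String {
    let mut h = FNV_OFFSET;
    for field in [source, source_id, content_sha256] {
        // Length prefix so field boundaries are unambiguous in the hashed stream.
        h = fnv1a64(&(field.len() as u64).to_le_bytes(), h);
        h = fnv1a64(field.as_bytes(), h);
    }
    h = fnv1a64(&chunk_index.to_le_bytes(), h);
    format!("{h:016x}")
}
```

If source_id is derived from a file path, it would need to be relative to the corpus root so that a run-scoped directory cannot leak into the id.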
Test to add
A property test: run the ingest pipeline twice on a fixed fixture tree; assert that ingest_runs[-1].chunks_upserted == 0.
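A rough sketch of what that test could look like. The real pipeline entry point and the ingest_runs accessor are not shown in this issue, so run_ingest and the Chunk fixture type below are hypothetical in-memory stand-ins; the real test would call the actual pipeline against a fixture directory.

```rust
use std::collections::HashSet;

/// Hypothetical fixture chunk: (source, source_id, chunk_index, content_sha256).
type Chunk = (&'static str, &'static str, u32, &'static str);

/// Hypothetical stand-in for one ingest run; returns chunks_upserted.
fn run_ingest(seen: &mut HashSet<(String, String)>, fixture: &[Chunk]) -> usize {
    let mut upserted = 0;
    for (source, source_id, idx, sha) in fixture {
        // Deterministic id: depends only on the fixture fields.
        let chunk_id = format!("{source}/{source_id}#{idx}");
        // insert() returns true only for (chunk_id, sha256) pairs not seen before.
        if seen.insert((chunk_id, sha.to_string())) {
            upserted += 1;
        }
    }
    upserted
}

#[test]
fn rescan_of_unchanged_fixture_upserts_nothing() {
    let fixture: [Chunk; 3] = [
        ("fcp", "README.md", 0, "aa"),
        ("fcp", "README.md", 1, "bb"),
        ("mish", "docs/intro.md", 0, "cc"),
    ];
    let mut seen = HashSet::new();
    let cold = run_ingest(&mut seen, &fixture);
    let warm = run_ingest(&mut seen, &fixture);
    assert_eq!(cold, fixture.len());
    assert_eq!(warm, 0, "second scan of unchanged input must be a no-op");
}
```

Running the same assertion per source, rather than only on the total, would point directly at whichever scanner regresses.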