scan: chunk_id not fully content-addressed — several scanners re-embed identical content #2

@scottmeyer

Summary

Several scanners produce non-deterministic chunk_ids. Identical input produces a different chunk_id on every scan, which bypasses the content-sha dedup in IngestDb::content_already_ingested (pipeline/src/lib.rs:132) and forces re-embedding on every run.

Affected scanners in my corpus (reported counts are stable across back-to-back scans of unchanged input):

Source                 chunks   upserts per scan
ostk-site (markdown)      124   59
fcp (markdown)            601    8
molva                     790    9
needle-bench              259   17
samizdat-mesh            2301   22
mish (docs)               219   27

Truly idempotent scanners (0 upserts on second scan): osteak, ostk.ai.discord, reference, taynik, fcp (code), mish (code). So the pattern isn't universal; something specific to the affected emitters is responsible.

Repro

Run ostk-recall scan twice back-to-back on the same corpus, with nothing touched in between, and compare the per-source upserted counts in the two scan summaries. Deterministic scanners drop to 0 on the second run; broken ones report the same non-zero count every time.

Evidence

Run 1 (cold):

total: items=6556 chunks=466170 upserted=45510 dup=420552

Run 2 (immediately after, no content change):

total: items=6558 chunks=466450 upserted=14533 dup=451873

The residual 14533 upserts on run 2 decompose to:

  • ~14,300 legitimate new content (haystack audit rows landing live + 2 new claude-code-history session files)
  • ~230 from the broken scanners listed above, which re-upsert the same chunks

Why it matters

  • Every scan re-embeds content that hasn't changed → wasted CPU/GPU time
  • Scan wall time: ~60 min cold → ~48 min warm (20% faster, not the ~5% you'd expect from pure dedup)
  • Blocks any sensible scheduling (launchd / watcher / post-commit hook) because "idempotent re-run" isn't.

Likely cause

The affected scanners' chunk_id computation probably mixes in something ephemeral (a wall-clock timestamp, a scan-session UUID, a monotonic counter, or a path component that includes a run-scoped directory). The content sha256 is stable, but the chunk_id changes → the AND in content_already_ingested(chunk_id, sha256) fails → cache miss → re-embed.
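
To make the suspected failure mode concrete, here is a minimal sketch of a dedup check that requires both keys to match. It is not the code in pipeline/src/lib.rs: SeenChunks, record and scan_once are hypothetical names, and the in-memory set only stands in for whatever IngestDb actually stores.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the ingest DB: remembers (chunk_id, sha256) pairs.
struct SeenChunks {
    pairs: HashSet<(String, String)>,
}

impl SeenChunks {
    fn new() -> Self {
        SeenChunks { pairs: HashSet::new() }
    }

    /// Hit only when BOTH the chunk_id and the content hash match a stored pair.
    fn content_already_ingested(&self, chunk_id: &str, sha256: &str) -> bool {
        self.pairs.contains(&(chunk_id.to_string(), sha256.to_string()))
    }

    /// Remember a pair after it has been embedded and upserted.
    fn record(&mut self, chunk_id: &str, sha256: &str) {
        self.pairs.insert((chunk_id.to_string(), sha256.to_string()));
    }
}

/// Simulates scanning one chunk; returns true if it had to be re-embedded.
fn scan_once(db: &mut SeenChunks, chunk_id: &str, sha256: &str) -> bool {
    if db.content_already_ingested(chunk_id, sha256) {
        return false; // cache hit: embedding is skipped
    }
    db.record(chunk_id, sha256);
    true // cache miss: chunk is embedded and upserted
}
```

With a stable chunk_id, a second scan_once call for the same pair returns false. If the id carries anything run-scoped, the same sha256 still misses, which would match the upsert counts reported above.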

Fix direction

chunk_id should be a deterministic function of (source, source_id, chunk_index, content_sha256), or of similar fields that depend only on the content and its stable location, never on the run. Audit each scanner's emit path for any non-determinism.
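
One possible shape for such a function, as a sketch only. It assumes the four fields named above are available at emit time. To stay std-only it hashes with a hand-rolled FNV-1a, which is a stand-in; a real implementation would more likely reuse the SHA-256 the pipeline already computes. The function names and the 16-hex-digit output are illustrative assumptions.

```rust
/// FNV-1a 64-bit; a std-only stand-in for a real digest such as SHA-256.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Builds a chunk_id from run-independent fields only: no timestamps,
/// session ids, counters, or run-scoped paths.
fn chunk_id(source: &str, source_id: &str, chunk_index: u32, content_sha256: &str) -> String {
    let mut buf = Vec::new();
    for field in &[source, source_id, content_sha256] {
        // Length-prefix each field so ("ab", "c") and ("a", "bc")
        // serialize to different byte strings before hashing.
        buf.extend_from_slice(&(field.len() as u64).to_le_bytes());
        buf.extend_from_slice(field.as_bytes());
    }
    buf.extend_from_slice(&chunk_index.to_le_bytes());
    format!("{:016x}", fnv1a64(&buf))
}
```

Calling chunk_id twice with the same arguments yields the same string in any process and on any run, which is the property the dedup check depends on.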

Test to add

A property test: run the ingest pipeline twice on a fixed fixture tree; assert that ingest_runs[-1].chunks_upserted == 0.
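
A rough sketch of what that test could look like, using a toy in-memory pipeline rather than the real one. IngestRun, run_ingest and the two-file fixture are hypothetical; the real test would drive the actual ingest entry point against a fixture directory.

```rust
use std::collections::HashSet;

/// Hypothetical summary row for one pipeline run.
struct IngestRun {
    chunks_upserted: usize,
}

/// Toy pipeline: one chunk per fixture file, keyed only by stable fields.
fn run_ingest(seen: &mut HashSet<String>, fixture: &[(&str, &str)]) -> IngestRun {
    let mut chunks_upserted = 0;
    for (path, content) in fixture {
        // Stable key: path, chunk index 0, and the content itself
        // (standing in for its sha256).
        let chunk_id = format!("{}#0#{}", path, content);
        if seen.insert(chunk_id) {
            chunks_upserted += 1; // first time this chunk is seen
        }
    }
    IngestRun { chunks_upserted }
}

#[test]
fn second_scan_upserts_nothing() {
    let fixture = [("a.md", "alpha"), ("b.md", "beta")];
    let mut seen = HashSet::new();
    let mut ingest_runs = Vec::new();
    ingest_runs.push(run_ingest(&mut seen, &fixture));
    ingest_runs.push(run_ingest(&mut seen, &fixture));
    // The cold run ingests everything; the warm run must be a no-op.
    assert_eq!(ingest_runs[0].chunks_upserted, 2);
    assert_eq!(ingest_runs.last().unwrap().chunks_upserted, 0);
}
```

Running one such test per scanner, each with its own fixture, would show which emitters are affected instead of only reporting an aggregate count.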
