scan: chunk_id not fully content-addressed — several scanners re-embed identical content

## Summary

Several scanners produce non-deterministic `chunk_id`s: for a subset of their chunks, identical input yields a different `chunk_id` on every scan. That bypasses the content-sha dedup in `IngestDb::content_already_ingested` (pipeline/src/lib.rs:132) and forces those chunks to be re-embedded on every run.

Affected scanners in my corpus (reported counts are stable across back-to-back scans of unchanged input):

| Source | chunks | upserts per scan |
|---|---:|---:|
| `ostk-site` (markdown) | 124 | 59 |
| `fcp` (markdown) | 601 | 8 |
| `molva` | 790 | 9 |
| `needle-bench` | 259 | 17 |
| `samizdat-mesh` | 2301 | 22 |
| `mish` (docs) | 219 | 27 |

Fully idempotent sources (0 upserts on second scan): `osteak`, `ostk.ai.discord`, `reference`, `taynik`, `fcp` (code), `mish` (code). The pattern isn't universal, so the cause is something specific to the affected emitters. Note that `fcp` and `mish` are stable for code but not for markdown/docs.

## Repro

Back-to-back `ostk-recall scan` on the same corpus, nothing touched in between. Compare `scan summary` per-source `upserted` counts. Deterministic scanners go to 0; broken ones report an identical non-zero count every time.

## Evidence

Run 1 (cold):
```
total: items=6556 chunks=466170 upserted=45510 dup=420552
```

Run 2 (immediately after, no content change):
```
total: items=6558 chunks=466450 upserted=14533 dup=451873
```

The residual 14533 upserts on run 2 decompose to:
- ~14,300 legitimate new content (`haystack` audit rows landing live + 2 new `claude-code-history` session files)
- ~230 from the broken scanners listed above, which re-upsert the *same* chunks

## Why it matters

- Every scan re-embeds content that hasn't changed → wasted CPU/GPU time
- Scan wall time: ~60 min cold → ~48 min warm (only 20% faster; with pure dedup you'd expect a warm scan to take roughly 5% of the cold time)
- Blocks any sensible scheduling (launchd / watcher / post-commit hook) because "idempotent re-run" isn't.

## Likely cause

The affected scanners' `chunk_id` computation probably mixes in something ephemeral (wall-clock timestamp, scan-session uuid, monotonic counter, or a path component that includes a run-scoped dir). The content hash `sha256` is stable, but the `chunk_id` changes, so the `AND` in `content_already_ingested(chunk_id, sha256)` fails → cache miss → re-embed.
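To make the suspected failure mode concrete, here is a minimal sketch, not the actual pipeline code. `ToyIngestDb`, `unstable_chunk_id`, and the `run_id` parameter are hypothetical; the only thing taken from the real code is the idea that the lookup requires both `chunk_id` and `sha256` to match.

```rust
use std::collections::HashSet;

/// Toy stand-in for the ingest DB (hypothetical): remembers (chunk_id, sha256) pairs.
struct ToyIngestDb {
    seen: HashSet<(String, String)>,
}

impl ToyIngestDb {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Models the described check: the id AND the content hash must both match.
    fn content_already_ingested(&self, chunk_id: &str, sha256: &str) -> bool {
        self.seen.contains(&(chunk_id.to_string(), sha256.to_string()))
    }

    fn record(&mut self, chunk_id: &str, sha256: &str) {
        self.seen.insert((chunk_id.to_string(), sha256.to_string()));
    }
}

/// Hypothetical buggy id: mixes a per-run value into the chunk id.
fn unstable_chunk_id(source: &str, chunk_index: usize, run_id: u64) -> String {
    format!("{source}:{chunk_index}:{run_id}")
}

/// Returns true when a second scan of identical content is a cache hit.
fn second_scan_hits_cache(run1: u64, run2: u64) -> bool {
    // Placeholder for a content hash that is identical across both runs.
    let sha = "same-content-sha";
    let mut db = ToyIngestDb::new();
    db.record(&unstable_chunk_id("ostk-site", 0, run1), sha);
    db.content_already_ingested(&unstable_chunk_id("ostk-site", 0, run2), sha)
}
```

With different run ids the lookup misses even though the content hash is unchanged, which matches the observed "same non-zero upsert count every scan" behaviour.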

## Fix direction

`chunk_id` should be a deterministic function of `(source, source_id, chunk_index, content_sha256)` or similar run-independent fields. Audit each affected scanner's emit path for any input that varies between runs (timestamps, uuids, counters, run-scoped paths).
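One possible shape for such a function, as a sketch only. The function name, argument types, and the `:`-joined string format are assumptions for illustration; the real id format may well differ (for example, a hash of these fields).

```rust
/// Hypothetical deterministic chunk id: depends only on run-independent inputs.
/// No timestamps, uuids, counters, or run-scoped paths are involved.
fn stable_chunk_id(
    source: &str,
    source_id: &str,
    chunk_index: usize,
    content_sha256: &str,
) -> String {
    // Joining with ':' is an illustrative choice; a real implementation
    // would need to make sure the separator cannot appear ambiguously.
    format!("{source}:{source_id}:{chunk_index}:{content_sha256}")
}
```

Because every input is derived from the source tree itself, two scans of unchanged content produce the same id, and the `(chunk_id, sha256)` lookup can hit.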

## Test to add

An idempotency test: run the ingest pipeline twice on a fixed fixture tree with no changes in between; assert that `ingest_runs[-1].chunks_upserted == 0`.
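The shape of that test, sketched against a toy pipeline rather than the real one. Everything here is hypothetical: `scan`, `upserts_for_two_runs`, the in-memory `HashSet` standing in for the DB, and `toy_hash` (FNV-1a, used only as a dependency-free stand-in for SHA-256). A real test would drive the actual ingest pipeline and read `chunks_upserted` from the last `ingest_runs` row.

```rust
use std::collections::HashSet;

/// FNV-1a, used only as a dependency-free stand-in for SHA-256.
fn toy_hash(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// One toy scan pass over (source_id, content) pairs.
/// Returns the number of chunks upserted in this pass.
fn scan(fixture: &[(&str, &str)], db: &mut HashSet<(String, u64)>) -> usize {
    let mut upserted = 0;
    for (source_id, content) in fixture {
        let sha = toy_hash(content.as_bytes());
        // Deterministic id: built only from fields that do not vary per run.
        let chunk_id = format!("fixture:{source_id}:0:{sha:016x}");
        // `insert` returns true only when the pair was not already present.
        if db.insert((chunk_id, sha)) {
            upserted += 1;
        }
    }
    upserted
}

/// Runs two back-to-back scans and returns (first, second) upsert counts.
fn upserts_for_two_runs(fixture: &[(&str, &str)]) -> (usize, usize) {
    let mut db = HashSet::new();
    let first = scan(fixture, &mut db);
    let second = scan(fixture, &mut db);
    (first, second)
}
```

The assertion of interest is that the second count is 0 for any unchanged fixture; a scanner that mixes a run-scoped value into `chunk_id` would fail it immediately.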
