Crawl watchdog for dense ingest/backfill: throughput-triggered escalation ladder (detect -> empty_cache -> recycle CUDA context -> CPU demotion)

Fallback automation for the #176 WDDM-spill crawl, designed from two live incidents during the 2026-06-10/11 ERB 500K rebuild on a 12 GB rig:
(1) confluence shard decayed 15.3 -> 2.3 genes/s while dedicated VRAM pinned 11.85/12 GB (recovered only because the shard finished);
(2) slack__eng-oncall collapsed to 64 genes / 47 min (~0.02 genes/s, ~66h projected) with HELIX_DENSE_VRAM_RELEASE_EVERY=64 + expandable_segments ACTIVE — proving empty_cache alone does not un-crawl an already-spilled context. Manual fix that worked: kill + resume (fresh CUDA context; salvage + file-level resume lossless), then BGEM3_DEVICE=cpu for the remainder (GPU 11.9 GB -> 884 MiB, byte-identical vectors).

NOT a wall-clock per-shard timer (false-positives on legitimately large shards; misses crawls on small ones). Trigger on the unambiguous signature: genes/s EMA < shard's own early-batch baseline / HELIX_BFM_CRAWL_FACTOR (default 5) for N consecutive batches AND dedicated VRAM ~full (torch.cuda / pynvml).

Escalation ladder (every rung already exists as shipped machinery):
1. DETECT: log + helix_dense_crawl_detected counter (rides the #209/#211 instruments).
2. empty_cache + gc at the next batch boundary (cheap; occasionally sufficient early).
3. RECYCLE: self-pause via the #183 _PauseRequested batch-boundary machinery, exit worker, supervisor respawns -> fresh CUDA context lands allocations in dedicated; _try_salvage_complete_shard + _filter_to_unseen make it lossless.
4. CPU DEMOTION: second crawl on the same shard -> set BGEM3_DEVICE=cpu for that shard's backfill and continue. Terminal rung; always converges.

Knobs: HELIX_BFM_CRAWL_FACTOR (default 5), HELIX_BFM_CRAWL_WINDOW (batches, default 8), HELIX_BFM_CRAWL_ACTION (ladder|cpu|off). Scope: scripts/build_fixture_matrix.py supervisor loop + rate tracker in _drain_with_batched_splade and the backfill driver; also consider 'CPU backfill by default on <=12 GB rigs' per docs/operations/DENSE_VRAM.md. Update that runbook + this run's evidence when implementing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crawl watchdog for dense ingest/backfill: throughput-triggered escalation ladder (detect -> empty_cache -> recycle CUDA context -> CPU demotion) #212

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Crawl watchdog for dense ingest/backfill: throughput-triggered escalation ladder (detect -> empty_cache -> recycle CUDA context -> CPU demotion) #212

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions