Skip to content

Crawl watchdog for dense ingest/backfill: throughput-triggered escalation ladder (detect -> empty_cache -> recycle CUDA context -> CPU demotion) #212

@mbachaud

Description

@mbachaud

Fallback automation for the #176 WDDM-spill crawl, designed from two live incidents during the 2026-06-10/11 ERB 500K rebuild on a 12 GB rig:
(1) confluence shard decayed 15.3 -> 2.3 genes/s while dedicated VRAM pinned 11.85/12 GB (recovered only because the shard finished);
(2) slack__eng-oncall collapsed to 64 genes / 47 min (~0.02 genes/s, ~66h projected) with HELIX_DENSE_VRAM_RELEASE_EVERY=64 + expandable_segments ACTIVE — proving empty_cache alone does not un-crawl an already-spilled context. Manual fix that worked: kill + resume (fresh CUDA context; salvage + file-level resume lossless), then BGEM3_DEVICE=cpu for the remainder (GPU 11.9 GB -> 884 MiB, byte-identical vectors).

NOT a wall-clock per-shard timer (false-positives on legitimately large shards; misses crawls on small ones). Trigger on the unambiguous signature: genes/s EMA < shard's own early-batch baseline / HELIX_BFM_CRAWL_FACTOR (default 5) for N consecutive batches AND dedicated VRAM ~full (torch.cuda / pynvml).

Escalation ladder (every rung already exists as shipped machinery):

  1. DETECT: log + helix_dense_crawl_detected counter (rides the Telemetry gaps: 8 phantom dashboard metrics, genai_telemetry missing from master, top-5 unexported tuning signals #209/feat(telemetry): reconcile dashboards with real instruments + top-5 tuning signals (#209 phase 1) #211 instruments).
  2. empty_cache + gc at the next batch boundary (cheap; occasionally sufficient early).
  3. RECYCLE: self-pause via the feat(bench): file-level resume + SIGINT pause-then-resume for fixture builder (closes #150, #151) #183 _PauseRequested batch-boundary machinery, exit worker, supervisor respawns -> fresh CUDA context lands allocations in dedicated; _try_salvage_complete_shard + _filter_to_unseen make it lossless.
  4. CPU DEMOTION: second crawl on the same shard -> set BGEM3_DEVICE=cpu for that shard's backfill and continue. Terminal rung; always converges.

Knobs: HELIX_BFM_CRAWL_FACTOR (default 5), HELIX_BFM_CRAWL_WINDOW (batches, default 8), HELIX_BFM_CRAWL_ACTION (ladder|cpu|off). Scope: scripts/build_fixture_matrix.py supervisor loop + rate tracker in _drain_with_batched_splade and the backfill driver; also consider 'CPU backfill by default on <=12 GB rigs' per docs/operations/DENSE_VRAM.md. Update that runbook + this run's evidence when implementing.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions