Skip to content

feat(bench): crawl watchdog - throughput-triggered escalation ladder for dense ingest/backfill (#212)#215

Merged
mbachaud merged 1 commit into
masterfrom
feat/212-crawl-watchdog
Jun 12, 2026
Merged

feat(bench): crawl watchdog - throughput-triggered escalation ladder for dense ingest/backfill (#212)#215
mbachaud merged 1 commit into
masterfrom
feat/212-crawl-watchdog

Conversation

@mbachaud

Copy link
Copy Markdown
Owner

Closes #212. Pure CrawlDetector (EMA genes/s vs first-window median baseline; trip = EMA < baseline/HELIX_BFM_CRAWL_FACTOR for HELIX_BFM_CRAWL_WINDOW consecutive batches AND vram_frac > 0.92; hysteresis; disarms after terminal rung). Backfill path: ladder in backfill_bgem3_v2.backfill_dense_db (standalone script + fixture builder both covered) - rung 1 gc+empty_cache, rung 2 reload codec on CPU for the shard remainder. Ingest path: _drain_with_batched_splade raises the existing #183 _PauseRequested at a committed batch boundary (salvage + file-level resume reused). Knobs: HELIX_BFM_CRAWL_FACTOR=5 / _WINDOW=8 / _ACTION=ladder|cpu|off. All logs grep-able via [crawl-watchdog]. DENSE_VRAM.md documents the 2026-06-10/11 incidents this automates: confluence 15.3->2.3 genes/s; eng-oncall 64 genes/47min WITH release knobs active (empty_cache cannot un-spill - context recycle/CPU demotion is the fix); two live auto-recycles by the interim external watchdog, incl. 18:41 @ 1.03 genes/s / 11,941 MiB -> build finished 33 min later. Tests: 20 new GPU-free detector/wire tests; 35 passed locally across watchdog+resume+auto-subshard suites.

…for dense ingest/backfill (#212)

Fallback automation for the #176 WDDM-spill crawl, designed from two live
incidents during the 2026-06-10/11 ERB 500K rebuild on the 12 GB rig:

- confluence shard decayed 15.3 -> 2.3 genes/s with dedicated VRAM pinned
  at 11.85/12 GB; recovered only because the shard finished.
- slack__eng-oncall collapsed to 64 genes / 47 min (~0.02 genes/s, ~66 h
  projected) WITH the release knobs active (HELIX_DENSE_VRAM_RELEASE_EVERY=64
  + expandable_segments) — proof that periodic empty_cache does not un-crawl
  an already-spilled context. The manual fix that worked: kill + resume
  (fresh CUDA context; salvage + file-level resume lossless), then
  BGEM3_DEVICE=cpu for the remainder (GPU 11.9 GB -> 884 MiB,
  byte-identical vectors).
- The recycle action was validated live twice by the external watchdog on
  06-11 (12:00 and 18:41); the 18:41 trip fired at 1.03 genes/s /
  11941 MiB, recycled the worker, and the build finished 33 min later.

Not a wall-clock per-shard timer (false-positives on legitimately large
shards, misses crawls on small ones). Trigger is the unambiguous
signature: per-batch genes/s EMA < the shard's OWN early-batch baseline
(median of the first HELIX_BFM_CRAWL_WINDOW batches, default 8) /
HELIX_BFM_CRAWL_FACTOR (default 5) for WINDOW consecutive batches
(hysteresis: any healthy batch resets the streak) AND dedicated VRAM
near capacity (max(allocated, reserved)/total > 0.92, try/except-guarded
so CPU-only boxes structurally cannot trip).

Ladder (HELIX_BFM_CRAWL_ACTION=ladder|cpu|off, default ladder):
- rung 1: gc.collect + torch.cuda.empty_cache, WARNING log, continue;
- rung 2 (still crawling after another window):
  * ingest path (_drain_with_batched_splade): raise the existing #183
    _PauseRequested at the committed batch boundary — shard pauses
    cleanly, salvage + file-level resume restart it with a fresh CUDA
    context (reused machinery, nothing reinvented);
  * backfill path (backfill_bgem3_v2.backfill_dense_db, shared by the
    operator script and the fixture builder's _backfill_dense pass —
    env knobs honored there so both get it): tear down the codec and
    reload it with device=cpu for the REMAINDER of the shard
    (BGEM3_DEVICE=cpu semantics, byte-identical vectors);
- "cpu" jumps straight to the terminal rung; "off" detects + logs only.

The detector is a pure class (scripts/crawl_watchdog.py, CrawlDetector
.feed(genes, dt, vram_frac) -> action) — no torch import, no clock reads
— so tests drive it with a fake feed; CUDA probing lives in two guarded
helpers. Every line carries the stable grep-able prefix
"[crawl-watchdog]" with rate, baseline, vram fraction and action taken.

Docs: DENSE_VRAM.md gains a "Crawl watchdog (automatic)" section with
the incident evidence and knob reference.

Tests: tests/test_crawl_watchdog.py — 20 GPU-free tests covering
baseline establishment, healthy/no-trip, window-exact tripping, the
vram-low and vram-unknown no-trip guards, hysteresis, EMA smoothing,
all three action modes, env-knob parsing + garbage fallback, and
wire-level coverage of both paths (_PauseRequested raise on ingest
demote; codec CPU reload + injected-codec log-only on backfill demote).

Closes #212.
@mbachaud mbachaud merged commit 88aee29 into master Jun 12, 2026
3 checks passed
@mbachaud mbachaud deleted the feat/212-crawl-watchdog branch June 12, 2026 05:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crawl watchdog for dense ingest/backfill: throughput-triggered escalation ladder (detect -> empty_cache -> recycle CUDA context -> CPU demotion)

1 participant