feat(bench): crawl watchdog - throughput-triggered escalation ladder for dense ingest/backfill (#212) #215
Merged
…for dense ingest/backfill (#212)

Fallback automation for the #176 WDDM-spill crawl, designed from two live incidents during the 2026-06-10/11 ERB 500K rebuild on the 12 GB rig:

- The confluence shard decayed 15.3 -> 2.3 genes/s with dedicated VRAM pinned at 11.85/12 GB; it recovered only because the shard finished.
- slack__eng-oncall collapsed to 64 genes / 47 min (~0.02 genes/s, ~66 h projected) WITH the release knobs active (HELIX_DENSE_VRAM_RELEASE_EVERY=64 + expandable_segments): proof that periodic empty_cache does not un-crawl an already-spilled context. The manual fix that worked: kill + resume (fresh CUDA context; salvage + file-level resume lossless), then BGEM3_DEVICE=cpu for the remainder (GPU 11.9 GB -> 884 MiB, byte-identical vectors).
- The recycle action was validated live twice by the external watchdog on 06-11 (12:00 and 18:41); the 18:41 trip fired at 1.03 genes/s / 11941 MiB, recycled the worker, and the build finished 33 min later.

This is not a wall-clock per-shard timer (which false-positives on legitimately large shards and misses crawls on small ones). The trigger is the unambiguous signature: per-batch genes/s EMA < the shard's OWN early-batch baseline (median of the first HELIX_BFM_CRAWL_WINDOW batches, default 8) / HELIX_BFM_CRAWL_FACTOR (default 5) for WINDOW consecutive batches (hysteresis: any healthy batch resets the streak) AND dedicated VRAM near capacity (max(allocated, reserved)/total > 0.92, try/except-guarded so CPU-only boxes structurally cannot trip).
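The trigger described above can be sketched as a small pure class. This is a minimal illustration, not the code in scripts/crawl_watchdog.py: the class name `CrawlDetectorSketch`, the EMA weight `alpha=0.5`, the boolean return value, and the choice to count a batch toward the streak only when both the throughput and the VRAM condition hold are all assumptions made for the example.

```python
from statistics import median
from typing import List, Optional


class CrawlDetectorSketch:
    """Hypothetical stand-in for the detector: no torch import, no clock reads."""

    def __init__(self, window: int = 8, factor: float = 5.0,
                 vram_threshold: float = 0.92, alpha: float = 0.5) -> None:
        self.window = window                  # HELIX_BFM_CRAWL_WINDOW default
        self.factor = factor                  # HELIX_BFM_CRAWL_FACTOR default
        self.vram_threshold = vram_threshold
        self.alpha = alpha                    # EMA weight (assumed value)
        self._early: List[float] = []         # rates of the first `window` batches
        self.baseline: Optional[float] = None
        self.ema: Optional[float] = None
        self.streak = 0

    def feed(self, genes: int, dt: float, vram_frac: Optional[float]) -> bool:
        """Record one batch; return True once the crawl signature is met."""
        rate = genes / dt if dt > 0 else 0.0
        if self.ema is None:
            self.ema = rate
        else:
            self.ema = self.alpha * rate + (1 - self.alpha) * self.ema

        # Baseline phase: collect the shard's own early-batch rates.
        if self.baseline is None:
            self._early.append(rate)
            if len(self._early) == self.window:
                self.baseline = median(self._early)
            return False

        slow = self.ema < self.baseline / self.factor
        # Unknown VRAM (e.g. a CPU-only box) can never count as "full".
        full = vram_frac is not None and vram_frac > self.vram_threshold
        # Hysteresis: any healthy batch resets the streak.
        self.streak = self.streak + 1 if (slow and full) else 0
        return self.streak >= self.window
```

Because the caller supplies `genes`, `dt`, and `vram_frac`, a test can drive the sketch with a fake feed and never touch a GPU or the clock.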
Ladder (HELIX_BFM_CRAWL_ACTION=ladder|cpu|off, default ladder):

- rung 1: gc.collect + torch.cuda.empty_cache, WARNING log, continue;
- rung 2 (still crawling after another window):
  * ingest path (_drain_with_batched_splade): raise the existing #183 _PauseRequested at the committed batch boundary. The shard pauses cleanly, and salvage + file-level resume restart it with a fresh CUDA context (reused machinery, nothing reinvented);
  * backfill path (backfill_bgem3_v2.backfill_dense_db, shared by the operator script and the fixture builder's _backfill_dense pass; env knobs are honored there so both get it): tear down the codec and reload it with device=cpu for the REMAINDER of the shard (BGEM3_DEVICE=cpu semantics, byte-identical vectors);
- "cpu" jumps straight to the terminal rung; "off" detects + logs only.

The detector is a pure class (scripts/crawl_watchdog.py, CrawlDetector.feed(genes, dt, vram_frac) -> action) with no torch import and no clock reads, so tests drive it with a fake feed; CUDA probing lives in two guarded helpers. Every line carries the stable grep-able prefix "[crawl-watchdog]" with rate, baseline, vram fraction and action taken.

Docs: DENSE_VRAM.md gains a "Crawl watchdog (automatic)" section with the incident evidence and knob reference.

Tests: tests/test_crawl_watchdog.py: 20 GPU-free tests covering baseline establishment, healthy/no-trip, window-exact tripping, the vram-low and vram-unknown no-trip guards, hysteresis, EMA smoothing, all three action modes, env-knob parsing + garbage fallback, and wire-level coverage of both paths (_PauseRequested raise on ingest demote; codec CPU reload + injected-codec log-only on backfill demote).

Closes #212.
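The escalation ladder in this commit message can be pictured as a tiny state machine that turns detector trips into actions. The sketch below is illustrative only: `Action`, `LadderSketch`, `on_trip`, and `format_line` are hypothetical names, the log-line layout after the "[crawl-watchdog]" prefix is invented for the example, and falling back to "ladder" on an unrecognized mode is an assumption. It also assumes the caller resets the detector's streak after rung 1, so that a second trip means "still crawling after another window".

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"          # disarmed, nothing to do
    LOG_ONLY = "log_only"  # "off" mode: detect and log
    RELEASE = "release"    # rung 1: gc.collect + empty_cache, then continue
    DEMOTE = "demote"      # rung 2: pause/resume (ingest) or CPU reload (backfill)


class LadderSketch:
    """Hypothetical escalation state for one shard."""

    def __init__(self, mode: str = "ladder") -> None:
        # Assumption: an unrecognized mode falls back to the default.
        self.mode = mode if mode in ("ladder", "cpu", "off") else "ladder"
        self.rung = 0
        self.disarmed = False

    def on_trip(self) -> Action:
        """Map one detector trip to the next action on the ladder."""
        if self.disarmed:
            return Action.NONE
        if self.mode == "off":
            return Action.LOG_ONLY
        if self.mode == "cpu" or self.rung >= 1:
            self.disarmed = True      # the terminal rung fires once per shard
            return Action.DEMOTE
        self.rung = 1
        return Action.RELEASE


def format_line(rate: float, baseline: float,
                vram_frac: Optional[float], action: Action) -> str:
    """Build a grep-able log line; only the prefix comes from the source."""
    vram = "unknown" if vram_frac is None else f"{vram_frac:.3f}"
    return (f"[crawl-watchdog] rate={rate:.2f} genes/s "
            f"baseline={baseline:.2f} vram={vram} action={action.value}")
```

In "ladder" mode two successive trips yield `RELEASE` and then `DEMOTE`, after which the sketch stays disarmed; in "cpu" mode the first trip already yields `DEMOTE`.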
This was referenced Jun 12, 2026
Closes #212.

Pure CrawlDetector (EMA genes/s vs first-window median baseline; trip = EMA < baseline/HELIX_BFM_CRAWL_FACTOR for HELIX_BFM_CRAWL_WINDOW consecutive batches AND vram_frac > 0.92; hysteresis; disarms after terminal rung).

Backfill path: ladder in backfill_bgem3_v2.backfill_dense_db (standalone script + fixture builder both covered) - rung 1 gc+empty_cache, rung 2 reload codec on CPU for the shard remainder.

Ingest path: _drain_with_batched_splade raises the existing #183 _PauseRequested at a committed batch boundary (salvage + file-level resume reused).

Knobs: HELIX_BFM_CRAWL_FACTOR=5 / _WINDOW=8 / _ACTION=ladder|cpu|off. All logs grep-able via [crawl-watchdog].

DENSE_VRAM.md documents the 2026-06-10/11 incidents this automates: confluence 15.3->2.3 genes/s; eng-oncall 64 genes/47min WITH release knobs active (empty_cache cannot un-spill - context recycle/CPU demotion is the fix); two live auto-recycles by the interim external watchdog, incl. 18:41 @ 1.03 genes/s / 11,941 MiB -> build finished 33 min later.

Tests: 20 new GPU-free detector/wire tests; 35 passed locally across watchdog+resume+auto-subshard suites.
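For readers wiring up the knobs, here is one way the three environment variables and their "garbage fallback" could be parsed. It is a sketch under assumptions: `read_knobs` and `_positive` are hypothetical helpers, and treating non-numeric, non-positive, or unrecognized values as "use the default" is a guess at the fallback behaviour rather than a description of the merged code. Only the variable names and defaults come from the PR text.

```python
import os
from typing import Callable, Mapping, Optional, TypeVar

N = TypeVar("N", int, float)


def _positive(raw: Optional[str], default: N, cast: Callable[[str], N]) -> N:
    """Parse a positive number, falling back to the default on garbage."""
    if raw is None:
        return default
    try:
        value = cast(raw.strip())
    except ValueError:
        return default
    return value if value > 0 else default


def read_knobs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the crawl-watchdog knobs from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    action = env.get("HELIX_BFM_CRAWL_ACTION", "ladder").strip().lower()
    if action not in ("ladder", "cpu", "off"):
        action = "ladder"             # assumed fallback for unknown modes
    return {
        "factor": _positive(env.get("HELIX_BFM_CRAWL_FACTOR"), 5.0, float),
        "window": _positive(env.get("HELIX_BFM_CRAWL_WINDOW"), 8, int),
        "action": action,
    }
```

Passing a plain dict instead of relying on `os.environ` keeps the parser easy to exercise in GPU-free tests, for example `read_knobs({"HELIX_BFM_CRAWL_ACTION": "cpu"})`.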