feat(bench): crawl watchdog - throughput-triggered escalation ladder for dense ingest/backfill (#212) by mbachaud · Pull Request #215 · mbachaud/helix-context

mbachaud · 2026-06-12T05:28:09Z

Closes #212.

Detector: pure CrawlDetector (EMA genes/s vs. first-window median baseline). It trips when EMA < baseline / HELIX_BFM_CRAWL_FACTOR for HELIX_BFM_CRAWL_WINDOW consecutive batches AND vram_frac > 0.92. It has hysteresis and disarms after the terminal rung.

Backfill path: ladder in backfill_bgem3_v2.backfill_dense_db (standalone script and fixture builder both covered).
- rung 1: gc + empty_cache
- rung 2: reload codec on CPU for the shard remainder

Ingest path: _drain_with_batched_splade raises the existing #183 _PauseRequested at a committed batch boundary (salvage + file-level resume reused).

Knobs: HELIX_BFM_CRAWL_FACTOR=5 / _WINDOW=8 / _ACTION=ladder|cpu|off. All logs are grep-able via [crawl-watchdog].

DENSE_VRAM.md documents the 2026-06-10/11 incidents this automates:
- confluence: 15.3 -> 2.3 genes/s
- eng-oncall: 64 genes / 47 min WITH release knobs active (empty_cache cannot un-spill; context recycle or CPU demotion is the fix)
- two live auto-recycles by the interim external watchdog, incl. 18:41 @ 1.03 genes/s / 11,941 MiB -> build finished 33 min later

Tests: 20 new GPU-free detector/wire tests; 35 passed locally across the watchdog + resume + auto-subshard suites.
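For readers who want to see how the three knobs above might be read with a fallback to defaults, here is a minimal sketch. The environment variable names and defaults (5, 8, ladder) come from this PR; the helper names `read_crawl_knobs`, `CrawlKnobs` and `_positive`, the choice of float for the factor, and the rule that non-positive or non-finite values count as garbage are assumptions for illustration, not the code that was merged.

```python
import math
import os
from typing import Mapping, NamedTuple

VALID_ACTIONS = ("ladder", "cpu", "off")


class CrawlKnobs(NamedTuple):
    """Hypothetical container for the three HELIX_BFM_CRAWL_* knobs."""
    factor: float
    window: int
    action: str


def _positive(raw, cast, default):
    """Cast raw to a positive finite number; return default on any garbage."""
    try:
        value = cast(raw)
        ok = value > 0 and math.isfinite(value)
    except (TypeError, ValueError, OverflowError):
        return default
    return value if ok else default


def read_crawl_knobs(env: Mapping[str, str] = os.environ) -> CrawlKnobs:
    """Read the knobs from env, falling back to the documented defaults."""
    action = str(env.get("HELIX_BFM_CRAWL_ACTION", "ladder")).strip().lower()
    return CrawlKnobs(
        factor=_positive(env.get("HELIX_BFM_CRAWL_FACTOR"), float, 5.0),
        window=_positive(env.get("HELIX_BFM_CRAWL_WINDOW"), int, 8),
        action=action if action in VALID_ACTIONS else "ladder",
    )
```

Passing a plain dict instead of `os.environ` keeps the parsing testable without touching the process environment, e.g. `read_crawl_knobs({"HELIX_BFM_CRAWL_ACTION": "off"})`.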

…for dense ingest/backfill (#212)

Fallback automation for the #176 WDDM-spill crawl, designed from two live incidents during the 2026-06-10/11 ERB 500K rebuild on the 12 GB rig:

- confluence shard decayed 15.3 -> 2.3 genes/s with dedicated VRAM pinned at 11.85/12 GB; recovered only because the shard finished.
- slack__eng-oncall collapsed to 64 genes / 47 min (~0.02 genes/s, ~66 h projected) WITH the release knobs active (HELIX_DENSE_VRAM_RELEASE_EVERY=64 + expandable_segments): proof that periodic empty_cache does not un-crawl an already-spilled context. The manual fix that worked: kill + resume (fresh CUDA context; salvage + file-level resume lossless), then BGEM3_DEVICE=cpu for the remainder (GPU 11.9 GB -> 884 MiB, byte-identical vectors).
- The recycle action was validated live twice by the external watchdog on 06-11 (12:00 and 18:41); the 18:41 trip fired at 1.03 genes/s / 11941 MiB, recycled the worker, and the build finished 33 min later.

Not a wall-clock per-shard timer (false-positives on legitimately large shards, misses crawls on small ones). Trigger is the unambiguous signature:

- per-batch genes/s EMA < the shard's OWN early-batch baseline (median of the first HELIX_BFM_CRAWL_WINDOW batches, default 8) / HELIX_BFM_CRAWL_FACTOR (default 5), for WINDOW consecutive batches (hysteresis: any healthy batch resets the streak), AND
- dedicated VRAM near capacity (max(allocated, reserved)/total > 0.92, try/except-guarded so CPU-only boxes structurally cannot trip).
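The trigger described above can be sketched as a small pure class plus one guarded probe. This is an illustration under stated assumptions, not the contents of scripts/crawl_watchdog.py: the names `CrawlDetectorSketch` and `probe_vram_frac`, the EMA weight `alpha=0.3`, and the boolean return value (the real `feed` returns an action) are all hypothetical. The baseline, factor, window, and 0.92 threshold follow the commit message.

```python
from statistics import median
from typing import List, Optional


class CrawlDetectorSketch:
    """Illustrative crawl detector: no torch import, no clock reads."""

    def __init__(self, factor: float = 5.0, window: int = 8,
                 vram_threshold: float = 0.92, alpha: float = 0.3) -> None:
        self.factor = factor
        self.window = window
        self.vram_threshold = vram_threshold
        self.alpha = alpha  # EMA weight: an assumed value, not from the PR
        self._early_rates: List[float] = []
        self.baseline: Optional[float] = None
        self.ema: Optional[float] = None
        self.streak = 0

    def feed(self, genes: int, dt: float, vram_frac: Optional[float]) -> bool:
        """Return True once the crawl signature held for `window` batches."""
        rate = genes / dt if dt > 0 else 0.0
        self.ema = rate if self.ema is None else (
            self.alpha * rate + (1 - self.alpha) * self.ema)
        if self.baseline is None:
            # Still establishing the shard's own early-batch baseline.
            self._early_rates.append(rate)
            if len(self._early_rates) >= self.window:
                self.baseline = median(self._early_rates)
            return False
        slow = self.ema < self.baseline / self.factor
        # Unknown VRAM (None) can never satisfy the guard.
        vram_high = vram_frac is not None and vram_frac > self.vram_threshold
        # Hysteresis: any batch that is not slow-and-full resets the streak.
        self.streak = self.streak + 1 if (slow and vram_high) else 0
        if self.streak >= self.window:
            self.streak = 0
            return True
        return False


def probe_vram_frac() -> Optional[float]:
    """max(allocated, reserved) / total on the current device, else None."""
    try:
        import torch
        if not torch.cuda.is_available():
            return None
        dev = torch.cuda.current_device()
        used = max(torch.cuda.memory_allocated(dev),
                   torch.cuda.memory_reserved(dev))
        return used / torch.cuda.get_device_properties(dev).total_memory
    except Exception:
        return None  # CPU-only or broken CUDA: the detector cannot trip
```

Because the detector takes `vram_frac` as an argument rather than probing it, a test can drive it with a fake feed, for example eight batches at 15 genes/s followed by a run of 1 genes/s batches at `vram_frac=0.99`.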
Ladder (HELIX_BFM_CRAWL_ACTION=ladder|cpu|off, default ladder):

- rung 1: gc.collect + torch.cuda.empty_cache, WARNING log, continue;
- rung 2 (still crawling after another window):
  * ingest path (_drain_with_batched_splade): raise the existing #183 _PauseRequested at the committed batch boundary. The shard pauses cleanly, and salvage + file-level resume restart it with a fresh CUDA context (reused machinery, nothing reinvented);
  * backfill path (backfill_bgem3_v2.backfill_dense_db, shared by the operator script and the fixture builder's _backfill_dense pass; env knobs honored there so both get it): tear down the codec and reload it with device=cpu for the REMAINDER of the shard (BGEM3_DEVICE=cpu semantics, byte-identical vectors);
- "cpu" jumps straight to the terminal rung; "off" detects + logs only.

The detector is a pure class (scripts/crawl_watchdog.py, CrawlDetector .feed(genes, dt, vram_frac) -> action) with no torch import and no clock reads, so tests drive it with a fake feed; CUDA probing lives in two guarded helpers. Every line carries the stable grep-able prefix "[crawl-watchdog]" with rate, baseline, vram fraction and action taken.

Docs: DENSE_VRAM.md gains a "Crawl watchdog (automatic)" section with the incident evidence and knob reference.

Tests: tests/test_crawl_watchdog.py: 20 GPU-free tests covering baseline establishment, healthy/no-trip, window-exact tripping, the vram-low and vram-unknown no-trip guards, hysteresis, EMA smoothing, all three action modes, env-knob parsing + garbage fallback, and wire-level coverage of both paths (_PauseRequested raise on ingest demote; codec CPU reload + injected-codec log-only on backfill demote).

Closes #212.
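The escalation ladder above can be pictured as a small state machine that is driven each time the detector trips. The following is only a sketch: `LadderSketch`, its `release` and `demote` callbacks, the returned strings, and the log line layout beyond the "[crawl-watchdog]" prefix are hypothetical. In the PR the terminal rung is a _PauseRequested raise on the ingest path and a codec reload on CPU on the backfill path; here both are collapsed into one injected `demote` callback so the example stays GPU-free.

```python
import gc
from typing import Callable, Optional


class LadderSketch:
    """Illustrative escalation ladder for the three action modes."""

    def __init__(self, action: str = "ladder",
                 release: Optional[Callable[[], object]] = None,
                 demote: Optional[Callable[[], object]] = None) -> None:
        self.action = action
        # Rung 1 default: only gc.collect here; a real rung 1 would also
        # call torch.cuda.empty_cache().
        self.release = release or gc.collect
        # Terminal rung: pause/resume or CPU reload, injected by the caller.
        self.demote = demote or (lambda: None)
        self.rung = 0
        self.armed = True

    def on_trip(self, rate: float, baseline: float, vram_frac: float) -> str:
        """Apply the next rung for one detector trip; return the action taken."""
        if not self.armed:
            return "none"  # disarmed after the terminal rung
        if self.action == "off":
            taken = "log-only"
        elif self.action == "cpu" or self.rung >= 1:
            self.demote()
            self.armed = False
            taken = "demote"
        else:
            self.release()
            self.rung = 1
            taken = "release"
        # Log layout is an assumption; only the prefix comes from the PR.
        print(f"[crawl-watchdog] rate={rate:.2f} baseline={baseline:.2f} "
              f"vram={vram_frac:.2f} action={taken}")
        return taken
```

In "ladder" mode the first trip runs `release`, the second runs `demote` and disarms the ladder, and later trips are ignored; "cpu" goes straight to `demote`, and "off" never calls either callback.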

mbachaud merged commit 88aee29 into master Jun 12, 2026
3 checks passed

mbachaud deleted the feat/212-crawl-watchdog branch June 12, 2026 05:41

This was referenced Jun 12, 2026

- Wall-2: decide fate of unmerged dense-latency PRs #158/#160 #206 (Open)
- Bench: evaluate EnterpriseRAG-Bench as a blob-vs-sharded test corpus #93 (Open)

mbachaud merged 1 commit into master from feat/212-crawl-watchdog
