fix(bench): sharder catch-all for root-level files + builder coverage assertion (#213) #216
Merged
… assertion (#213)

Measured on the 100-shard v2 Onyx corpus (cc-exchange embedding-upgrade turn 0024, L1a; fresh-verified, author != verifier): four slack/<epoch>-*.json files living DIRECTLY at the slack source root (no channel subfolder) were never assigned to ANY shard. Doc-level scan: 80/84 never-surfaced golds in-index, 4/84 missing, 0 slack-root docs indexed anywhere. Ceiling: +4 golds @200_lt (41 -> 45, +3.2pp on the 125-query set). Unfixable by any retrieval/fusion change.

Root cause: _decompose_oversized_root (#147 auto-subshard) splits an oversized root along its top-level subdirectories and returns only the per-subdir (slug, path) entries. Files directly under the root match no sub-shard prefix and silently fall through: no task ever walks them.

Fix:
- _decompose_oversized_root appends a dedicated, self-identifying "<root-slug>__root" catch-all entry whenever the decomposed root has eligible files at its top level, with a deterministic collision guard against a genuine subdir named "root". Subdir-only roots, the flat-layout fallback, and the depth guard are byte-for-byte unchanged.
- build_profile_sharded marks catch-all tasks rootfiles_only and extends their skip set with every immediate subdir name, so the __root shard walks exactly the root-level files and cannot duplicate the per-subdir shards. The pre-ingest sizing pass now sizes each task with its own skip set.
- Builder coverage assertion: after task-list construction, one walk of the profile roots (same eligibility filters as ingest, per-directory coverage cache: cheap prefix check, no re-walk per task) verifies every eligible file falls under some task's root; violations log ERROR with the orphan list and raise. Default ON; HELIX_BFM_COVERAGE_CHECK=0 skips it for speed on huge corpora.

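For readers without the diff open, the catch-all step can be sketched roughly as below. This is an illustration under stated assumptions, not the code from this PR: `decompose_root` is a hypothetical stand-in for `_decompose_oversized_root`, the `<root-slug>__<subdir>` naming for per-subdir slugs is assumed, every regular file counts as "eligible", the flat-layout branch is a placeholder, and the numeric-suffix collision guard is only one possible deterministic scheme.

```python
import tempfile
from pathlib import Path

CATCH_ALL = "root"  # label used for the catch-all slug


def decompose_root(root: Path, root_slug: str) -> list[tuple[str, Path]]:
    """Hypothetical sketch: one (slug, path) entry per top-level subdir,
    plus a catch-all entry when files sit directly under `root`."""
    subdirs = sorted(p for p in root.iterdir() if p.is_dir())
    if not subdirs:
        # Placeholder for the flat-layout fallback (assumed shape).
        return [(root_slug, root)]

    # Assumed naming convention for per-subdir shards.
    entries = [(f"{root_slug}__{d.name}", d) for d in subdirs]

    # Assumption: any regular file at the top level is "eligible".
    if any(p.is_file() for p in root.iterdir()):
        taken = {slug for slug, _ in entries}
        slug = f"{root_slug}__{CATCH_ALL}"
        n = 2
        # Deterministic guard: a genuine subdir named "root" already owns
        # "<root-slug>__root", so pick the first free numbered variant.
        while slug in taken:
            slug = f"{root_slug}__{CATCH_ALL}_{n}"
            n += 1
        entries.append((slug, root))
    return entries


def demo() -> list[str]:
    """Build a tiny tree with a real 'root' subdir and one root-level file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "slack"
        (root / "general").mkdir(parents=True)
        (root / "root").mkdir()
        (root / "1700000000-a.json").write_text("{}")
        return [slug for slug, _ in decompose_root(root, "slack")]
```

In this sketch a subdir-only root returns exactly the per-subdir entries, which mirrors the "unchanged" guarantee above; the catch-all entry points at the root itself and relies on the task's skip set to avoid re-walking the subdirs.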
tests/test_sharder_rootfile_coverage.py pins the failing-on-master repro (2 subdirs + 2 root-level files -> orphans on master), flat-root and subdir-only no-regression, __root slug naming/determinism + collision guard, the rootfiles-only walk, coverage pass / raise-with-orphan-list / env kill-switch / default-on, and the build_profile_sharded wiring.

Affected fixture roots need a re-ingest to pick up their __root shards (resume-safe: complete subdir shards salvage untouched).

Closes #213.
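The coverage assertion from the fix list can be pictured with a minimal sketch like the following. Everything except the `HELIX_BFM_COVERAGE_CHECK` variable is an assumption made for illustration: the function and exception names are hypothetical, a task is modeled as a `(root, skip_names)` pair, skip names are matched only against the first path component below the task root, and every file is treated as eligible. Coverage is decided once per directory, which stands in for the per-directory cache.

```python
import logging
import os
from pathlib import Path

log = logging.getLogger("coverage_sketch")

# Hypothetical task model: (task root, directory names the task skips).
Task = tuple[Path, frozenset[str]]


class ShardCoverageError(RuntimeError):
    """Hypothetical error carrying the list of orphaned files."""

    def __init__(self, orphans: list[str]) -> None:
        super().__init__(f"{len(orphans)} file(s) covered by no shard task")
        self.orphans = orphans


def dir_is_covered(directory: Path, tasks: list[Task]) -> bool:
    """Prefix check: is `directory` under some task root and not skipped?"""
    for task_root, skip in tasks:
        try:
            rel = directory.relative_to(task_root)
        except ValueError:
            continue  # not under this task's root
        if not rel.parts or rel.parts[0] not in skip:
            return True
    return False


def assert_shard_coverage_sketch(roots: list[Path], tasks: list[Task]) -> None:
    # Default ON; only an explicit "0" disables the check.
    if os.environ.get("HELIX_BFM_COVERAGE_CHECK", "1") == "0":
        return
    orphans: list[str] = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            # One decision per directory covers all files inside it.
            if filenames and not dir_is_covered(Path(dirpath), tasks):
                orphans.extend(str(Path(dirpath) / f) for f in filenames)
    if orphans:
        orphans.sort()
        log.error("files not covered by any shard task: %s", orphans)
        raise ShardCoverageError(orphans)
```

With only per-subdir tasks, root-level files are reported as orphans; adding a catch-all task rooted at the parent with the subdir names in its skip set makes the check pass.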
This was referenced Jun 12, 2026
Closes #213.

Root cause: _decompose_oversized_root (#147 auto-subshard) split a root along top-level subdirs only, so files directly at the root (slack/<epoch>-*.json, no channel folder) matched no sub-shard prefix and were never walked by any task.

Measured impact (cc-exchange turn 0024 L1a, fresh-verified): 4/84 never-surfaced golds missing from the index, 0 slack-root docs indexed anywhere, +3.2pp @200_lt ceiling.

Fix:
(1) decompose appends a self-identifying __root catch-all shard for root-level eligible files (collision-guarded; subdir-only/flat/depth behavior unchanged), with rootfiles_only task marking + subdir skip-sets so nothing ingests twice;
(2) new _assert_shard_coverage after task construction: one cached-prefix walk verifying every eligible file maps to a task, ERROR + orphan list + raise on gap (HELIX_BFM_COVERAGE_CHECK=0 to skip). Silent fall-through is now a build error.

Repro test verified failing on pre-fix master. 10 new tests; 69 passed across sharder suites in sandbox, 25 locally. Affected roots need a resume-safe re-ingest to pick up the orphaned files.
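As a rough illustration of why the skip-set extension keeps the __root shard from ingesting anything twice, here is a small sketch. The helper names are hypothetical and do not come from this PR; the walk simply prunes any directory whose name is in the skip set, and all files are treated as eligible.

```python
import os
from pathlib import Path
from typing import Iterator


def rootfiles_skip_set(task_root: Path, base_skip: frozenset[str]) -> frozenset[str]:
    """Extend a task's skip set with every immediate subdirectory name."""
    subdirs = {p.name for p in task_root.iterdir() if p.is_dir()}
    return base_skip | frozenset(subdirs)


def iter_task_files(task_root: Path, skip_names: frozenset[str]) -> Iterator[Path]:
    """Yield files under `task_root`, never descending into skipped dirs."""
    for dirpath, dirnames, filenames in os.walk(task_root):
        # Assigning in place tells os.walk not to enter the pruned dirs.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_names)
        for name in sorted(filenames):
            yield Path(dirpath) / name
```

Because every immediate subdirectory is pruned at the first level, the catch-all walk in this sketch yields only the files sitting directly at the root, while each per-subdir task walks its own subtree with its own skip set.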