fix(bench): sharder catch-all for root-level files + builder coverage assertion (#213) #216

Merged
mbachaud merged 1 commit into master from fix/213-sharder-rootfile-gap on Jun 12, 2026

Conversation

@mbachaud

Owner

Closes #213.

Root cause: _decompose_oversized_root (#147 auto-subshard) split a root along top-level subdirs only. Files directly at the root (slack/<epoch>-*.json, no channel folder) matched no sub-shard prefix and were never walked by any task.

Measured impact (cc-exchange turn 0024 L1a, fresh-verified): 4/84 never-surfaced golds missing from the index, 0 slack-root docs indexed anywhere, +3.2pp @200_lt ceiling.

Fix:
(1) Decompose appends a self-identifying __root catch-all shard for root-level eligible files (collision-guarded; subdir-only/flat/depth behavior unchanged), with rootfiles_only task marking + subdir skip-sets so nothing ingests twice.
(2) New _assert_shard_coverage after task construction: one cached-prefix walk verifying that every eligible file maps to a task, with ERROR + orphan list + raise on a gap (HELIX_BFM_COVERAGE_CHECK=0 to skip). Silent fall-through is now a build error.

Repro test verified failing on pre-fix master. 10 new tests; 69 passed across sharder suites in sandbox, 25 locally. Affected roots need a resume-safe re-ingest to pick up the orphaned files.

… assertion (#213)

Measured on the 100-shard v2 Onyx corpus (cc-exchange embedding-upgrade
turn 0024, L1a; fresh-verified, author != verifier): four
slack/<epoch>-*.json files living DIRECTLY at the slack source root (no
channel subfolder) were never assigned to ANY shard — doc-level scan:
80/84 never-surfaced golds in-index, 4/84 missing, 0 slack-root docs
indexed anywhere. Ceiling: +4 golds @200_lt (41 -> 45, +3.2pp on the
125-query set). Unfixable by any retrieval/fusion change.

Root cause: _decompose_oversized_root (#147 auto-subshard) splits an
oversized root along its top-level subdirectories and returns only the
per-subdir (slug, path) entries. Files directly under the root match no
sub-shard prefix and silently fall through — no task ever walks them.
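
The fall-through can be reproduced in a few lines. This is a minimal sketch, not the project's code: `decompose_subdirs_only`, `orphans`, the `<root>__<subdir>` slug shape, and the file names are all hypothetical stand-ins for the pre-fix behavior described above.

```python
from pathlib import PurePosixPath


def decompose_subdirs_only(root, files):
    """Hypothetical pre-fix behavior: one (slug, path) entry per top-level subdir."""
    root = PurePosixPath(root)
    subdirs = set()
    for f in files:
        parts = PurePosixPath(f).relative_to(root).parts
        if len(parts) > 1:  # file lives inside a subdirectory
            subdirs.add(parts[0])
    # Slug format is an assumption made for illustration only.
    return [(f"{root.name}__{d}", str(root / d)) for d in sorted(subdirs)]


def orphans(files, shards):
    """Files that fall under no shard path prefix."""
    prefixes = [path + "/" for _, path in shards]
    return [f for f in files if not any(f.startswith(p) for p in prefixes)]


files = [
    "slack/general/0001.json",
    "slack/random/0002.json",
    "slack/1700000000-a.json",  # directly at the root, no channel folder
    "slack/1700000001-b.json",
]
shards = decompose_subdirs_only("slack", files)
print(orphans(files, shards))
# → ['slack/1700000000-a.json', 'slack/1700000001-b.json']
```

Both root-level files match neither `slack/general/` nor `slack/random/`, so no task would ever walk them.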

Fix:
- _decompose_oversized_root appends a dedicated, self-identifying
  "<root-slug>__root" catch-all entry whenever the decomposed root has
  eligible files at its top level, with a deterministic collision guard
  against a genuine subdir named "root". Subdir-only roots, the
  flat-layout fallback, and the depth guard are byte-for-byte unchanged.
- build_profile_sharded marks catch-all tasks rootfiles_only and extends
  their skip set with every immediate subdir name, so the __root shard
  walks exactly the root-level files and cannot duplicate the per-subdir
  shards. The pre-ingest sizing pass now sizes each task with its own
  skip set.
- Builder coverage assertion: after task-list construction, one walk of
  the profile roots (same eligibility filters as ingest, per-directory
  coverage cache — cheap prefix check, no re-walk per task) verifies
  every eligible file falls under some task's root; violations log ERROR
  with the orphan list and raise. Default ON; HELIX_BFM_COVERAGE_CHECK=0
  skips it for speed on huge corpora.
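
The three pieces of the fix can be sketched together as follows. This is an illustrative approximation under stated assumptions, not the merged implementation: the function names, the task dict layout, and the underscore-suffix collision guard are hypothetical, and the coverage check here re-walks the file list per task instead of using the per-directory cache described above. Only `HELIX_BFM_COVERAGE_CHECK`, `rootfiles_only`, and the `__root` suffix come from the description.

```python
import logging
import os
from pathlib import PurePosixPath

log = logging.getLogger(__name__)


def decompose_with_catch_all(root, files):
    """Per-subdir shards plus a '<root-slug>__root' catch-all when needed."""
    root = PurePosixPath(root)
    rel = [PurePosixPath(f).relative_to(root).parts for f in files]
    subdirs = sorted({parts[0] for parts in rel if len(parts) > 1})
    tasks = [{"slug": f"{root.name}__{d}", "path": str(root / d),
              "rootfiles_only": False, "skip": set()} for d in subdirs]
    if any(len(parts) == 1 for parts in rel):  # eligible files at the top level
        taken = {t["slug"] for t in tasks}
        slug = f"{root.name}__root"
        while slug in taken:  # hypothetical deterministic collision guard
            slug += "_"
        # The catch-all skips every immediate subdir so nothing ingests twice.
        tasks.append({"slug": slug, "path": str(root),
                      "rootfiles_only": True, "skip": set(subdirs)})
    return tasks


def files_for_task(task, files):
    """Files this task would walk, honoring rootfiles_only and its skip set."""
    base = PurePosixPath(task["path"])
    out = []
    for f in files:
        p = PurePosixPath(f)
        if base not in p.parents:
            continue
        parts = p.relative_to(base).parts
        nested = len(parts) > 1
        if nested and (task["rootfiles_only"] or parts[0] in task["skip"]):
            continue
        out.append(f)
    return out


def assert_shard_coverage(tasks, files):
    """Raise if any eligible file maps to no task (default ON)."""
    if os.environ.get("HELIX_BFM_COVERAGE_CHECK", "1") == "0":
        return
    covered = set()
    for task in tasks:
        covered.update(files_for_task(task, files))
    missing = sorted(set(files) - covered)
    if missing:
        log.error("shard coverage gap, orphaned files: %s", missing)
        raise RuntimeError(f"{len(missing)} eligible file(s) map to no task: {missing}")


files = ["slack/general/0001.json", "slack/root/0002.json",
         "slack/1700000000-a.json"]
tasks = decompose_with_catch_all("slack", files)
assert_shard_coverage(tasks, files)  # passes: every file has exactly one owner
```

In this sketch a genuine subdir named `root` keeps the slug `slack__root`, and the catch-all is renamed to `slack__root_`; the real guard may resolve the collision differently.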

tests/test_sharder_rootfile_coverage.py pins the failing-on-master repro
(2 subdirs + 2 root-level files -> orphans on master), flat-root and
subdir-only no-regression, __root slug naming/determinism + collision
guard, the rootfiles-only walk, coverage pass / raise-with-orphan-list /
env kill-switch / default-on, and the build_profile_sharded wiring.

Affected fixture roots need a re-ingest to pick up their __root shards
(resume-safe: completed subdir shards are salvaged untouched).

Closes #213.
@mbachaud mbachaud merged commit 9aac08c into master Jun 12, 2026
3 checks passed
@mbachaud mbachaud deleted the fix/213-sharder-rootfile-gap branch June 12, 2026 05:46

Development

Successfully merging this pull request may close these issues.

Sharder gap: channel-less slack-root files are never assigned a shard (0 indexed)
