fix(retrieval): dense pool floor + true-pair calibration cap - dense leg can no longer be gated to zero (#214) by mbachaud · Pull Request #217 · mbachaud/helix-context

mbachaud · 2026-06-12T17:17:18Z

Closes #214. Evidence: the mu+3sigma random-pair calibrated threshold (0.779) sat above every real query-doc cosine (max ~0.713, golds 0.46-0.68) -> 0/5000 dense docs cleared pool construction; corroborated by 70.0% never-surfaced on the 480-question ERB headline run (#93), matching an independent 67.2% on a 125-question set (cc-exchange Experiment B). Pool membership is strictly upstream of fusion - no re-weight can recover a gold the gate never admits.

Fix A (runtime, surgical): [retrieval] dense_pool_floor_genes = 8 - the gate (KnowledgeStore.query_docs_ann step 5, factored into pure apply_ann_gate) now admits the top-N dense candidates by cosine whenever fewer than N survived the threshold but the dense leg had scored candidates. Degrades gracefully instead of dying; 0 restores the legacy gate-only path; per-shard in sharded mode with normal router dedup. Byte-identical when the gate is healthy (legacy cut reproduced exactly; floor only fires while dense is starving).
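A minimal sketch of the floor behaviour described in Fix A, for readers who want to see the top-up logic in isolation. This is not the repository's `apply_ann_gate`: the function name `sketch_ann_gate`, its signature, and the `>=` comparison are assumptions for illustration, and the `min_genes` / `max_genes` handling of the real gate is deliberately left out.

```python
from typing import List, Tuple


def sketch_ann_gate(
    scored: List[Tuple[str, float]],
    threshold: float,
    floor: int = 8,
) -> List[str]:
    """Threshold cut plus a top-N floor (illustrative only; hypothetical name).

    `scored` holds (doc_id, cosine) pairs from the dense leg.
    `floor` plays the role of dense_pool_floor_genes; 0 means gate-only.
    """
    # Rank dense candidates by cosine, best first.
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)

    # Gate-only path: keep candidates at or above the threshold
    # (the inclusive comparison is an assumption of this sketch).
    admitted = [doc_id for doc_id, cos in ranked if cos >= threshold]

    # Floor: top up to `floor` entries only when the cut left fewer than that.
    # Already-admitted candidates count toward the floor.
    if floor > 0 and len(admitted) < floor:
        seen = set(admitted)
        for doc_id, _cos in ranked:
            if len(admitted) >= floor:
                break
            if doc_id not in seen:
                admitted.append(doc_id)
                seen.add(doc_id)
    return admitted


if __name__ == "__main__":
    candidates = [("a", 0.71), ("b", 0.60), ("c", 0.50), ("d", 0.40)]
    # Threshold above every candidate: the floor admits the top 2 by cosine.
    print(sketch_ann_gate(candidates, threshold=0.779, floor=2))
    # floor=0 reproduces the gate-only result, which is empty here.
    print(sketch_ann_gate(candidates, threshold=0.779, floor=0))
```

When the threshold already admits at least `floor` candidates, the top-up branch never runs, which is the property the PR relies on for the healthy-gate path.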

Fix B (root cause): scripts/calibrate_thresholds.py caps the random-pair bound at P05(true query-doc pairs) - 0.02 (--true-pairs, default 200; query = first ~12 content tokens, runtime codec; codec unavailable -> WARNING + legacy value). Both bounds + winner recorded in the TOML snippet, report JSON, and genome_calibration payload.
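The cap in Fix B can be sketched as a pure function. The names below (`percentile`, `cap_threshold`, and the keys of the returned dict) are hypothetical, and the linear-interpolation percentile is an assumption; the actual script may compute P05 differently.

```python
from typing import Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolated percentile for q in [0, 100] (assumed method)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def cap_threshold(
    random_pair_bound: float,
    true_pair_cosines: Sequence[float],
    margin: float = 0.02,
) -> Dict[str, object]:
    """Return min(random-pair bound, P05(true pairs) - margin) plus provenance."""
    if not true_pair_cosines:
        # No true pairs (e.g. codec unavailable): fall back to the legacy value.
        return {
            "ann_threshold": random_pair_bound,
            "true_pair_cap": None,
            "winner": "random_pair",
            "guard_skipped": True,
        }
    cap = percentile(true_pair_cosines, 5.0) - margin
    winner = "true_pair" if cap < random_pair_bound else "random_pair"
    return {
        "ann_threshold": min(random_pair_bound, cap),
        "true_pair_cap": cap,
        "winner": winner,
        "guard_skipped": False,
    }


if __name__ == "__main__":
    # Shape of the #214 case: random-pair bound 0.779, true pairs in 0.46-0.68.
    # The individual cosine values here are made up for the demonstration.
    print(cap_threshold(0.779, [0.46, 0.50, 0.55, 0.60, 0.68]))
```

Returning both bounds and the winner mirrors the PR's choice to record provenance alongside the final threshold, so a later reader can tell which bound was binding.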

Tests: 21 new (gate units incl. byte-identical healthy path, config plumb, cap math, integration-lite at threshold 0.99 floor-on/off); 54 passed locally across floor+dense_recall+ann_threshold+config; 2197-test sandbox sweep clean (4 failures pre-existing on base). Acceptance: ERB strat-90 + full-480 rerun vs the 30.0% recall@10 baseline, posting to #214/#93.

…leg can no longer be gated to zero (#214)

Evidence (two independent measurements):

* cc-exchange embedding-upgrade turn 0024 (L1b), 125-query corpus: the margin-over-random calibration (mu + 3*sigma over RANDOM linear-shard gene pairs) produced ann_threshold = 0.779, which sat ABOVE every real query-doc cosine in the corpus (max ~0.713; golds 0.46-0.68). Result: 0 of 5000 pool-doc chunks cleared the gate by dense - the dense leg admitted NOTHING into pool construction and the 200-pool was built almost entirely from lexically-pinned docs. 80/84 never-surfaced golds were dense-STRONG (sampled 22/22 pure-dense rank < 200, median 13).
* a 480-question ERB run measured 70.0% never-surfaced golds (gold not in top-10), matching an independent 67.2% measurement.

Pool membership is strictly upstream of fusion ranking: no dense re-weighting can recover a candidate the gate already dropped (@200_lt invariant at 41 under any fusion re-weight; provisional ceiling 41->121). A threshold that admits zero dense candidates is mis-calibration by definition. This is the catastrophic case of the threshold-transfer problem in docs/specs/2026-06-09-retrieval-profiles.md (Layer 1): random-pair +3sigma anisotropy exceeding the real query-doc cosine range.

Fix A - runtime dense pool floor (the surgical fix):

* The gate lives in KnowledgeStore.query_docs_ann step 5 (threshold cut + min_genes floor + max_genes cap). It is now factored into the module-level pure function knowledge_store.apply_ann_gate, which reproduces the legacy loop byte-for-byte and adds the floor: when fewer than [retrieval] dense_pool_floor_genes dense-scored candidates survive the cut - but the dense leg HAD scored candidates - the top-N dense hits by cosine are appended to the pool. They then compete normally in fusion scoring; nothing else about them is marked different. 0 disables (legacy gate-only). Default 8 ON, justified by the two measurements above: the floor makes the dense leg degrade gracefully instead of dying.
* Plumbed config.py (RetrievalConfig field + TOML loader) -> context_manager open_read_source kwargs -> KnowledgeStore ctor, following the dense_additive_weight pattern; routes_admin hot-swap path kept in sync. In sharded mode the kwarg fans to every per-shard KnowledgeStore (the gate is store-level; the V1 ShardedGenomeAdapter ANN surface is unreachable), so the floor holds per shard and the router-level merge dedups as usual. Documented in helix.toml.

Fix B - true-pair calibration cap (the root cause):

* scripts/calibrate_thresholds.py gains a true-pair guard: sample M genes (--true-pairs, default 200; 0 disables), synthesize a query per gene from its leading ~12 content tokens (the auto-synth convention of benchmarks/sweep_dense_additive_weight.py), encode with the same BGE-M3 codec the runtime uses (task="query"), and measure TRUE query-doc cosines against each gene's own stored vector. Final threshold = min(random_pair_mu_plus_sigma, P05(true_pairs) - 0.02).
* Both bounds + the winning bound are recorded in the TOML snippet, the calibration_report.json (ann_threshold.random_pair_threshold + ann_threshold.true_pair_guard), and the genome_calibration DB payload (extra keys are ignored by _get_effective_ann_threshold and surface through the provenance/diagnostics endpoints).
* If the codec/model is unavailable (no GPU / no transformers), the guard is skipped with a WARNING and the legacy random-pair value is emitted - calibration never hard-requires a model, and tests do not need one.

Also added the standard repo-root sys.path shim so the script (and the guard) work from a source checkout without `pip install -e .`.

Tests - tests/test_dense_pool_floor.py, 21 cases, no model needed:

* apply_ann_gate unit tests on synthetic cosine arrays: threshold above all candidates -> top-8 admitted; healthy gate (>= 8 passing) -> byte-identical to legacy; floor=0 -> legacy empty result; fewer than N total dense candidates -> all admitted; rescued ordering follows cosine; min_genes rescues stay ahead of floor appends; floor counts already-admitted dense and only tops up; empty dense leg -> no-op.
* Config plumbing: default 8, TOML load (incl. 0-disables), reaches the store, store default.
* Calibration cap math (pure functions): true-pair cap wins for the #214 shape (0.779 vs 0.46-0.68 true pairs) and lands below the real cosine max; random-pair bound wins when already below; missing true pairs -> legacy value + guard_skipped flag; skip/cap thread end-to-end through apply_true_pair_guard into emit_report/emit_toml_snippet; query synthesis token rule.
* Integration-lite: in-memory Genome (conftest FakeBGEM3Codec + hash_vec, mirroring test_dense_recall.py fixtures), ann threshold 0.99: dense candidates still surface through query_docs_ann with the floor on, and are gated out entirely with floor=0.

No existing test needed adjustment: the suite ran green with the floor default ON (2197 passed; the only 4 failures are pre-existing and environment-only - missing transformers / authority-boost cases - and reproduce identically on the base commit).

mbachaud merged commit 9af4c33 into master Jun 12, 2026
3 checks passed

mbachaud deleted the fix/214-dense-pool-floor branch June 12, 2026 17:21

This was referenced Jun 12, 2026

ANN-threshold calibration method gates the dense leg out of pool construction (0/5000 clear the calibrated 0.779) #214 (Closed)

Re-baseline SIKE curated needles as a scale sweep: XL + ERB 10K/50K/850K distractor beds #221 (Open)

mbachaud merged 1 commit into master from fix/214-dense-pool-floor
