fix(retrieval): dense pool floor + true-pair calibration cap - dense leg can no longer be gated to zero (#214) #217
Merged
…leg can no longer be gated to zero (#214)

Evidence (two independent measurements):

* cc-exchange embedding-upgrade turn 0024 (L1b), 125-query corpus: the margin-over-random calibration (mu + 3*sigma over RANDOM linear-shard gene pairs) produced ann_threshold = 0.779, which sat ABOVE every real query-doc cosine in the corpus (max ~0.713; golds 0.46-0.68). Result: 0 of 5000 pool-doc chunks cleared the gate by dense. The dense leg admitted NOTHING into pool construction, and the 200-pool was built almost entirely from lexically-pinned docs. 80/84 never-surfaced golds were dense-STRONG (sampled 22/22 pure-dense rank < 200, median 13).
* A 480-question ERB run measured 70.0% never-surfaced golds (gold not in top-10), matching an independent 67.2% measurement.

Pool membership is strictly upstream of fusion ranking: no dense re-weighting can recover a candidate the gate already dropped (@200_lt invariant at 41 under any fusion re-weight; provisional ceiling 41->121). A threshold that admits zero dense candidates is mis-calibration by definition. This is the catastrophic case of the threshold-transfer problem in docs/specs/2026-06-09-retrieval-profiles.md (Layer 1): random-pair +3sigma anisotropy exceeding the real query-doc cosine range.

Fix A - runtime dense pool floor (the surgical fix):

* The gate lives in KnowledgeStore.query_docs_ann step 5 (threshold cut + min_genes floor + max_genes cap). It is now factored into the module-level pure function knowledge_store.apply_ann_gate, which reproduces the legacy loop byte-for-byte and adds the floor: when fewer than [retrieval] dense_pool_floor_genes dense-scored candidates survive the cut (and the dense leg HAD scored candidates), the top-N dense hits by cosine are appended to the pool. They then compete normally in fusion scoring and are not otherwise marked as different. 0 disables (legacy gate-only).
Default 8 ON, justified by the two measurements above: the floor makes the dense leg degrade gracefully instead of dying.
* Plumbed config.py (RetrievalConfig field + TOML loader) -> context_manager open_read_source kwargs -> KnowledgeStore ctor, following the dense_additive_weight pattern; routes_admin hot-swap path kept in sync. In sharded mode the kwarg fans out to every per-shard KnowledgeStore (the gate is store-level; the V1 ShardedGenomeAdapter ANN surface is unreachable), so the floor holds per shard and the router-level merge dedups as usual. Documented in helix.toml.

Fix B - true-pair calibration cap (the root cause):

* scripts/calibrate_thresholds.py gains a true-pair guard: sample M genes (--true-pairs, default 200; 0 disables), synthesize a query per gene from its leading ~12 content tokens (the auto-synth convention of benchmarks/sweep_dense_additive_weight.py), encode with the same BGE-M3 codec the runtime uses (task="query"), and measure TRUE query-doc cosines against each gene's own stored vector. Final threshold = min(random_pair_mu_plus_sigma, P05(true_pairs) - 0.02).
* Both bounds + the winning bound are recorded in the TOML snippet, the calibration_report.json (ann_threshold.random_pair_threshold + ann_threshold.true_pair_guard), and the genome_calibration DB payload (extra keys are ignored by _get_effective_ann_threshold and surface through the provenance/diagnostics endpoints).
* If the codec/model is unavailable (no GPU / no transformers), the guard is skipped with a WARNING and the legacy random-pair value is emitted: calibration never hard-requires a model, and tests do not need one.

Also added the standard repo-root sys.path shim so the script (and the guard) work from a source checkout without `pip install -e .`.
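
The cap arithmetic in Fix B can be sketched as follows. This is a minimal illustration, not the code in scripts/calibrate_thresholds.py: the function names, the returned dictionary keys, and the linear-interpolation percentile are all assumptions made for the example, and the cosine values are synthetic numbers chosen to span the 0.46-0.68 gold range quoted above.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Linear-interpolation percentile (assumed convention; the script may differ)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def cap_threshold(
    random_bound: float,
    true_cosines: Sequence[float],
    margin: float = 0.02,
) -> Dict[str, object]:
    """Hypothetical helper: min(random-pair bound, P05(true pairs) - margin)."""
    if not true_cosines:
        # Guard skipped (e.g. no codec available): fall back to the legacy value.
        return {
            "threshold": random_bound,
            "true_pair_cap": None,
            "winner": "random_pair",
            "guard_skipped": True,
        }
    cap = percentile(true_cosines, 5.0) - margin
    winner = "true_pair" if cap < random_bound else "random_pair"
    return {
        "threshold": min(random_bound, cap),
        "true_pair_cap": cap,
        "winner": winner,
        "guard_skipped": False,
    }


# Synthetic true-pair cosines spanning 0.46-0.68 (illustrative, not measured).
synthetic_true_pairs = [0.46 + 0.01 * i for i in range(23)]
result = cap_threshold(0.779, synthetic_true_pairs)
print(result["winner"], round(result["threshold"], 3))
```

With the #214 shape (random-pair bound 0.779 against true pairs in the 0.46-0.68 range) the true-pair cap wins and the threshold lands below the real cosine maximum; with an empty true-pair sample the legacy value is returned and the skip is flagged.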

Tests - tests/test_dense_pool_floor.py, 21 cases, no model needed:

* apply_ann_gate unit tests on synthetic cosine arrays: threshold above all candidates -> top-8 admitted; healthy gate (>= 8 passing) -> byte-identical to legacy; floor=0 -> legacy empty result; fewer than N total dense candidates -> all admitted; rescued ordering follows cosine; min_genes rescues stay ahead of floor appends; floor counts already-admitted dense and only tops up; empty dense leg -> no-op.
* Config plumbing: default 8, TOML load (incl. 0-disables), reaches the store, store default.
* Calibration cap math (pure functions): true-pair cap wins for the #214 shape (0.779 vs 0.46-0.68 true pairs) and lands below the real cosine max; random-pair bound wins when already below; missing true pairs -> legacy value + guard_skipped flag; skip/cap thread end-to-end through apply_true_pair_guard into emit_report/emit_toml_snippet; query synthesis token rule.
* Integration-lite: in-memory Genome (conftest FakeBGEM3Codec + hash_vec, mirroring test_dense_recall.py fixtures), ann threshold 0.99: dense candidates still surface through query_docs_ann with the floor on, and are gated out entirely with floor=0.

No existing test needed adjustment: the suite ran green with the floor default ON (2197 passed; the only 4 failures are pre-existing and environment-only, namely missing transformers / authority-boost cases, and reproduce identically on the base commit).

Closes #214. Evidence: the mu+3sigma random-pair calibrated threshold (0.779) sat above every real query-doc cosine (max ~0.713, golds 0.46-0.68) -> 0/5000 pool-doc chunks cleared the dense gate during pool construction; corroborated by 70.0% never-surfaced golds on the 480-question ERB headline run (#93), matching an independent 67.2% on a 125-question set (cc-exchange Experiment B). Pool membership is strictly upstream of fusion - no re-weight can recover a gold the gate never admits.
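
To make the failure mode concrete, here is a small sketch of a mu + 3*sigma random-pair calibration overshooting the real query-doc range. Everything in it is an assumption for illustration: the helper names are hypothetical, the population standard deviation is a guess at the convention, and both cosine lists are synthetic (the query-doc list only borrows the 0.46-0.713 range quoted above).

```python
from statistics import mean, pstdev
from typing import Sequence


def random_pair_threshold(random_cosines: Sequence[float], sigmas: float = 3.0) -> float:
    """mu + sigmas * sigma over random-pair cosines (population sigma assumed)."""
    return mean(random_cosines) + sigmas * pstdev(random_cosines)


def count_cleared(cosines: Sequence[float], threshold: float) -> int:
    """Number of candidates whose cosine reaches the threshold (>= assumed)."""
    return sum(1 for c in cosines if c >= threshold)


# Synthetic random-pair cosines from an anisotropic embedding space.
random_pairs = [0.50, 0.55, 0.60, 0.65, 0.70]
# Synthetic query-doc cosines inside the range reported in this PR.
query_doc = [0.46, 0.52, 0.61, 0.68, 0.713]

threshold = random_pair_threshold(random_pairs)
print(round(threshold, 3), count_cleared(query_doc, threshold))
```

When random pairs already sit high and spread wide, the calibrated threshold ends up above every query-doc cosine and nothing clears the gate, which is the shape of the #214 measurement.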
Fix A (runtime, surgical): [retrieval] dense_pool_floor_genes = 8 - the gate (KnowledgeStore.query_docs_ann step 5, factored into pure apply_ann_gate) now admits the top-N dense candidates by cosine whenever fewer than N survived the threshold but the dense leg had scored candidates. Degrades gracefully instead of dying; 0 restores the legacy gate-only path; per-shard in sharded mode with normal router dedup. Byte-identical when the gate is healthy (legacy cut reproduced exactly; floor only fires while dense is starving).
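
A minimal sketch of the floor behaviour described in Fix A, under stated assumptions. It is not apply_ann_gate: the function name, signature, `>=` comparison, and tie-break are hypothetical, and the min_genes floor and max_genes cap from step 5 are left out to keep the example short.

```python
from typing import Dict, List


def gate_with_floor(cosines: Dict[str, float], threshold: float, floor: int = 8) -> List[str]:
    """Hypothetical simplification: threshold cut, then top up to `floor` by cosine."""
    # Rank dense-scored candidates by cosine, highest first (id breaks ties).
    ranked = sorted(cosines, key=lambda gene: (-cosines[gene], gene))
    admitted = [gene for gene in ranked if cosines[gene] >= threshold]

    # The floor only fires while the dense leg is starving; floor=0 is gate-only.
    if floor > 0 and len(admitted) < floor:
        seen = set(admitted)
        for gene in ranked:
            if len(admitted) >= floor:
                break
            if gene not in seen:
                admitted.append(gene)
                seen.add(gene)
    return admitted


# Synthetic candidates with cosines from 0.40 to 0.67, all below 0.779.
candidates = {f"g{i}": 0.40 + 0.03 * i for i in range(10)}

starved = gate_with_floor(candidates, threshold=0.779, floor=8)
legacy = gate_with_floor(candidates, threshold=0.779, floor=0)
healthy = gate_with_floor(candidates, threshold=0.0, floor=8)
print(len(starved), len(legacy), len(healthy))
```

In this sketch a threshold above every candidate still yields the top 8 by cosine, `floor=0` reproduces the empty gate-only result, and a healthy gate is untouched because the top-up loop never runs. An empty dense leg returns an empty list, so the floor is a no-op there.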
Fix B (root cause): scripts/calibrate_thresholds.py caps the random-pair bound at P05(true query-doc pairs) - 0.02 (--true-pairs, default 200; query = first ~12 content tokens, runtime codec; codec unavailable -> WARNING + legacy value). Both bounds + winner recorded in the TOML snippet, report JSON, and genome_calibration payload.
Tests: 21 new (gate units incl. byte-identical healthy path, config plumb, cap math, integration-lite at threshold 0.99 floor-on/off); 54 passed locally across floor+dense_recall+ann_threshold+config; 2197-test sandbox sweep clean (4 failures pre-existing on base). Acceptance: ERB strat-90 + full-480 rerun vs the 30.0% recall@10 baseline, posting to #214/#93.