fix(retrieval): dense pool floor + true-pair calibration cap - dense leg can no longer be gated to zero (#214) #217

Merged
mbachaud merged 1 commit into master from fix/214-dense-pool-floor
Jun 12, 2026

@mbachaud

Closes #214. Evidence: the mu+3sigma random-pair calibrated threshold (0.779) sat above every real query-doc cosine (max ~0.713, golds 0.46-0.68), so 0/5000 dense docs cleared the gate into pool construction. This is corroborated by 70.0% of golds never surfacing on the 480-question ERB headline run (#93), matching an independent 67.2% on a 125-question set (cc-exchange Experiment B). Pool membership is strictly upstream of fusion: no re-weight can recover a gold the gate never admits.

Fix A (runtime, surgical): [retrieval] dense_pool_floor_genes = 8 - the gate (KnowledgeStore.query_docs_ann step 5, factored into pure apply_ann_gate) now admits the top-N dense candidates by cosine whenever fewer than N survived the threshold but the dense leg had scored candidates. Degrades gracefully instead of dying; 0 restores the legacy gate-only path; per-shard in sharded mode with normal router dedup. Byte-identical when the gate is healthy (legacy cut reproduced exactly; floor only fires while dense is starving).

Fix B (root cause): scripts/calibrate_thresholds.py caps the random-pair bound at P05(true query-doc pairs) - 0.02 (--true-pairs, default 200; query = first ~12 content tokens, runtime codec; codec unavailable -> WARNING + legacy value). Both bounds + winner recorded in the TOML snippet, report JSON, and genome_calibration payload.

Tests: 21 new (gate units incl. byte-identical healthy path, config plumb, cap math, integration-lite at threshold 0.99 floor-on/off); 54 passed locally across floor+dense_recall+ann_threshold+config; 2197-test sandbox sweep clean (4 failures pre-existing on base). Acceptance: ERB strat-90 + full-480 rerun vs the 30.0% recall@10 baseline, posting to #214/#93.

…leg can no longer be gated to zero (#214)

Evidence (two independent measurements):

* cc-exchange embedding-upgrade turn 0024 (L1b), 125-query corpus: the
  margin-over-random calibration (mu + 3*sigma over RANDOM linear-shard
  gene pairs) produced ann_threshold = 0.779, which sat ABOVE every real
  query-doc cosine in the corpus (max ~0.713; golds 0.46-0.68). Result:
  0 of 5000 pool-doc chunks cleared the gate by dense — the dense leg
  admitted NOTHING into pool construction and the 200-pool was built
  almost entirely from lexically-pinned docs. 80/84 never-surfaced golds
  were dense-STRONG (sampled 22/22 pure-dense rank < 200, median 13).
* a 480-question ERB run measured 70.0% never-surfaced golds (gold not
  in top-10), matching the independent 67.2% (84 of 125) never-surfaced
  rate on the 125-query corpus above.

Pool membership is strictly upstream of fusion ranking: no dense
re-weighting can recover a candidate the gate already dropped (@200_lt
invariant at 41 under any fusion re-weight; provisional ceiling 41->121).
A threshold that admits zero dense candidates is mis-calibration by
definition. This is the catastrophic case of the threshold-transfer
problem in docs/specs/2026-06-09-retrieval-profiles.md (Layer 1):
random-pair +3sigma anisotropy exceeding the real query-doc cosine range.
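
A minimal sketch of the failure mode, assuming the gate admits a candidate when its cosine is at or above the threshold (the comparison operator is an assumption). The helper names `random_pair_threshold` and `admitted` are hypothetical, and the sample cosines are illustrative values chosen inside the ranges reported above, not measured data.

```python
import statistics


def random_pair_threshold(random_cosines, k=3.0):
    """Margin-over-random bound: mean + k * population std of random-pair cosines."""
    mu = statistics.fmean(random_cosines)
    sigma = statistics.pstdev(random_cosines)
    return mu + k * sigma


def admitted(cosines, threshold):
    """Threshold cut (assumed >=): keep only cosines that clear the gate."""
    return [c for c in cosines if c >= threshold]


# Figures quoted in this PR: calibrated threshold 0.779, real cosines up to ~0.713.
threshold = 0.779
real_cosines = [0.46, 0.55, 0.68, 0.713]  # illustrative, within the reported range

# Synthetic anisotropic sample: random pairs that already score high push the
# mu + 3*sigma bound above anything a real query-doc pair reaches.
bound = random_pair_threshold([0.70, 0.72, 0.74, 0.76])

print(len(admitted(real_cosines, threshold)))  # 0
print(bound > max(real_cosines))               # True
```

With every real cosine below the bound, the cut returns an empty list, which is why no downstream fusion weight can bring those candidates back.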

Fix A — runtime dense pool floor (the surgical fix):

* The gate lives in KnowledgeStore.query_docs_ann step 5 (threshold cut
  + min_genes floor + max_genes cap). It is now factored into the
  module-level pure function knowledge_store.apply_ann_gate, which
  reproduces the legacy loop byte-for-byte and adds the floor: when
  fewer than [retrieval] dense_pool_floor_genes dense-scored candidates
  survive the cut — but the dense leg HAD scored candidates — the top-N
  dense hits by cosine are appended to the pool. They then compete
  normally in fusion scoring; nothing else about them is marked
  different. 0 disables (legacy gate-only). Default 8 ON, justified by
  the two measurements above: the floor makes the dense leg degrade
  gracefully instead of dying.
* Plumbed through config.py (RetrievalConfig field + TOML loader) ->
  context_manager open_read_source kwargs -> KnowledgeStore ctor,
  following the dense_additive_weight pattern; routes_admin hot-swap
  path kept in sync. In sharded mode the kwarg fans to every per-shard
  KnowledgeStore (the gate is store-level; the V1 ShardedGenomeAdapter
  ANN surface is unreachable), so the floor holds per shard and the
  router-level merge dedups as usual. Documented in helix.toml.
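
The floor behaviour described above can be sketched as follows. This is not the repository's `apply_ann_gate`: the function name `gate_with_floor`, its signature, and the `(doc_id, cosine)` candidate shape are hypothetical, the `min_genes` floor and `max_genes` cap are left out, and the `>=` comparison is an assumption.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (doc_id, cosine); hypothetical shape


def gate_with_floor(
    candidates: List[Candidate], threshold: float, floor: int = 8
) -> List[Candidate]:
    """Threshold cut, then top up to `floor` dense candidates by cosine."""
    # Legacy path: keep candidates that clear the threshold, in input order.
    passed = [c for c in candidates if c[1] >= threshold]

    # floor=0 disables the rescue; a healthy gate or an empty dense leg is a no-op.
    if floor <= 0 or len(passed) >= floor or not candidates:
        return passed

    # Rescue: append the best rejected candidates by cosine until the floor is
    # met. Already-admitted candidates count toward the floor.
    passed_ids = {doc_id for doc_id, _ in passed}
    rejected = sorted(
        (c for c in candidates if c[0] not in passed_ids),
        key=lambda c: c[1],
        reverse=True,
    )
    return passed + rejected[: floor - len(passed)]


# Ten synthetic candidates with cosines 0.40 .. 0.67, all below 0.779.
cands = [(f"d{i}", 0.40 + 0.03 * i) for i in range(10)]

print(len(gate_with_floor(cands, 0.779)))           # 8
print(len(gate_with_floor(cands, 0.779, floor=0)))  # 0
```

Appending rescued candidates after the ones that passed keeps the healthy-gate output unchanged, which mirrors the byte-identical requirement stated above.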

Fix B — true-pair calibration cap (the root cause):

* scripts/calibrate_thresholds.py gains a true-pair guard: sample M
  genes (--true-pairs, default 200; 0 disables), synthesize a query per
  gene from its leading ~12 content tokens (the auto-synth convention of
  benchmarks/sweep_dense_additive_weight.py), encode with the same
  BGE-M3 codec the runtime uses (task="query"), and measure TRUE
  query-doc cosines against each gene's own stored vector. Final
  threshold = min(random-pair mu + 3*sigma, P05(true_pairs) - 0.02).
* Both bounds + the winning bound are recorded in the TOML snippet, the
  calibration_report.json (ann_threshold.random_pair_threshold +
  ann_threshold.true_pair_guard), and the genome_calibration DB payload
  (extra keys are ignored by _get_effective_ann_threshold and surface
  through the provenance/diagnostics endpoints).
* If the codec/model is unavailable (no GPU / no transformers), the
  guard is skipped with a WARNING and the legacy random-pair value is
  emitted — calibration never hard-requires a model, and tests do not
  need one. Also added the standard repo-root sys.path shim so the
  script (and the guard) work from a source checkout without
  `pip install -e .`.
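
The cap arithmetic can be sketched as below. The function names, the returned dict keys other than `guard_skipped`, and the linear-interpolation percentile are assumptions; the script's actual percentile method is not stated here. The true-pair cosines are synthetic values spanning the 0.46-0.68 range quoted above.

```python
def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]; needs non-empty input."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def cap_threshold(random_bound, true_cosines, q=5.0, margin=0.02):
    """Final threshold = min(random-pair bound, P05(true pairs) - margin)."""
    if not true_cosines:
        # No true pairs (e.g. codec unavailable): fall back to the legacy value.
        return {"threshold": random_bound, "winner": "random_pair", "guard_skipped": True}

    true_bound = percentile(true_cosines, q) - margin
    if true_bound < random_bound:
        return {"threshold": true_bound, "winner": "true_pair", "guard_skipped": False}
    return {"threshold": random_bound, "winner": "random_pair", "guard_skipped": False}


# Synthetic true-pair cosines, evenly spaced over 0.46 .. 0.68.
true_cosines = [0.46 + 0.01 * i for i in range(23)]

capped = cap_threshold(0.779, true_cosines)
print(capped["winner"])             # true_pair
print(capped["threshold"] < 0.713)  # True
```

Recording which bound won, as the report and TOML snippet do, makes it possible to tell a capped threshold from an uncapped one after the fact.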

Tests — tests/test_dense_pool_floor.py, 21 cases, no model needed:

* apply_ann_gate unit tests on synthetic cosine arrays: threshold above
  all candidates -> top-8 admitted; healthy gate (>= 8 passing) ->
  byte-identical to legacy; floor=0 -> legacy empty result; fewer than N
  total dense candidates -> all admitted; rescued ordering follows
  cosine; min_genes rescues stay ahead of floor appends; floor counts
  already-admitted dense and only tops up; empty dense leg -> no-op.
* Config plumbing: default 8, TOML load (incl. 0-disables), reaches the
  store, store default.
* Calibration cap math (pure functions): true-pair cap wins for the
  #214 shape (0.779 vs 0.46-0.68 true pairs) and lands below the real
  cosine max; random-pair bound wins when already below; missing true
  pairs -> legacy value + guard_skipped flag; skip/cap thread end-to-end
  through apply_true_pair_guard into emit_report/emit_toml_snippet;
  query synthesis token rule.
* Integration-lite: in-memory Genome (conftest FakeBGEM3Codec +
  hash_vec, mirroring test_dense_recall.py fixtures), ann threshold
  0.99: dense candidates still surface through query_docs_ann with the
  floor on, and are gated out entirely with floor=0.

No existing test needed adjustment: the suite ran green with the floor
default ON (2197 passed; the only 4 failures are pre-existing and
environment-only — missing transformers / authority-boost cases — and
reproduce identically on the base commit).
@mbachaud mbachaud merged commit 9af4c33 into master Jun 12, 2026
3 checks passed
@mbachaud mbachaud deleted the fix/214-dense-pool-floor branch June 12, 2026 17:21

Development

Successfully merging this pull request may close these issues.

ANN-threshold calibration method gates the dense leg out of pool construction (0/5000 clear the calibrated 0.779)