fix(public): de-hardcode wave 1 + retrieval-profiles spec + evidence roadmap#201
Merged
Conversation
…ing/config, generic path-affinity rescue Public-product readiness pass. The shipped defaults baked the original author's private project vocabulary and machine paths into runtime retrieval surfaces. That matters beyond cosmetics: CpuTagger's EntityRuler patterns and _TECH_TERMS become promoter tags at ingest, and promoter tags drive the Tier-1 tag_exact retrieval tier (tag_exact_weight = 3.0) — so any public deployment was spending its strongest lexical tier matching another person's project names. Likewise lexical_rescue's path bonus hardwired +2.0 boosts for "helix-context" paths and "/helix.toml", biasing source rescue toward the product's own repository on every corpus. Changes: * tagger.py — removed owner-project entries from _PROJECT_ENTITIES (BigEd, BigEd CC, BookKeeper, CosmicTasha, two-brain, ModuleHub, Dr. Ders, FleetDB, fleet.toml as PRODUCT; SwiftWing21 as ORG) and from _TECH_TERMS (biged, bookkeeper, cosmictasha, modulehub, fleetdb, "dr ders"). The product's own stack (helix, agentome, scorerift, splade, ...) and generic IR/NLP vocabulary stay. New module-level _extend_vocabulary_from_env() reads HELIX_TAGGER_EXTRA_ENTITIES (comma-separated "Label:Text" pairs) and HELIX_TAGGER_EXTRA_TERMS (comma-separated lowercased terms) once at import so operators re-add deployment vocabulary without forking; malformed pairs are skipped with a warning. Comments note the previously owner-specific defaults removed for public release. * retrieval/lexical_rescue.py — dropped the hardwired helix-in-query + "helix-context"-in-path +2.0, the "/helix.toml" +2.0, and the owner-specific "/_worktrees/" -1.0 penalty. Replaced with one corpus-neutral heuristic: +2.0 (once) when any query term of len >= 4 names a whole path segment or filename stem of the candidate — "acme" boosts /acme/ and acme.toml exactly the way "helix" boosts helix.toml. Existing generic scores (config-ext +1.5/+1.0, substring agreement +0.4, tests-path -0.75) unchanged. Module docstring now states the full scoring contract. * helix.toml — [synonyms]: removed owner rows biged / bookkeeper / cosmictasha / fleet, dropped "scorerift" from the scoring row and "biged" from the agents row; generic software vocabulary kept. [mem_sync]: personal watch_dirs path replaced with an empty list plus an example comment. * launcher templates (dashboard.html, components/database_panel.html) — placeholder F:\path\to\new.genome.db -> path/to/new.genome.db. * bridge.py — generated SHARED_CONTEXT.md "How to Use" line now renders self.helix_base_url instead of hardcoding http://127.0.0.1:11437. * launcher/installer.py + launcher/app.py — install_service() grew a defaulted port param (11438) used in the printed dashboard URL, threaded from the CLI's existing --port flag through _handle_service_command. Tests: new tests/test_public_dehardcode.py (16 tests) pins (a) no owner vocabulary in tagger patterns/terms, (b) env-var extension via the importlib.reload pattern, (c) the generic path-segment boost, structural score equality between helix-context and acme-context paths, worktrees penalty removed, tests penalty kept, and end-to-end rescue preference for a query-term path match, (d) helix.toml parses with no owner synonym keys and empty watch_dirs. Owner-named fixtures replaced with generic equivalents in test_density_gate.py, test_shard_router.py, test_bench_needles.py; test_launcher_installer.py call-through assertion updated to expect port=11438. Out of scope, noted for wave 2: com.swiftwing21.helix-launcher plist id (installer + deploy templates + tests pin it), "swiftwing" org fixtures in test_server/test_mcp_server, biged_skills empirical-curve comments in test_abstain_tier.py / pipeline/tier_logic.py, and docstring examples in knowledge_store.py / storage/ddl.py.
…-steps roadmap From the 2026-06-09 research/examine/document loop (3 parallel research agents over issues+PRs, hardcoding audit, and tuning-evidence study). Answers the profiles question: one pipeline cannot serve all workloads, but only ~6 knobs are corpus-sensitive — design = auto-calibration + classifier-owned query-time adaptation + 3 small corpus profiles.
This was referenced Jun 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Output of the 2026-06-09 research → examine → document loop (3 parallel research agents: issues/PRs evidence audit, public-product hardcoding audit, retrieval-tuning evidence study).
De-hardcode wave 1 (runtime retrieval code)
_PROJECT_ENTITIES/_TECH_TERMS— these drive the tag_exact tier, so every public user's corpus was being tagged against the owner's project names. Operators extend viaHELIX_TAGGER_EXTRA_ENTITIES("Label:Text" pairs) /HELIX_TAGGER_EXTRA_TERMS.helixquery +helix-contextpath +2.0,/helix.toml+2.0) and the owner's/_worktrees/−1.0 layout penalty; replaced with a generic +2.0 when a query term (len≥4) names a path segment/filename stem. Structural bias toward the product's own repo is gone (helix/acme now score identically).mem_sync.watch_dirspath scrubbed.F:\path\to\new.genome.db→path/to/new.genome.db.helix_base_url(was hardcoded :11437); installer service message threads the real--port.Docs (same loop)
docs/specs/2026-06-09-retrieval-profiles.md— the profiles/modes answer: one pipeline can't serve all workloads, but only ~6 knobs are corpus-sensitive (dense_additive_weight ±19pp on prose, SPLADE 21.1% disk for 0pp at 850K, non-transferable absolute thresholds, filename anchor +24pp code-only, synonyms, abstain floors). Design = Layer 1 auto-calibration (genome_calibration machinery exists) + Layer 2 classifier-owned semantic arm (kills the HELIX_SEMANTIC_ARM env gate) + Layer 3 three small corpus profiles (code/prose/small-mixed). Blocking measurements ranked.docs/audits/2026-06-09-next-steps-evidence.md— ranked roadmap with evidence (500K rebuild+bench vs published 68.4/68.8/72.4 baselines, Audit fingerprint routing index (path_key_index) storage: 34.1% of v2 corpus #165 verification + residual ~2–3 GB schema wins, the two deferred default-flips, Wall-2 orphaned PRs perf(retrieval): parallel cross-shard fan-out + lifted SPLADE query encoding #158/feat(retrieval): SPLADE pre-filter for dense matmul (Issue #159 Wall-2) #160, hardcoding wave 2 list).Test plan
test_public_dehardcode.py): owner-vocab absence, env extension, generic-vs-helix score equality, template neutrality