feat: source semantic hardening — A→B→C staged L2 with domain + few-shot#63

Merged: deanban merged 20 commits into main from dean/feat/source-semantic-hardening on Apr 21, 2026

@deanban deanban commented Apr 21, 2026

Summary

Replaces the monolithic single-pass / two-pass L2 semantic interpretation with a staged A→B→C→merge pipeline, adds a domain layer (healthcare first), a few-shot example library, and an evaluation harness.

  • 69 files (excluding eval artifacts), +10,468 / −1,427 lines of code + tests
  • 1004 unit tests, 87% coverage, mypy clean
  • Closes rollout steps 1–5 and 7 (cleanup) of the source-semantic-hardening OpenSpec change; step 6 and the holdout bias check are validated on the 12-table POC slice only, and the full 33-table corpus run is gated on the remaining cBioPortal ingest

What's in it

Staged L2 pipeline (§2–4)

  • StageAResult, StageBColumnResult, StageBBatchResult, StageBResult, StageCResult, StageCBatchResult, StageStatus schemas (src/sema/models/stages.py)
  • Stage A: entity + grain hypothesis
  • Stage B: property classification with bounded recovery (retry, batch-split, Tier-1 rescue) and B_SUCCESS / B_PARTIAL / B_FAILED outcomes
  • Stage C: conditional value decoding with deterministic trigger (skips identifiers / timestamps / free-text / unresolved B columns)
  • Merge step with explicit ownership matrix — A proposes entity, B owns property/semantic_type/alias, C exclusively owns has_decoded_value; no vocabulary_match emitted from L2
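
A minimal sketch of how an ownership matrix like the one above could be enforced at merge time. It is illustrative only and not the code in this PR: the `OWNERS` table, the tuple layout, and the predicate names `has_semantic_type` and `has_alias` are assumptions (only `has_entity_name`, `has_property_name`, `has_decoded_value`, and `vocabulary_match` appear in this PR).

```python
from typing import Dict, List, Tuple

# Hypothetical ownership table: predicate -> the only stage allowed to emit it.
# `vocabulary_match` is deliberately absent, so L2 can never emit it.
OWNERS: Dict[str, str] = {
    "has_entity_name": "A",
    "has_property_name": "B",
    "has_semantic_type": "B",  # assumed predicate name
    "has_alias": "B",          # assumed predicate name
    "has_decoded_value": "C",
}

Assertion = Tuple[str, str, str]  # (subject_ref, predicate, value)


def merge(stage_outputs: Dict[str, List[Assertion]]) -> List[Assertion]:
    """Keep an assertion only when the emitting stage owns its predicate."""
    merged: List[Assertion] = []
    for stage in ("A", "B", "C"):
        for subject, predicate, value in stage_outputs.get(stage, []):
            if OWNERS.get(predicate) == stage:
                merged.append((subject, predicate, value))
    return merged
```

With this shape, a `has_decoded_value` proposed by Stage B or a `vocabulary_match` from any stage is silently dropped, which is one way to make the ownership rule mechanical rather than a convention.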

Domain layer (§1, §7)

  • DomainContext, DomainCandidate models + --domain CLI flag + config + profiler-based detection (src/sema/models/domain.py)
  • Precedence: CLI > config > profiler > default
  • Domain-aware prompt composition: healthcare vs. generic semantic type inventories, vocabulary family hints, dual-domain softened headers on conflict
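
The precedence rule above can be sketched as a first-non-empty lookup. This is an assumption-laden illustration, not the `DomainContext` implementation: the function name `resolve_domain`, the `"generic"` default, and the returned `(domain, source)` pair are all hypothetical.

```python
from typing import Optional, Tuple


def resolve_domain(
    cli: Optional[str],
    config: Optional[str],
    profiler: Optional[str],
    default: str = "generic",  # assumed default domain
) -> Tuple[str, str]:
    """Return (domain, source) using CLI > config > profiler > default."""
    for source, value in (("cli", cli), ("config", config), ("profiler", profiler)):
        if value:
            return value, source
    return default, "default"
```

For example, `resolve_domain(None, "healthcare", "generic")` picks the config value even though the profiler suggested something else. Returning the source alongside the domain makes it easy to keep the profiler evidence around when it is overridden.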

Few-shot library (§8)

  • 5 Stage A examples, 12 Stage B column examples, 8 Stage C decoding examples for healthcare
  • Fixes LLM alias-dropping regression by including realistic synonyms in examples (`783266d`)

Evaluation harness (§5–6)

  • `sema eval` CLI: dev-slice runner, structured diff, telemetry aggregator, milestone report
  • Per-stage telemetry: call counts, latencies, tokens, recovery metrics, B-outcome distribution, C trigger rate
  • Dev slice (`eval/dev_slice.yaml`) + holdout (`eval/holdout.yaml`) versioned in repo

Cleanup (§11)

  • Removed `PropertyInterpretation`, `TableInterpretation`, `_PropertyBatchResult`, `build_interpretation_prompt`, `build_simplified_interpretation_prompt`, `_needs_two_pass`, and all single-pass / two-pass scaffolding
  • Deleted `src/sema/engine/semantic_utils.py`; `SemanticEngine.interpret_table` reduced to thin staged wrapper
  • `BuildConfig.use_staged` removed — staged is the sole path

Milestone results (12-table POC slice)

See `eval-runs/step6-milestone-summary.md`.

| metric | value | budget | status |
| --- | --- | --- | --- |
| B outcome distribution | 12 success / 0 partial / 0 failed | | PASS |
| Raw / critical coverage | 100% / 100% | | PASS |
| Stage C trigger rate | 30.7% avg (95/259 cols) | | |
| Recovery overhead | 0 retries, 0 splits, 0 rescues | | |
| Cost (DeepSeek) | $0.0048 / table | $0.10 / table | PASS (21× under) |
| Latency | 23.1 s / table | 60 s / table | PASS (2.6× under) |
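
The "× under" figures in the cost and latency rows are the budget divided by the observed value. A small sketch of that arithmetic, using only the numbers reported above (the helper name `headroom` is hypothetical):

```python
def headroom(observed: float, budget: float) -> float:
    """How many times the observed value fits inside the budget."""
    return budget / observed


cost_x = headroom(observed=0.0048, budget=0.10)   # dollars per table
latency_x = headroom(observed=23.1, budget=60.0)  # seconds per table

print(f"cost: {cost_x:.0f}x under, latency: {latency_x:.1f}x under")
# prints "cost: 21x under, latency: 2.6x under"
```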

Every removal cluster flagged across the 5-step rollout is root-caused and either design-intended (`vocabulary_match` → L3 per design §2a; `has_decoded_value` restored at step 5) or fixed (`BIOTYPE (STRING)` column-name leak in `46384de`, alias regression in `783266d`). No open systemic regressions; no high-value predicates lost.

What's still open

All blocked on ingesting the remaining ~21 cBioPortal tables (see `§11-bis Pending ingest` in `openspec/changes/source-semantic-hardening/tasks.md`):

  • 10.1 Run full 33-table corpus
  • 10.4 / 8.8 Holdout-vs-dev-slice bias check (8 of 10 holdout tables not ingested; 2 contaminated)

Known issue discovered during spot-check

`patient.SUBTYPE=GBM_IDHmut-non-codel` gets decoded by Stage C as `"Glioblastoma, IDH mutant, non-codisplayed (non-codel)"`. "Non-codisplayed" is an LLM hallucination — the correct clinical term is non-codeleted (1p/19q codeletion status in glioma classification). Low-severity single-assertion issue; points to a real gap — Stage C emits a `codebook_lookup_needed` flag but nothing consumes it. Will be filed as a follow-up issue.

Test plan

  • `uv run pytest` — 1004 passed, 1 skipped, 87% coverage
  • `uv run mypy src/sema/` — clean on 94 source files
  • Dev-slice run on 12 tables (`eval-runs/step5-post-cleanup/`) — no systemic regressions vs. pre-cleanup
  • Spot-check on 6 of 12 tables — entity names, property names, semantic types, decoded values all sensible except the single GBM hallucination noted above
  • Full 33-table corpus run — blocked on Databricks ingest
  • Holdout bias check — blocked on ingest

deanban commented Apr 21, 2026

Scope note: The two unchecked items in the test plan —

  • Full 33-table cBioPortal corpus run
  • Holdout-vs-dev-slice bias check

— are blocked on Databricks reactivation and ingestion of the remaining ~21 cBioPortal tables (see §11-bis Pending ingest in tasks.md). They will be handled in a subsequent PR, tracked as issue #72. This PR is scoped to the 12-table POC slice sign-off.

deanban added 19 commits April 21, 2026 13:02
Domain precedence: CLI > config > profiler > default.
Profiler evidence preserved when CLI/config overrides.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
StageAResult, StageBColumnResult, StageBBatchResult, StageCResult,
StageCBatchResult, StageStatus, UnresolvedColumn models.
Stage A/B prompt builders with domain context slots.
Critical column identification, coverage computation, B pass/fail logic.
Stage C deterministic trigger with low-cardinality fallback.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
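
For readers unfamiliar with the idea, a deterministic Stage C trigger with a low-cardinality fallback might look roughly like the sketch below. Everything here is assumed for illustration: the semantic type labels, the `MAX_DISTINCT` threshold, and the function name `should_decode` are not taken from the repository.

```python
from typing import Optional

# Assumed labels; the real semantic type inventory lives in the prompt layer.
SKIP_TYPES = {"identifier", "timestamp", "free_text"}
CODED_TYPES = {"categorical", "coded_value"}
MAX_DISTINCT = 50  # assumed low-cardinality threshold


def should_decode(semantic_type: Optional[str], distinct_count: int) -> bool:
    """Decide, without calling an LLM, whether a column goes to Stage C."""
    if semantic_type is None:          # column left unresolved by Stage B
        return False
    if semantic_type in SKIP_TYPES:    # nothing to decode in ids, times, prose
        return False
    if semantic_type in CODED_TYPES:   # explicitly coded columns always qualify
        return True
    return distinct_count <= MAX_DISTINCT  # low-cardinality fallback
```

Because the decision depends only on Stage B output and profiler counts, the same table always triggers the same set of Stage C calls, which keeps the trigger rate comparable across runs.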
interpret_table_staged() runs full A→B→C→merge.
Merge ownership matrix: A=entity, B=property, C=decoded values.
Bounded B recovery: retry, split, Tier 1 rescue.
semantic_unresolved produced for low-confidence ambiguous columns.
VocabColumnContext enriched with B output at version 1.
use_staged=True default with PromptLayers rollout flags.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
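
A rough sketch of what "bounded" recovery means in practice: each fallback runs at most once, so a bad batch cannot loop forever. This is not the engine's code. The callable signatures, the use of `ValueError` as the failure signal, the interpretation of Tier 1 rescue as a per-column deterministic fallback, and the simplified mapping to `B_SUCCESS` / `B_PARTIAL` / `B_FAILED` (the real logic also weighs critical-column coverage) are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Classify = Callable[[List[str]], Dict[str, str]]   # may raise ValueError
Rescue = Callable[[str], Optional[str]]            # assumed deterministic fallback


def classify_with_recovery(
    columns: List[str], classify: Classify, rescue: Rescue, max_retries: int = 1
) -> Tuple[str, Dict[str, str]]:
    def attempt(batch: List[str]) -> Optional[Dict[str, str]]:
        # Initial call plus a bounded number of retries.
        for _ in range(1 + max_retries):
            try:
                return classify(batch)
            except ValueError:
                continue
        return None

    resolved = attempt(columns)
    if resolved is None and len(columns) > 1:
        # One level of batch splitting; halves are not split again.
        mid = len(columns) // 2
        resolved = {}
        for half in (columns[:mid], columns[mid:]):
            resolved.update(attempt(half) or {})
    resolved = dict(resolved or {})

    # Rescue whatever is still missing, one column at a time.
    for col in columns:
        if col not in resolved:
            rescued = rescue(col)
            if rescued is not None:
                resolved[col] = rescued

    if all(col in resolved for col in columns):
        return "B_SUCCESS", resolved
    return ("B_PARTIAL" if resolved else "B_FAILED"), resolved
```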
Assertion dump/load for checkpoint comparison.
Structured diff with regression flagging.
TableTelemetry/PipelineTelemetry with milestone report builder.
13-table dev slice and 10-table holdout definitions.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Domain bias header with conflict handling.
Healthcare/generic semantic type inventories.
Vocabulary family hints for healthcare domain.
5 Stage A, 12 Stage B, 8 Stage C few-shot examples.
Holdout disjointness verified.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Adds `sema ingest` and `sema push` subcommands backed by a DuckDB staging
area. Parses cBioPortal clinical/MAF/timeline files and OMOP CDM DDL +
vocabulary CSVs into DuckDB, then pushes to Databricks via Arrow with
optional COPY INTO when a cloud staging URI is configured.

- `src/sema/ingest/`: cBioPortal + OMOP parsers, DuckDB staging, Databricks push
- `src/sema/cli_ingest.py`: click group wiring `ingest` and `push` commands
- `src/sema/models/config.py`: `IngestConfig`, `IngestOmopConfig`,
  `IngestDatabricksTargetConfig` with env-prefix settings
- `pyproject.toml`: adds `duckdb>=1.0.0` and `pyarrow>=14.0.0` runtime deps
- `.env.example`: documents INGEST_* env vars

Unit coverage across parsers, staging lifecycle, Databricks bridge
provisioning, and CLI wiring (63 tests).

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Makes rollout steps 2–6 of source-semantic-hardening executable with a
single command per step. Per-table assertion + telemetry dumps are
written when `eval_dump_dir` is set on BuildConfig; `slice_tables`
filters discovered work items to a named subset.

- `sema eval run --slice <yaml> --label <l> --output-dir <d>`: runs the
  pipeline on the slice and writes `<table>__<label>.json` + paired
  `<table>__<label>__telemetry.json` per table.
- `sema eval diff --baseline <d> --current <d>`: pairs dumps by table
  and aggregates semantic churn using the existing `diff_dumps` keyed
  on `(subject_ref, predicate)` — covers L2 and L3 assertions.
- `sema eval report --run <d> --label <l> [--baseline <d>]`: aggregates
  telemetry across tables and, if a baseline dir is given, folds in the
  churn summary — produces a milestone-ready JSON report.
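
To make the `(subject_ref, predicate)` keying concrete, here is a simplified stand-in for the diff step. It is not the existing `diff_dumps`: the function name `diff_by_key`, the dict-shaped assertions, and the `value` field are assumptions, and this version keeps only the last value when a key repeats.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (subject_ref, predicate)


def diff_by_key(baseline: List[dict], current: List[dict]) -> Dict[str, List[Key]]:
    """Bucket assertion keys into added / removed / changed."""
    def index(dump: List[dict]) -> Dict[Key, str]:
        return {(a["subject_ref"], a["predicate"]): a["value"] for a in dump}

    old, new = index(baseline), index(current)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Keying on the pair rather than on the value is what lets a reworded property name show up as "changed" instead of as one removal plus one addition.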

Wiring:
- `BuildConfig` gains `eval_dump_dir`, `eval_config_label`,
  `slice_tables`.
- `process_table` accepts `eval_dump_dir`/`eval_config_label` and calls
  the dump hook after successful runs; failures to dump are logged, not
  raised.
- `_run_pipeline_stages` now returns `(assertions, staged_output)` so
  `process_table` can extract telemetry without a second pass.
- `_discover_tables` filters via `_filter_work_items_to_slice` when
  `slice_tables` is set.

Runbook at `docs/runbooks/source-semantic-eval.md` lists the exact
commands for tasks 5.6–5.7, 7.7–7.8, 8.7–8.9, 9.8–9.9, and 10.1–10.5.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
TableTelemetry previously carried zeros for stage_*_latency_ms and
tokens_*. The staged engine's LLM calls were not measured and the
kwargs to from_stages() were never passed. Eval runs reported
avg_total_latency_ms: 0.0 and made the $0.10/table cost gate and
60s/table latency gate unverifiable.

- `LLMClient` gains an `InvocationStats` dataclass populated on every
  `invoke()`: wall-clock duration in ns, prompt/response char counts,
  and prompt/completion token counts (pulled from `usage_metadata` /
  `response_metadata` when present, else a ~4 chars/token estimate).
- `SemanticEngine.interpret_table_staged_with_metrics()` wraps the
  client's `invoke` for the duration of a staged run so that every
  batched Stage B call and every Stage C column call contributes to
  the table's `StageMetrics` (tokens per stage via accumulation,
  latency per stage via `time.monotonic_ns` bookends).
- `interpret_table_staged` preserved as a thin wrapper so existing
  callers are unchanged.
- `build_utils._run_semantic_interpretation` now threads the metrics
  through to `TableTelemetry.from_stages()`.
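
A minimal sketch of the measurement idea described above: time each call with `time.monotonic_ns` and fall back to a ~4 chars/token estimate when the provider reports no usage. The field names, the `timed_invoke` helper, the `(text, usage)` return shape, and the `input_tokens` / `output_tokens` keys are assumptions for illustration, not the actual `LLMClient` interface.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class InvocationStats:
    duration_ns: int
    prompt_chars: int
    response_chars: int
    prompt_tokens: int
    completion_tokens: int


def estimate_tokens(text: str) -> int:
    """Crude fallback: roughly four characters per token."""
    return len(text) // 4


def timed_invoke(
    llm: Callable[[str], Tuple[str, Optional[dict]]], prompt: str
) -> Tuple[str, InvocationStats]:
    start = time.monotonic_ns()
    text, usage = llm(prompt)  # hypothetical client returning (text, usage or None)
    duration = time.monotonic_ns() - start
    usage = usage or {}
    stats = InvocationStats(
        duration_ns=duration,
        prompt_chars=len(prompt),
        response_chars=len(text),
        # Prefer provider-reported counts; estimate only when they are missing.
        prompt_tokens=usage.get("input_tokens", estimate_tokens(prompt)),
        completion_tokens=usage.get("output_tokens", estimate_tokens(text)),
    )
    return text, stats
```

A monotonic clock is used rather than wall-clock time so that system clock adjustments during a long run cannot produce negative or inflated latencies.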

Verified end-to-end on 6-table dev slice: avg_total_latency_ms = 28533,
tokens input = 20983, tokens output = 23417 — well under the
cost/latency gates.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
…ortal ingest

Six tables actually loaded in workspace.cbioportal (patient, sample,
mutation, timeline_sample_acquisition, timeline_status,
timeline_treatment) — a subset of the 13 tables in dev_slice.yaml.
Used for initial rollout evaluation runs until full ingest lands.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Caught in the step-2 dev-slice eval: Stage B occasionally returned
column names with an embedded type spec (e.g. 'BIOTYPE (STRING)',
'age [INT]', 'patient_id: VARCHAR'). The merge step uses the column
field verbatim to build '<table_ref>.<col>' subject refs, so a single
noisy column silently severed the link between L2 property
assertions and the extractor's COLUMN_EXISTS assertions. The
downstream effect was a 'regression_risk' removal in the diff tool.

Adds `sanitize_column_name` (strips everything from the first
whitespace, paren, bracket, or colon onward) and applies it to every
StageBColumnResult returned by `_invoke_stage_b_batch` before it
reaches the merge or vocab-context builder. Because of LLM
non-determinism, some runs show no leak at all (the step 3 domain-aware
run had zero), but the fix is cheap insurance and costs nothing on
clean output.
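
One plausible way to implement the described stripping rule is a single regex search for the first delimiter. The body below is an assumed reimplementation for illustration; only the function name and the three leaked examples come from this commit message.

```python
import re

# First character that can start a leaked type spec: whitespace, "(", "[", or ":".
_TYPE_SPEC_START = re.compile(r"[\s(\[:]")


def sanitize_column_name(raw: str) -> str:
    """Drop everything from the first whitespace / paren / bracket / colon onward."""
    name = raw.strip()
    match = _TYPE_SPEC_START.search(name)
    return name[: match.start()] if match else name


for leaked in ("BIOTYPE (STRING)", "age [INT]", "patient_id: VARCHAR", "hugo_symbol"):
    print(sanitize_column_name(leaked))
# prints BIOTYPE, age, patient_id, hugo_symbol (one per line)
```

Note that this rule would also truncate a legitimate column name that contains a space, so it assumes source columns use identifier-style names.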

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Step 4 dev-slice eval showed 52 alias regressions vs step 3 (domain-
aware). Root cause: none of the 12 Stage B few-shot examples in
few_shot.py populated a `synonyms` field. The LLM imitated the
examples' empty-by-omission pattern and dropped aliases that step 3
was emitting.

Changes:
- Add realistic `synonyms` lists to 8 of 12 Stage B examples
  (patient_id, sample_id, gender, tmb, msi_type, hugo_symbol,
  variant_classification, agent, stage_highest). Examples without
  synonyms remain to demonstrate empty-is-valid.
- Switch `format_examples` to compact JSON (no indent) — recoups most
  of the token cost added by the synonyms.
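
The compact-JSON saving comes from dropping indentation and the spaces after separators. A small sketch of the difference, assuming a hypothetical example record (only the `synonyms` field name is taken from this commit; `render_examples` is an illustrative stand-in, not the real `format_examples`):

```python
import json
from typing import List


def render_examples(examples: List[dict], compact: bool = True) -> str:
    """Serialize few-shot examples, one JSON document per example."""
    if compact:
        # No indentation and no space after "," or ":".
        return "\n".join(json.dumps(e, separators=(",", ":")) for e in examples)
    return "\n".join(json.dumps(e, indent=2) for e in examples)


example = {
    "column": "gender",                 # hypothetical field
    "semantic_type": "categorical",     # hypothetical field
    "synonyms": ["sex", "patient sex"],
}
compact = render_examples([example])
pretty = render_examples([example], compact=False)
print(len(compact), len(pretty))  # the compact form is the shorter of the two
```

Both forms parse back to the same object, so the change affects prompt size only, not what the model is shown semantically.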

Measured impact on 6-table dev slice:
- Alias regression 52 → 16 (−69%)
- Output tokens 22,935 → 23,566 (+631, LLM restored alias emission)
- Input tokens 41,623 → 41,148 (−475, compact JSON)
- All 6 tables still B_SUCCESS with 100% coverage

The +17k input token bump from enabling few-shot in step 4 is the
fixed cost of including the full Stage A+B+C blocks in each of 18+
LLM calls per slice run — not a bug, just the price of few-shot.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Versions the per-table assertion dumps, telemetry dumps, diff reports,
and milestone reports produced during the source-semantic-hardening
rollout. These back up the task completion claims in
openspec/changes/source-semantic-hardening/tasks.md (which is in a
gitignored path) and serve as a reference baseline for future
evaluation runs.

Contents of eval-runs/:
- step2-baseline-single-pass/  # pre-decomposition reference
- step2-staged-zeroshot/        # A→B decomposition, zero-shot
- step3-domain-aware/           # + domain bias / type inventory / vocab hints
- step4-few-shot/               # + healthcare few-shot (post alias-fix)
- step5-stage-c/                # + Stage C value decoding (full pipeline)
- step{2,3,4,5}-diff.json       # churn summaries vs prior step
- step{2,3,4,5}-report.json     # per-step milestone reports
- end-to-end-diff.json          # baseline → full pipeline delta

Scope: the 6-table POC slice (eval/dev_slice_poc.yaml) reflecting
current Databricks ingest. Holdout and full-corpus runs blocked on
ingest of the remaining 27 cBioPortal tables — see §11-bis in
tasks.md.

eval-runs/*.log added to .gitignore (transient runtime output).

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
… parsers

Extends the cBioPortal ingest to cover five new file types, unlocking
the remaining dev-slice tables (structural_variant, cna,
gene_panel_matrix, resource_definition/patient, clinical_supp_*).

New parsers:
- parse_sv_file — data_sv.txt → structural_variant (position/
  entrez-gene-id columns typed as BIGINT via sv_column_type helper)
- parse_cna_file — data_cna.txt (gene×sample wide matrix) pivoted to
  long format with sample_id / hugo_symbol / entrez_gene_id /
  cna_value. Blank cells become nulls. cna_long_format_rows helper
  lives in cbioportal_utils.py.
- parse_gene_panel_matrix — data_gene_panel_matrix.txt as-is
- parse_resource_file — data_resource_*.txt (definition and
  per-patient/sample entries)
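
The wide-to-long pivot for `data_cna.txt` can be sketched as follows. This is an illustration under assumptions, not the `cna_long_format_rows` helper itself: it assumes a tab-separated file whose first two columns are the gene symbol and Entrez ID, one column per sample after that, and integer-valued cells.

```python
import csv
import io
from typing import Iterator, Optional, Tuple

LongRow = Tuple[str, str, Optional[int], Optional[int]]
# (sample_id, hugo_symbol, entrez_gene_id, cna_value)


def cna_long_rows(text: str) -> Iterator[LongRow]:
    """Yield one row per (gene, sample) cell; blank cells become None."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    sample_ids = header[2:]  # assumed layout: gene symbol, Entrez ID, then samples
    for row in reader:
        hugo, entrez = row[0], row[1]
        entrez_id = int(entrez) if entrez.strip() else None
        for sample_id, cell in zip(sample_ids, row[2:]):
            value = int(cell) if cell.strip() else None
            yield sample_id, hugo, entrez_id, value


wide = "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\nTP53\t7157\t-1\t\nEGFR\t1956\t2\t0\n"
rows = list(cna_long_rows(wide))
print(len(rows))  # prints 4
```

A generator keeps memory flat, which matters when genes × samples reaches millions of rows.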

Ingest orchestration:
- _should_download now allows data_sv.txt, data_cna.txt,
  data_gene_panel_matrix.txt, data_resource_*, data_clinical_supp_*
  via DOWNLOAD_EXACT_FILENAMES / DOWNLOAD_PREFIXES /
  EXCLUDED_DOWNLOAD_PREFIXES constants in cbioportal_utils.py
- SKIP_FILENAME_PATTERNS narrowed to only truly unsupported matrix
  files (expression, methylation, log2/linear/armlevel CNA, mrna,
  rppa)
- _ingest_study_dir wires three new fixed-file parsers
  (_try_ingest_fixed_files) plus prefix-matched passes for
  data_resource_* and data_clinical_supp_*
  (_ingest_prefix_matched_files)

Verified end-to-end against gbm_tcga_pan_can_atlas_2018: DuckDB now
holds 12 cbioportal tables including cna (14.4M long-format rows
pivoted from ~24k genes × ~600 samples), structural_variant (510
rows), gene_panel_matrix, resource_definition/patient, and
clinical_supp_hypoxia.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Now that the cBioPortal ingest has been extended to cover SV, CNA,
gene-panel matrix, resources, and clinical supplements, the dev slice
grows from the original 6-table POC (patient, sample, mutation, 3
timelines) to 12 tables sourced from gbm_tcga_pan_can_atlas_2018.

Full A→B→C staged pipeline results on all 12 tables:

- 12/12 B_SUCCESS, 100% raw and critical coverage across every table
- 0 retries, 0 splits, 0 rescues — zero recovery overhead
- 69 Stage C calls → 195 has_decoded_value assertions
- 259 has_property_name assertions (up from 222 on the 6-table slice)
- Avg latency 25.2s / table (peak 105s on mutation's 114 columns;
  the average is still under the 60s gate)
- Total cost $0.0160 for all 12 tables ($0.0013/table — 77× under
  the $0.10/table gate)

Spot-checks on the four new table types:
- structural_variant: correct entity "Structural Variant" with grain
  "one row per structural variant ... per sample"; Stage C correctly
  decoded in-frame vs frameshift mutation semantics
- cna (long format): 4 columns classified as sample_id /
  hugo_symbol / entrez_gene_id / cna_value, one Stage C call
- gene_panel_matrix, resource_definition, resource_patient: all
  identifier-heavy tables classified as expected

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
The staged A→B→C pipeline is proven on the 12-table dev slice and
becomes the sole L2 path. This commit rips out everything the rollout
kept around through step 6.

Removed from src/sema/engine/semantic.py:
- PropertyInterpretation and TableInterpretation (old response schemas)
- _PropertyBatchResult (two-pass batch schema)
- build_interpretation_prompt, build_simplified_interpretation_prompt
- build_summary_prompt, build_property_prompt
- _needs_two_pass, _interpret_two_pass
- _interpret_via_llm_client, _interpret_via_raw_llm
- _run_summary_pass, _run_property_pass
- _entity_assertions, _property_assertions,
  _interpretation_to_assertions
- SemanticEngine.__init__(..., llm=...) raw-LLM legacy kwarg

Removed files:
- src/sema/engine/semantic_utils.py (entire file — all legacy helpers)
- tests/unit/test_two_pass_semantic.py (legacy path tests)

Reshaped:
- SemanticEngine.interpret_table now delegates to
  interpret_table_staged_with_metrics and returns just assertions —
  one staged path for every table regardless of width
- pipeline.build_utils._run_semantic_interpretation drops the
  use_staged branch; always returns (assertions, _StagedOutput)
- pipeline.build._run_pipeline_stages returns
  (assertions, staged_output) unconditionally
- process_table, _spawn_workers*, and BuildConfig lose the
  use_staged flag; cli_eval drops --use-staged/--no-use-staged
- Tests updated to mock the staged sequence (StageAResult +
  StageBBatchResult) instead of TableInterpretation

Test suite: 1004/1004 passing, mypy clean on 94 source files.
Test count dropped from 1041 → 1004 (the 37 removed tests all
exercised the deprecated legacy path).

Follow-up not addressed here: semantic.py (520) and build_utils.py
(508) both exceed the project's 400-line file standard. They were
already over (745 and 514 pre-cleanup). Splitting them is a separate
refactor — the simplest next step is extracting
interpret_table_staged_with_metrics + the Stage A/B/C runners into
stage_utils.py, which shaves ~200 lines from semantic.py.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Validates that the Task 11 refactor (legacy L2 code removal, 17-file diff,
-1494 LOC) did not alter pipeline behavior.

Results vs pre-cleanup step 5 v2:
- 12/12 tables B_SUCCESS, 100% coverage, zero recovery
- 259 has_property_name (identical to pre-cleanup)
- 12 has_entity_name (identical)
- 140 vs 195 has_decoded_value, 62 vs 69 Stage C calls — stochastic
  LLM variation well within run-to-run noise
- Cost $0.016 identical; latency 278s vs 302s (8% faster, noise)
- Diff: 23 added / 22 removed — symmetric, indicates zero regression

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Full staged pipeline on the 12-table slice, Neo4j wiped first.

Pipeline:
- 12/12 tables B_SUCCESS @ 100% raw and critical coverage, zero recovery
- 12 entities, 259 properties, 174 decoded values, 81 Stage C calls
- 285s total / 23.8s avg, tokens 73,346 in + 34,614 out
- Cost $0.0159 ($0.0013/table, $0.00006/column) — 77× under budget

Neo4j state (3,755 nodes after materialization):
- Catalog/Schema/DataSource: 1 each
- Table: 12 ✓
- Entity: 12 (semantically correct: 'Biospecimen/Sample', 'Copy Number
  Alteration', 'Somatic Mutation', 'Structural Variant', 'Patient Hypoxia
  Assessment', 'Patient Status Event', 'Sample Acquisition Event',
  'Sample Genomic Profile Availability', 'Treatment Event', etc.)
- Column: 259 ✓
- Property: 259 ✓
- ValueSet: 150 / Term: 290 (from Stage C)
- Alias: 452 / Vocabulary: 143 (from L3)
- Assertion provenance: 2,175
- Edges: HAS_PROPERTY, PROPERTY_ON_COLUMN, ENTITY_ON_TABLE,
  CLASSIFIED_AS, HAS_VALUE_SET, MEMBER_OF, REFERS_TO — all present

Diff vs pre-cleanup baseline (step5-stage-c-v2):
- 45 added, 24 removed, 678 changed
- Added: 18 aliases + 27 decoded values (Stage C picked more columns)
- Removed: 14 decoded values + 10 aliases (LLM variation)
- Zero high-value regressions (no property_name / semantic_type /
  entity_name losses)

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
@deanban deanban force-pushed the dean/feat/source-semantic-hardening branch from 174fe41 to a047c78 Compare April 21, 2026 17:02
@deanban deanban merged commit aed40d3 into main Apr 21, 2026
3 checks passed
@deanban deanban deleted the dean/feat/source-semantic-hardening branch April 21, 2026 20:10