feat: source semantic hardening — A→B→C staged L2 with domain + few-shot by deanban · Pull Request #63 · Nine-Sigma/sema

deanban · 2026-04-21T16:55:07Z

Summary

Replaces the monolithic single-pass / two-pass L2 semantic interpretation with a staged A→B→C→merge pipeline, adds a domain layer (healthcare first), a few-shot example library, and an evaluation harness.

69 files (excluding eval artifacts), +10,468 / −1,427 lines of code + tests
1004 unit tests, 87% coverage, mypy clean
Closes rollout steps 1–5 + 7 (cleanup) of the source-semantic-hardening OpenSpec change; step 6 / holdout bias validated on the 12-table POC slice only, full 33-table corpus gated on the remaining cBioPortal ingest

What's in it

Staged L2 pipeline (§2–4)

StageAResult, StageBColumnResult, StageBBatchResult, StageBResult, StageCResult, StageCBatchResult, StageStatus schemas (src/sema/models/stages.py)
Stage A: entity + grain hypothesis
Stage B: property classification with bounded recovery (retry, batch-split, Tier-1 rescue) and B_SUCCESS / B_PARTIAL / B_FAILED outcomes
Stage C: conditional value decoding with deterministic trigger (skips identifiers / timestamps / free-text / unresolved B columns)
Merge step with explicit ownership matrix — A proposes entity, B owns property/semantic_type/alias, C exclusively owns has_decoded_value; no vocabulary_match emitted from L2
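
The ownership matrix can be read as a lookup from predicate to the one stage allowed to emit it. The following is a minimal sketch under stated assumptions, not the repo's merge code: `has_semantic_type` and `has_alias` are hypothetical predicate names, and `Candidate` and `merge` are illustrative helpers.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical ownership table: predicate -> the only stage allowed to emit it.
# "vocabulary_match" is deliberately absent, so L2 can never emit it.
OWNERSHIP = {
    "has_entity_name": "A",
    "has_property_name": "B",
    "has_semantic_type": "B",   # assumed predicate name
    "has_alias": "B",           # assumed predicate name
    "has_decoded_value": "C",
}


@dataclass(frozen=True)
class Candidate:
    stage: str         # "A", "B" or "C"
    subject_ref: str   # e.g. "<table_ref>.<col>"
    predicate: str
    value: str


def merge(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Keep only assertions emitted by the stage that owns the predicate."""
    return [c for c in candidates if OWNERSHIP.get(c.predicate) == c.stage]
```

With this shape, a decoded value proposed by Stage B or a `vocabulary_match` from any stage is dropped at merge time rather than relying on each stage to behave.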

Domain layer (§1, §7)

DomainContext, DomainCandidate models + --domain CLI flag + config + profiler-based detection (src/sema/models/domain.py)
Precedence: CLI > config > profiler > default
Domain-aware prompt composition: healthcare vs. generic semantic type inventories, vocabulary family hints, dual-domain softened headers on conflict
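
The precedence rule can be sketched as a first-non-empty lookup that still carries the profiler's guess along. This is an illustration only: `ResolvedDomain`, `resolve_domain`, and the `"generic"` default name are assumptions, not the fields of the actual `DomainContext` model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedDomain:
    name: str
    source: str                        # "cli", "config", "profiler" or "default"
    profiler_evidence: Optional[str]   # kept even when CLI/config overrides it


def resolve_domain(
    cli: Optional[str] = None,
    config: Optional[str] = None,
    profiler: Optional[str] = None,
    default: str = "generic",          # assumed default domain name
) -> ResolvedDomain:
    """Apply CLI > config > profiler > default, preserving profiler evidence."""
    for source, value in (("cli", cli), ("config", config), ("profiler", profiler)):
        if value:
            return ResolvedDomain(value, source, profiler)
    return ResolvedDomain(default, "default", profiler)
```

Keeping the profiler value on the result is what lets a later step notice a conflict between an explicit override and the detected domain.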

Few-shot library (§8)

5 Stage A examples, 12 Stage B column examples, 8 Stage C decoding examples for healthcare
Fixes LLM alias-dropping regression by including realistic synonyms in examples (`783266d`)

Evaluation harness (§5–6)

`sema eval` CLI: dev-slice runner, structured diff, telemetry aggregator, milestone report
Per-stage telemetry: call counts, latencies, tokens, recovery metrics, B-outcome distribution, C trigger rate
Dev slice (`eval/dev_slice.yaml`) + holdout (`eval/holdout.yaml`) versioned in repo

Cleanup (§11)

Removed `PropertyInterpretation`, `TableInterpretation`, `_PropertyBatchResult`, `build_interpretation_prompt`, `build_simplified_interpretation_prompt`, `_needs_two_pass`, and all single-pass / two-pass scaffolding
Deleted `src/sema/engine/semantic_utils.py`; `SemanticEngine.interpret_table` reduced to thin staged wrapper
`BuildConfig.use_staged` removed — staged is the sole path

Milestone results (12-table POC slice)

See `eval-runs/step6-milestone-summary.md`.

metric	value	budget	status
B outcome distribution	12 success / 0 partial / 0 failed	—	PASS
Raw / critical coverage	100% / 100%	—	PASS
Stage C trigger rate	30.7% avg (95/259 cols)	—	—
Recovery overhead	0 retries, 0 splits, 0 rescues	—	—
Cost (DeepSeek)	$0.0048 / table	$0.10 / table	PASS (21× under)
Latency	23.1 s / table	60 s / table	PASS (2.6× under)
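
The "× under" figures in the status column are the budget divided by the measured value. A small check of that arithmetic, using only the numbers from the table (the `headroom` helper is illustrative):

```python
def headroom(measured: float, budget: float) -> float:
    """Return how many times the measured value fits into the budget."""
    return budget / measured


cost = headroom(measured=0.0048, budget=0.10)    # dollars per table
latency = headroom(measured=23.1, budget=60.0)   # seconds per table

print(f"cost: {cost:.1f}x under budget, latency: {latency:.1f}x under budget")
```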

Every removal cluster flagged across the 5-step rollout is root-caused and either design-intended (`vocabulary_match` → L3 per design §2a; `has_decoded_value` restored at step 5) or fixed (`BIOTYPE (STRING)` column-name leak in `46384de`, alias regression in `783266d`). No open systemic regressions; no high-value predicates lost.

What's still open

All blocked on ingesting the remaining ~21 cBioPortal tables (see `§11-bis Pending ingest` in `openspec/changes/source-semantic-hardening/tasks.md`):

10.1 Run full 33-table corpus
10.4 / 8.8 Holdout-vs-dev-slice bias check (8 of 10 holdout tables not ingested; 2 contaminated)

Known issue discovered during spot-check

`patient.SUBTYPE=GBM_IDHmut-non-codel` gets decoded by Stage C as `"Glioblastoma, IDH mutant, non-codisplayed (non-codel)"`. "Non-codisplayed" is an LLM hallucination — the correct clinical term is non-codeleted (1p/19q codeletion status in glioma classification). Low-severity single-assertion issue; points to a real gap — Stage C emits a `codebook_lookup_needed` flag but nothing consumes it. Will be filed as a follow-up issue.

Test plan

`uv run pytest` — 1004 passed, 1 skipped, 87% coverage
`uv run mypy src/sema/` — clean on 94 source files
Dev-slice run on 12 tables (`eval-runs/step5-post-cleanup/`) — no systemic regressions vs. pre-cleanup
Spot-check on 6 of 12 tables — entity names, property names, semantic types, decoded values all sensible except the single GBM hallucination noted above
Full 33-table corpus run — blocked on Databricks ingest
Holdout bias check — blocked on ingest

deanban · 2026-04-21T17:00:39Z

Scope note: The two unchecked items in the test plan —

Full 33-table cBioPortal corpus run
Holdout-vs-dev-slice bias check

— are blocked on Databricks reactivation and ingestion of the remaining ~21 cBioPortal tables (see §11-bis Pending ingest in tasks.md). They will be handled in a subsequent PR, tracked as issue #72. This PR is scoped to the 12-table POC slice sign-off.

StageAResult, StageBColumnResult, StageBBatchResult, StageCResult, StageCBatchResult, StageStatus, UnresolvedColumn models. Stage A/B prompt builders with domain context slots. Critical column identification, coverage computation, B pass/fail logic. Stage C deterministic trigger with low-cardinality fallback. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
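
A deterministic Stage C trigger of the kind described here could look like the sketch below. The semantic-type labels, the cardinality threshold, and the reading of "low-cardinality fallback" (decode any remaining column with few distinct values) are all assumptions made for illustration.

```python
from typing import Optional

# Assumed semantic-type labels; the real inventories live in the repo's prompts.
SKIP_TYPES = frozenset({"identifier", "timestamp", "free_text"})
DECODE_TYPES = frozenset({"categorical", "coded_value"})
LOW_CARDINALITY_MAX = 25  # hypothetical threshold for the fallback


def should_trigger_stage_c(
    semantic_type: str,
    distinct_count: Optional[int],
    resolved_by_b: bool = True,
) -> bool:
    """Decide, without calling an LLM, whether a column goes to Stage C."""
    if not resolved_by_b:
        return False            # unresolved B columns are never decoded
    if semantic_type in SKIP_TYPES:
        return False            # identifiers, timestamps, free text
    if semantic_type in DECODE_TYPES:
        return True
    # Fallback: any other type is still decoded if it has few distinct values.
    return distinct_count is not None and distinct_count <= LOW_CARDINALITY_MAX
```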

interpret_table_staged() runs full A→B→C→merge. Merge ownership matrix: A=entity, B=property, C=decoded values. Bounded B recovery: retry, split, Tier 1 rescue. semantic_unresolved produced for low-confidence ambiguous columns. VocabColumnContext enriched with B output at version 1. use_staged=True default with PromptLayers rollout flags. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
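
The bounded recovery ladder (retry, split, rescue) can be sketched as follows. This is a guess at the shape, not the engine's code: `classify` stands in for a Stage B batch call that raises on failure, "Tier 1 rescue" is interpreted as a single-column call for critical columns, and the mapping to `B_SUCCESS` / `B_PARTIAL` / `B_FAILED` is an assumption.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Classifier = Callable[[List[str]], Dict[str, str]]  # raises on failure


def _attempt(classify: Classifier, batch: List[str], max_retries: int) -> Optional[Dict[str, str]]:
    """Call classify up to 1 + max_retries times; None if every call fails."""
    for _ in range(1 + max_retries):
        try:
            return classify(batch)
        except Exception:
            continue
    return None


def classify_with_recovery(
    columns: List[str],
    critical: Set[str],
    classify: Classifier,
    max_retries: int = 1,
) -> Tuple[str, Dict[str, str]]:
    resolved: Dict[str, str] = {}
    whole = _attempt(classify, columns, max_retries)
    if whole is not None:
        resolved.update(whole)
    elif len(columns) > 1:
        # One level of batch splitting; no recursion, so the work stays bounded.
        mid = len(columns) // 2
        for half in (columns[:mid], columns[mid:]):
            part = _attempt(classify, half, max_retries)
            if part is not None:
                resolved.update(part)
    # Rescue pass (assumed meaning): one single-column call per missing critical column.
    for col in columns:
        if col in critical and col not in resolved:
            part = _attempt(classify, [col], max_retries)
            if part is not None:
                resolved.update(part)
    if all(c in resolved for c in columns):
        outcome = "B_SUCCESS"
    elif all(c in resolved for c in critical):
        outcome = "B_PARTIAL"
    else:
        outcome = "B_FAILED"
    return outcome, resolved
```

Because retries are capped and the batch is split only once, the worst-case number of LLM calls per table is fixed in advance.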

Assertion dump/load for checkpoint comparison. Structured diff with regression flagging. TableTelemetry/PipelineTelemetry with milestone report builder. 13-table dev slice and 10-table holdout definitions. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Domain bias header with conflict handling. Healthcare/generic semantic type inventories. Vocabulary family hints for healthcare domain. 5 Stage A, 12 Stage B, 8 Stage C few-shot examples. Holdout disjointness verified. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Adds `sema ingest` and `sema push` subcommands backed by a DuckDB staging area. Parses cBioPortal clinical/MAF/timeline files and OMOP CDM DDL + vocabulary CSVs into DuckDB, then pushes to Databricks via Arrow with optional COPY INTO when a cloud staging URI is configured.

- `src/sema/ingest/`: cBioPortal + OMOP parsers, DuckDB staging, Databricks push
- `src/sema/cli_ingest.py`: click group wiring `ingest` and `push` commands
- `src/sema/models/config.py`: `IngestConfig`, `IngestOmopConfig`, `IngestDatabricksTargetConfig` with env-prefix settings
- `pyproject.toml`: adds `duckdb>=1.0.0` and `pyarrow>=14.0.0` runtime deps
- `.env.example`: documents INGEST_* env vars

Unit coverage across parsers, staging lifecycle, Databricks bridge provisioning, and CLI wiring (63 tests).

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Makes rollout steps 2–6 of source-semantic-hardening executable with a single command per step. Per-table assertion + telemetry dumps are written when `eval_dump_dir` is set on BuildConfig; `slice_tables` filters discovered work items to a named subset.

- `sema eval run --slice <yaml> --label <l> --output-dir <d>`: runs the pipeline on the slice and writes `<table>__<label>.json` + paired `<table>__<label>__telemetry.json` per table.
- `sema eval diff --baseline <d> --current <d>`: pairs dumps by table and aggregates semantic churn using the existing `diff_dumps` keyed on `(subject_ref, predicate)`; covers L2 and L3 assertions.
- `sema eval report --run <d> --label <l> [--baseline <d>]`: aggregates telemetry across tables and, if a baseline dir is given, folds in the churn summary; produces a milestone-ready JSON report.

Wiring:

- `BuildConfig` gains `eval_dump_dir`, `eval_config_label`, `slice_tables`.
- `process_table` accepts `eval_dump_dir`/`eval_config_label` and calls the dump hook after successful runs; failures to dump are logged, not raised.
- `_run_pipeline_stages` now returns `(assertions, staged_output)` so `process_table` can extract telemetry without a second pass.
- `_discover_tables` filters via `_filter_work_items_to_slice` when `slice_tables` is set.

Runbook at `docs/runbooks/source-semantic-eval.md` lists the exact commands for tasks 5.6–5.7, 7.7–7.8, 8.7–8.9, 9.8–9.9, and 10.1–10.5.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
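
A diff keyed on `(subject_ref, predicate)` reduces to set operations over two indexes. The sketch below is not the repo's `diff_dumps`; `diff_assertions` and the `value` field are hypothetical, and it assumes one value per key (multi-valued predicates such as aliases would need a list or set as the value).

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (subject_ref, predicate)


def _index(rows: Iterable[dict]) -> Dict[Key, str]:
    """Index assertion rows by (subject_ref, predicate)."""
    return {(r["subject_ref"], r["predicate"]): r["value"] for r in rows}


def diff_assertions(baseline: Iterable[dict], current: Iterable[dict]) -> Dict[str, List[Key]]:
    old, new = _index(baseline), _index(current)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Keying on the subject reference is also why a column name that differs between stages shows up as a removal plus an addition rather than as a change.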

TableTelemetry previously carried zeros for stage_*_latency_ms and tokens_*. The staged engine's LLM calls were not measured and the kwargs to from_stages() were never passed. Eval runs reported avg_total_latency_ms: 0.0 and made the $0.10/table cost gate and 60s/table latency gate unverifiable.

- `LLMClient` gains an `InvocationStats` dataclass populated on every `invoke()`: wall-clock duration in ns, prompt/response char counts, and prompt/completion token counts (pulled from `usage_metadata` / `response_metadata` when present, else a ~4 chars/token estimate).
- `SemanticEngine.interpret_table_staged_with_metrics()` wraps the client's `invoke` for the duration of a staged run so that every batched Stage B call and every Stage C column call contributes to the table's `StageMetrics` (tokens per stage via accumulation, latency per stage via `time.monotonic_ns` bookends).
- `interpret_table_staged` preserved as a thin wrapper so existing callers are unchanged.
- `build_utils._run_semantic_interpretation` now threads the metrics through to `TableTelemetry.from_stages()`.

Verified end-to-end on 6-table dev slice: avg_total_latency_ms = 28533, tokens input = 20983, tokens output = 23417, well under the cost/latency gates.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
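
The measurement described above can be sketched as a wrapper around a single call. The field names, `measured_invoke`, the `(text, usage)` return shape of `invoke`, and the `input_tokens` / `output_tokens` keys are assumptions for illustration; only the `time.monotonic_ns` bookends and the roughly 4 chars/token fallback come from the commit message.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

CHARS_PER_TOKEN = 4  # rough fallback when the provider reports no usage


@dataclass(frozen=True)
class InvocationStats:
    duration_ns: int
    prompt_chars: int
    response_chars: int
    prompt_tokens: int
    completion_tokens: int


def estimate_tokens(text: str) -> int:
    """Ceiling of len(text) / 4; 0 for an empty string."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def measured_invoke(
    invoke: Callable[[str], Tuple[str, Optional[dict]]],
    prompt: str,
) -> Tuple[str, InvocationStats]:
    start = time.monotonic_ns()
    text, usage = invoke(prompt)          # hypothetical client: (text, usage or None)
    duration_ns = time.monotonic_ns() - start
    usage = usage or {}
    stats = InvocationStats(
        duration_ns=duration_ns,
        prompt_chars=len(prompt),
        response_chars=len(text),
        # Prefer provider-reported counts; otherwise estimate from characters.
        prompt_tokens=usage.get("input_tokens", estimate_tokens(prompt)),
        completion_tokens=usage.get("output_tokens", estimate_tokens(text)),
    )
    return text, stats
```

A monotonic clock is used so that wall-clock adjustments during a long run cannot produce negative or inflated latencies.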

…ortal ingest Six tables actually loaded in workspace.cbioportal (patient, sample, mutation, timeline_sample_acquisition, timeline_status, timeline_treatment) — a subset of the 13 tables in dev_slice.yaml. Used for initial rollout evaluation runs until full ingest lands. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Caught in the step-2 dev-slice eval: Stage B occasionally returned column names with an embedded type spec (e.g. 'BIOTYPE (STRING)', 'age [INT]', 'patient_id: VARCHAR'). The merge step uses the column field verbatim to build '<table_ref>.<col>' subject refs, so a single noisy column silently severed the link between L2 property assertions and the extractor's COLUMN_EXISTS assertions. The downstream effect was a 'regression_risk' removal in the diff tool. Adds `sanitize_column_name` (strips the first whitespace / paren / bracket / colon onward) and applies it to every StageBColumnResult returned by `_invoke_stage_b_batch` before it reaches the merge or vocab-context builder. LLM non-determinism occasionally skips the leak entirely (step 3 domain-aware had zero) but the fix is cheap insurance and costs nothing on clean output. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
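
A function with the described behavior (cut at the first whitespace, paren, bracket, or colon) can be written with one regular expression. This is a re-implementation from the description above, not the code from the commit, and it assumes real column names never contain those characters.

```python
import re

# Everything from the first whitespace, "(", "[" or ":" to the end of the string.
_TYPE_SUFFIX = re.compile(r"[\s(\[:].*$", re.DOTALL)


def sanitize_column_name(raw: str) -> str:
    """Strip an embedded type spec that the LLM appended to a column name."""
    return _TYPE_SUFFIX.sub("", raw.strip())


for noisy in ("BIOTYPE (STRING)", "age [INT]", "patient_id: VARCHAR", "hugo_symbol"):
    print(f"{noisy!r} -> {sanitize_column_name(noisy)!r}")
```

Clean names pass through unchanged, which is why applying the function to every Stage B result is safe.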

Step 4 dev-slice eval showed 52 alias regressions vs step 3 (domain-aware). Root cause: none of the 12 Stage B few-shot examples in few_shot.py populated a `synonyms` field. The LLM imitated the examples' empty-by-omission pattern and dropped aliases that step 3 was emitting.

Changes:

- Add realistic `synonyms` lists to 8 of 12 Stage B examples (patient_id, sample_id, gender, tmb, msi_type, hugo_symbol, variant_classification, agent, stage_highest). Examples without synonyms remain to demonstrate empty-is-valid.
- Switch `format_examples` to compact JSON (no indent), which recoups most of the token cost added by the synonyms.

Measured impact on 6-table dev slice:

- Alias regression 52 → 16 (−69%)
- Output tokens 22,935 → 23,566 (+631, LLM restored alias emission)
- Input tokens 41,623 → 41,148 (−475, compact JSON)
- All 6 tables still B_SUCCESS with 100% coverage

The +17k input token bump from enabling few-shot in step 4 is the fixed cost of including the full Stage A+B+C blocks in each of 18+ LLM calls per slice run: not a bug, just the price of few-shot.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
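
The saving from compact JSON comes from dropping indentation and the spaces after separators. A small illustration with a made-up example record (only the `synonyms` field name is taken from the commit message; the other keys and values are hypothetical):

```python
import json

# Hypothetical few-shot example; only the "synonyms" key is named in the source.
example = {
    "column": "gender",
    "semantic_type": "categorical",
    "synonyms": ["sex", "patient_sex"],
}

pretty = json.dumps(example, indent=2)
compact = json.dumps(example, separators=(",", ":"))  # no indent, no padding

print(len(pretty), len(compact))
print(compact)
```

Both strings parse back to the same object, so the change affects prompt size but not the information the model sees.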

Versions the per-table assertion dumps, telemetry dumps, diff reports, and milestone reports produced during the source-semantic-hardening rollout. These back up the task completion claims in openspec/changes/source-semantic-hardening/tasks.md (which is in a gitignored path) and serve as a reference baseline for future evaluation runs.

Contents of eval-runs/:

- step2-baseline-single-pass/ # pre-decomposition reference
- step2-staged-zeroshot/ # A→B decomposition, zero-shot
- step3-domain-aware/ # + domain bias / type inventory / vocab hints
- step4-few-shot/ # + healthcare few-shot (post alias-fix)
- step5-stage-c/ # + Stage C value decoding (full pipeline)
- step{2,3,4,5}-diff.json # churn summaries vs prior step
- step{2,3,4,5}-report.json # per-step milestone reports
- end-to-end-diff.json # baseline → full pipeline delta

Scope: the 6-table POC slice (eval/dev_slice_poc.yaml) reflecting current Databricks ingest. Holdout and full-corpus runs blocked on ingest of the remaining 27 cBioPortal tables; see §11-bis in tasks.md.

eval-runs/*.log added to .gitignore (transient runtime output).

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

… parsers

Extends the cBioPortal ingest to cover five new file types, unlocking the remaining dev-slice tables (structural_variant, cna, gene_panel_matrix, resource_definition/patient, clinical_supp_*).

New parsers:

- parse_sv_file: data_sv.txt → structural_variant (position/entrez-gene-id columns typed as BIGINT via sv_column_type helper)
- parse_cna_file: data_cna.txt (gene×sample wide matrix) pivoted to long format with sample_id / hugo_symbol / entrez_gene_id / cna_value. Blank cells become nulls. cna_long_format_rows helper lives in cbioportal_utils.py.
- parse_gene_panel_matrix: data_gene_panel_matrix.txt as-is
- parse_resource_file: data_resource_*.txt (definition and per-patient/sample entries)

Ingest orchestration:

- _should_download now allows data_sv.txt, data_cna.txt, data_gene_panel_matrix.txt, data_resource_*, data_clinical_supp_* via DOWNLOAD_EXACT_FILENAMES / DOWNLOAD_PREFIXES / EXCLUDED_DOWNLOAD_PREFIXES constants in cbioportal_utils.py
- SKIP_FILENAME_PATTERNS narrowed to only truly unsupported matrix files (expression, methylation, log2/linear/armlevel CNA, mrna, rppa)
- _ingest_study_dir wires three new fixed-file parsers (_try_ingest_fixed_files) plus prefix-matched passes for data_resource_* and data_clinical_supp_* (_ingest_prefix_matched_files)

Verified end-to-end against gbm_tcga_pan_can_atlas_2018: DuckDB now holds 12 cbioportal tables including cna (14.4M long-format rows pivoted from ~24k genes × ~600 samples), structural_variant (510 rows), gene_panel_matrix, resource_definition/patient, and clinical_supp_hypoxia.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
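
The wide-to-long pivot for the CNA matrix can be sketched with the standard library. This is not the repo's `cna_long_format_rows`; it assumes a tab-separated file whose first two columns are `Hugo_Symbol` and `Entrez_Gene_Id` followed by one column per sample, and that cell values are integers.

```python
import csv
import io
from typing import Dict, Iterator, Optional, Union

Row = Dict[str, Union[str, Optional[int]]]


def cna_wide_to_long(text: str) -> Iterator[Row]:
    """Yield one long-format row per (gene, sample) cell of a wide CNA matrix."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    sample_ids = header[2:]               # assumed layout: gene columns first
    for row in reader:
        hugo, entrez = row[0], row[1]
        for sample_id, cell in zip(sample_ids, row[2:]):
            yield {
                "sample_id": sample_id,
                "hugo_symbol": hugo,
                "entrez_gene_id": int(entrez) if entrez.strip() else None,
                "cna_value": int(cell) if cell.strip() else None,  # blank -> null
            }


sample = "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\nTP53\t7157\t-1\t\nEGFR\t1956\t2\t0\n"
for long_row in cna_wide_to_long(sample):
    print(long_row)
```

The row count is genes × samples, which matches the order of magnitude quoted above (~24k genes × ~600 samples ≈ 14.4M rows).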

Now that the cBioPortal ingest has been extended to cover SV, CNA, gene-panel matrix, resources, and clinical supplements, the dev slice grows from the original 6-table POC (patient, sample, mutation, 3 timelines) to 12 tables sourced from gbm_tcga_pan_can_atlas_2018.

Full A→B→C staged pipeline results on all 12 tables:

- 12/12 B_SUCCESS, 100% raw and critical coverage across every table
- 0 retries, 0 splits, 0 rescues (zero recovery overhead)
- 69 Stage C calls → 195 has_decoded_value assertions
- 259 has_property_name assertions (up from 222 on the 6-table slice)
- Avg latency 25.2s / table, under the 60s gate on average (mutation, with 114 columns, peaks at 105s)
- Total cost $0.0160 for all 12 tables ($0.0013/table, 77× under the $0.10/table gate)

Spot-checks on the four new table types:

- structural_variant: correct entity "Structural Variant" with grain "one row per structural variant ... per sample"; Stage C correctly decoded in-frame vs frameshift mutation semantics
- cna (long format): 4 columns classified as sample_id / hugo_symbol / entrez_gene_id / cna_value, one Stage C call
- gene_panel_matrix, resource_definition, resource_patient: all identifier-heavy tables classified as expected

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

The staged A→B→C pipeline is proven on the 12-table dev slice and becomes the sole L2 path. Ripping out everything the rollout kept around through step 6.

Removed from src/sema/engine/semantic.py:

- PropertyInterpretation and TableInterpretation (old response schemas)
- _PropertyBatchResult (two-pass batch schema)
- build_interpretation_prompt, build_simplified_interpretation_prompt
- build_summary_prompt, build_property_prompt
- _needs_two_pass, _interpret_two_pass
- _interpret_via_llm_client, _interpret_via_raw_llm
- _run_summary_pass, _run_property_pass
- _entity_assertions, _property_assertions, _interpretation_to_assertions
- SemanticEngine.__init__(..., llm=...) raw-LLM legacy kwarg

Removed files:

- src/sema/engine/semantic_utils.py (entire file, all legacy helpers)
- tests/unit/test_two_pass_semantic.py (legacy path tests)

Reshaped:

- SemanticEngine.interpret_table now delegates to interpret_table_staged_with_metrics and returns just assertions: one staged path for every table regardless of width
- pipeline.build_utils._run_semantic_interpretation drops the use_staged branch; always returns (assertions, _StagedOutput)
- pipeline.build._run_pipeline_stages returns (assertions, staged_output) unconditionally
- process_table, _spawn_workers*, and BuildConfig lose the use_staged flag; cli_eval drops --use-staged/--no-use-staged
- Tests updated to mock the staged sequence (StageAResult + StageBBatchResult) instead of TableInterpretation

Test suite: 1004/1004 passing, mypy clean on 94 source files. Test count dropped from 1041 → 1004 (the 37 removed tests all exercised the deprecated legacy path).

Follow-up not addressed here: semantic.py (520) and build_utils.py (508) both exceed the project's 400-line file standard. They were already over (745 and 514 pre-cleanup). Splitting them is a separate refactor; the simplest next step is extracting interpret_table_staged_with_metrics + the Stage A/B/C runners into stage_utils.py, which shaves ~200 lines from semantic.py.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Validates that the Task 11 refactor (legacy L2 code removal, 17-file diff, -1494 LOC) did not alter pipeline behavior.

Results vs pre-cleanup step 5 v2:

- 12/12 tables B_SUCCESS, 100% coverage, zero recovery
- 259 has_property_name (identical to pre-cleanup)
- 12 has_entity_name (identical)
- 140 vs 195 has_decoded_value, 62 vs 69 Stage C calls: stochastic LLM variation well within run-to-run noise
- Cost $0.016 identical; latency 278s vs 302s (8% faster, noise)
- Diff: 23 added / 22 removed, roughly symmetric churn consistent with no regression

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Full staged pipeline on the 12-table slice, Neo4j wiped first.

Pipeline:

- 12/12 tables B_SUCCESS @ 100% raw and critical coverage, zero recovery
- 12 entities, 259 properties, 174 decoded values, 81 Stage C calls
- 285s total / 23.8s avg, tokens 73,346 in + 34,614 out
- Cost $0.0159 ($0.0013/table, $0.00006/column), 77× under budget

Neo4j state (3,755 nodes after materialization):

- Catalog/Schema/DataSource: 1 each
- Table: 12 ✓
- Entity: 12 (semantically correct: 'Biospecimen/Sample', 'Copy Number Alteration', 'Somatic Mutation', 'Structural Variant', 'Patient Hypoxia Assessment', 'Patient Status Event', 'Sample Acquisition Event', 'Sample Genomic Profile Availability', 'Treatment Event', etc.)
- Column: 259 ✓
- Property: 259 ✓
- ValueSet: 150 / Term: 290 (from Stage C)
- Alias: 452 / Vocabulary: 143 (from L3)
- Assertion provenance: 2,175
- Edges: HAS_PROPERTY, PROPERTY_ON_COLUMN, ENTITY_ON_TABLE, CLASSIFIED_AS, HAS_VALUE_SET, MEMBER_OF, REFERS_TO (all present)

Diff vs pre-cleanup baseline (step5-stage-c-v2):

- 45 added, 24 removed, 678 changed
- Added: 18 aliases + 27 decoded values (Stage C picked more columns)
- Removed: 14 decoded values + 10 aliases (LLM variation)
- Zero high-value regressions (no property_name / semantic_type / entity_name losses)

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

deanban added 19 commits April 21, 2026 13:02

feat: add DomainContext model, CLI flag, and pipeline wiring

37178f9

Domain precedence: CLI > config > profiler > default. Profiler evidence preserved when CLI/config overrides. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

test: add Stage C trigger, execution, merge, and partial failure tests

7beea0a

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

chore: gitignore .wolf/ OpenWolf context directory

b2577f7

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

docs(eval): add step 6 milestone summary for 12-table POC slice

a047c78

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

deanban force-pushed the dean/feat/source-semantic-hardening branch from 174fe41 to a047c78 Compare April 21, 2026 17:02

deanban mentioned this pull request Apr 21, 2026

refactor: extract cBioPortal to showcase/ and add generic few-shot base #74

Merged

6 tasks

deanban merged commit aed40d3 into main Apr 21, 2026
3 checks passed

deanban deleted the dean/feat/source-semantic-hardening branch April 21, 2026 20:10

deanban mentioned this pull request Apr 23, 2026

feat: native Databricks Mosaic AI provider + cBioPortal showcase refactor #80

Merged

10 tasks

deanban merged 20 commits into main from dean/feat/source-semantic-hardening
