feat: source semantic hardening — A→B→C staged L2 with domain + few-shot#63
Merged
This was referenced Apr 21, 2026
Scope note: The two unchecked items in the test plan are blocked on Databricks reactivation and ingestion of the remaining ~21 cBioPortal tables (see …).
Domain precedence: CLI > config > profiler > default. Profiler evidence is preserved when CLI/config overrides.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
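The precedence rule above can be sketched as follows. This is a minimal illustration, not the project's code: `ResolvedDomain`, `resolve_domain`, and the `"generic"` default are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedDomain:
    """Hypothetical result type: the chosen domain plus where it came from."""
    name: str
    source: str                       # "cli", "config", "profiler", or "default"
    profiler_evidence: Optional[str]  # kept even when a higher-priority source wins


def resolve_domain(cli: Optional[str] = None,
                   config: Optional[str] = None,
                   profiler: Optional[str] = None,
                   default: str = "generic") -> ResolvedDomain:
    """Pick the first non-empty value in the order CLI > config > profiler > default."""
    for source, value in (("cli", cli), ("config", config), ("profiler", profiler)):
        if value:
            return ResolvedDomain(value, source, profiler)
    # Nothing was supplied or detected: fall back to the (assumed) default domain.
    return ResolvedDomain(default, "default", profiler)


overridden = resolve_domain(cli="healthcare", profiler="finance")
detected = resolve_domain(profiler="healthcare")
fallback = resolve_domain()
```

In `overridden`, the CLI value wins, yet the profiler's guess remains available on `profiler_evidence`, so a later report could still show what the profiler detected.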
StageAResult, StageBColumnResult, StageBBatchResult, StageCResult, StageCBatchResult, StageStatus, UnresolvedColumn models. Stage A/B prompt builders with domain context slots. Critical column identification, coverage computation, B pass/fail logic. Stage C deterministic trigger with low-cardinality fallback.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
interpret_table_staged() runs full A→B→C→merge. Merge ownership matrix: A=entity, B=property, C=decoded values. Bounded B recovery: retry, split, Tier 1 rescue. semantic_unresolved produced for low-confidence ambiguous columns. VocabColumnContext enriched with B output at version 1. use_staged=True default with PromptLayers rollout flags.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
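One way to read the merge ownership matrix (A=entity, B=property, C=decoded values) is as a filter that keeps an assertion only when it comes from the stage that owns its predicate. The sketch below assumes a tuple shape of `(stage, subject_ref, predicate, value)` and drops unowned predicates; both choices are illustrative assumptions, and `merge_by_ownership` is a hypothetical name.

```python
# Predicate -> owning stage. The three predicate names appear in the rollout
# notes; treating ownership as a simple lookup table is an assumption.
OWNERSHIP = {
    "has_entity_name": "A",
    "has_property_name": "B",
    "has_decoded_value": "C",
}


def merge_by_ownership(candidates):
    """Keep only assertions emitted by the stage that owns their predicate.

    `candidates` is an iterable of (stage, subject_ref, predicate, value) tuples.
    """
    return [c for c in candidates if OWNERSHIP.get(c[2]) == c[0]]


candidates = [
    ("A", "cbioportal.patient", "has_entity_name", "Patient"),
    ("B", "cbioportal.patient", "has_entity_name", "Person"),        # B does not own entities
    ("B", "cbioportal.patient.gender", "has_property_name", "Gender"),
    ("C", "cbioportal.patient.gender", "has_decoded_value", "Male"),
    ("C", "cbioportal.patient.gender", "has_property_name", "Sex"),  # C does not own properties
]
merged = merge_by_ownership(candidates)
```

With this input, `merged` keeps three assertions, one per stage, and discards the two cross-stage claims.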
Assertion dump/load for checkpoint comparison. Structured diff with regression flagging. TableTelemetry/PipelineTelemetry with milestone report builder. 13-table dev slice and 10-table holdout definitions.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Domain bias header with conflict handling. Healthcare/generic semantic type inventories. Vocabulary family hints for healthcare domain. 5 Stage A, 12 Stage B, 8 Stage C few-shot examples. Holdout disjointness verified.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Adds `sema ingest` and `sema push` subcommands backed by a DuckDB staging area. Parses cBioPortal clinical/MAF/timeline files and OMOP CDM DDL + vocabulary CSVs into DuckDB, then pushes to Databricks via Arrow with optional COPY INTO when a cloud staging URI is configured.
- `src/sema/ingest/`: cBioPortal + OMOP parsers, DuckDB staging, Databricks push
- `src/sema/cli_ingest.py`: click group wiring `ingest` and `push` commands
- `src/sema/models/config.py`: `IngestConfig`, `IngestOmopConfig`, `IngestDatabricksTargetConfig` with env-prefix settings
- `pyproject.toml`: adds `duckdb>=1.0.0` and `pyarrow>=14.0.0` runtime deps
- `.env.example`: documents INGEST_* env vars
Unit coverage across parsers, staging lifecycle, Databricks bridge provisioning, and CLI wiring (63 tests).
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Makes rollout steps 2–6 of source-semantic-hardening executable with a single command per step. Per-table assertion + telemetry dumps are written when `eval_dump_dir` is set on BuildConfig; `slice_tables` filters discovered work items to a named subset.
- `sema eval run --slice <yaml> --label <l> --output-dir <d>`: runs the pipeline on the slice and writes `<table>__<label>.json` + paired `<table>__<label>__telemetry.json` per table.
- `sema eval diff --baseline <d> --current <d>`: pairs dumps by table and aggregates semantic churn using the existing `diff_dumps` keyed on `(subject_ref, predicate)`; covers L2 and L3 assertions.
- `sema eval report --run <d> --label <l> [--baseline <d>]`: aggregates telemetry across tables and, if a baseline dir is given, folds in the churn summary, producing a milestone-ready JSON report.
Wiring:
- `BuildConfig` gains `eval_dump_dir`, `eval_config_label`, `slice_tables`.
- `process_table` accepts `eval_dump_dir`/`eval_config_label` and calls the dump hook after successful runs; failures to dump are logged, not raised.
- `_run_pipeline_stages` now returns `(assertions, staged_output)` so `process_table` can extract telemetry without a second pass.
- `_discover_tables` filters via `_filter_work_items_to_slice` when `slice_tables` is set.
Runbook at `docs/runbooks/source-semantic-eval.md` lists the exact commands for tasks 5.6–5.7, 7.7–7.8, 8.7–8.9, 9.8–9.9, and 10.1–10.5.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
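The idea of a diff keyed on `(subject_ref, predicate)` can be sketched as below. This is not the project's `diff_dumps`: the function name `sketch_diff`, the assertion dict shape (a `value` field), and the `HIGH_VALUE_PREDICATES` set are assumptions made for illustration.

```python
# Predicates whose removal is treated as a possible regression in this sketch.
HIGH_VALUE_PREDICATES = {"has_property_name", "has_entity_name"}


def sketch_diff(baseline, current):
    """Compare two assertion lists by their (subject_ref, predicate) key."""
    def index(assertions):
        return {(a["subject_ref"], a["predicate"]): a["value"] for a in assertions}

    base, cur = index(baseline), index(current)
    added = sorted(cur.keys() - base.keys())
    removed = sorted(base.keys() - cur.keys())
    changed = sorted(k for k in base.keys() & cur.keys() if base[k] != cur[k])
    return {
        "added": added,
        "removed": removed,
        "changed": changed,
        # A removed high-value predicate is flagged rather than silently counted.
        "regression_risk": [k for k in removed if k[1] in HIGH_VALUE_PREDICATES],
    }


baseline = [
    {"subject_ref": "t.BIOTYPE", "predicate": "has_property_name", "value": "Biotype"},
    {"subject_ref": "t.age", "predicate": "has_property_name", "value": "Age"},
]
current = [
    {"subject_ref": "t.BIOTYPE (STRING)", "predicate": "has_property_name", "value": "Biotype"},
    {"subject_ref": "t.age", "predicate": "has_property_name", "value": "Patient Age"},
]
report = sketch_diff(baseline, current)
```

Because the key includes the subject reference verbatim, a subject whose name drifts between runs shows up as one removal plus one addition rather than as a change.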
TableTelemetry previously carried zeros for stage_*_latency_ms and tokens_*. The staged engine's LLM calls were not measured and the kwargs to from_stages() were never passed. Eval runs reported avg_total_latency_ms: 0.0 and made the $0.10/table cost gate and 60s/table latency gate unverifiable.
- `LLMClient` gains an `InvocationStats` dataclass populated on every `invoke()`: wall-clock duration in ns, prompt/response char counts, and prompt/completion token counts (pulled from `usage_metadata` / `response_metadata` when present, else a ~4 chars/token estimate).
- `SemanticEngine.interpret_table_staged_with_metrics()` wraps the client's `invoke` for the duration of a staged run so that every batched Stage B call and every Stage C column call contributes to the table's `StageMetrics` (tokens per stage via accumulation, latency per stage via `time.monotonic_ns` bookends).
- `interpret_table_staged` preserved as a thin wrapper so existing callers are unchanged.
- `build_utils._run_semantic_interpretation` now threads the metrics through to `TableTelemetry.from_stages()`.
Verified end-to-end on 6-table dev slice: avg_total_latency_ms = 28533, tokens input = 20983, tokens output = 23417, well under the cost/latency gates.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
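The measurement approach described above (monotonic-clock bookends, reported token counts when available, a roughly 4 chars/token estimate otherwise) might look like the following sketch. `InvocationStatsSketch`, `measured_invoke`, the `(text, usage)` return shape of the callable, and the `input_tokens` / `output_tokens` keys are all assumptions for the example, not the project's `LLMClient` API.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class InvocationStatsSketch:
    duration_ns: int
    prompt_chars: int
    response_chars: int
    prompt_tokens: int
    completion_tokens: int
    estimated: bool  # True when token counts came from the chars/4 fallback


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: about one token per four characters, rounded up."""
    return math.ceil(len(text) / chars_per_token)


def measured_invoke(call: Callable[[str], Tuple[str, Optional[dict]]],
                    prompt: str) -> Tuple[str, InvocationStatsSketch]:
    """Run `call(prompt)` and record duration, sizes, and token counts."""
    start = time.monotonic_ns()
    text, usage = call(prompt)
    duration_ns = time.monotonic_ns() - start

    usage = usage or {}
    reported = "input_tokens" in usage and "output_tokens" in usage
    stats = InvocationStatsSketch(
        duration_ns=duration_ns,
        prompt_chars=len(prompt),
        response_chars=len(text),
        prompt_tokens=usage["input_tokens"] if reported else estimate_tokens(prompt),
        completion_tokens=usage["output_tokens"] if reported else estimate_tokens(text),
        estimated=not reported,
    )
    return text, stats


# Stand-in callables in place of a real model client.
_, fallback_stats = measured_invoke(lambda p: ("x" * 10, None), "a" * 8)
_, reported_stats = measured_invoke(
    lambda p: ("ok", {"input_tokens": 5, "output_tokens": 7}), "hello"
)
```

`time.monotonic_ns` is used instead of wall-clock time so that system clock adjustments during a run cannot produce negative or inflated durations.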
…ortal ingest
Six tables actually loaded in workspace.cbioportal (patient, sample, mutation, timeline_sample_acquisition, timeline_status, timeline_treatment), a subset of the 13 tables in dev_slice.yaml. Used for initial rollout evaluation runs until full ingest lands.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Caught in the step-2 dev-slice eval: Stage B occasionally returned column names with an embedded type spec (e.g. 'BIOTYPE (STRING)', 'age [INT]', 'patient_id: VARCHAR'). The merge step uses the column field verbatim to build '<table_ref>.<col>' subject refs, so a single noisy column silently severed the link between L2 property assertions and the extractor's COLUMN_EXISTS assertions. The downstream effect was a 'regression_risk' removal in the diff tool.
Adds `sanitize_column_name` (strips the first whitespace / paren / bracket / colon onward) and applies it to every StageBColumnResult returned by `_invoke_stage_b_batch` before it reaches the merge or vocab-context builder. LLM non-determinism occasionally skips the leak entirely (step 3 domain-aware had zero) but the fix is cheap insurance and costs nothing on clean output.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
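The stripping rule described above can be sketched with a single regular expression. This is a re-implementation from the commit's description, not the project's source, and it assumes column names themselves never contain whitespace, parentheses, brackets, or colons.

```python
import re

# First character that marks the start of a leaked type spec:
# whitespace, "(", "[", or ":".
_TYPE_SPEC_START = re.compile(r"[\s(\[:]")


def sanitize_column_name(raw: str) -> str:
    """Return the column name with any trailing type annotation removed."""
    name = raw.strip()
    match = _TYPE_SPEC_START.search(name)
    return name[:match.start()] if match else name


cleaned = [sanitize_column_name(c) for c in
           ["BIOTYPE (STRING)", "age [INT]", "patient_id: VARCHAR", "hugo_symbol"]]
```

Applied to the three noisy names quoted in the commit, this yields `BIOTYPE`, `age`, and `patient_id`, while a clean name such as `hugo_symbol` passes through unchanged.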
Step 4 dev-slice eval showed 52 alias regressions vs step 3 (domain-aware). Root cause: none of the 12 Stage B few-shot examples in few_shot.py populated a `synonyms` field. The LLM imitated the examples' empty-by-omission pattern and dropped aliases that step 3 was emitting.
Changes:
- Add realistic `synonyms` lists to 8 of 12 Stage B examples (patient_id, sample_id, gender, tmb, msi_type, hugo_symbol, variant_classification, agent, stage_highest). Examples without synonyms remain to demonstrate empty-is-valid.
- Switch `format_examples` to compact JSON (no indent), which recoups most of the token cost added by the synonyms.
Measured impact on 6-table dev slice:
- Alias regression 52 → 16 (−69%)
- Output tokens 22,935 → 23,566 (+631, LLM restored alias emission)
- Input tokens 41,623 → 41,148 (−475, compact JSON)
- All 6 tables still B_SUCCESS with 100% coverage
The +17k input token bump from enabling few-shot in step 4 is the fixed cost of including the full Stage A+B+C blocks in each of 18+ LLM calls per slice run. It is not a bug, just the price of few-shot.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
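To see why dropping indentation saves prompt tokens, compare the serialized length of one few-shot example in both forms. The example record below is hypothetical (its field names and synonyms are invented for illustration), and the tight `separators` setting is an extra step beyond simply omitting `indent`.

```python
import json

# Hypothetical Stage B few-shot example; field names are illustrative only.
example = {
    "column": "gender",
    "property_name": "Gender",
    "synonyms": ["sex", "biological sex"],
}

indented = json.dumps(example, indent=2)              # pretty-printed form
no_indent = json.dumps(example)                       # single line, ", " and ": " separators
compact = json.dumps(example, separators=(",", ":"))  # single line, no padding at all

sizes = {"indented": len(indented), "no_indent": len(no_indent), "compact": len(compact)}
```

All three strings parse back to the same object, so the choice only affects prompt size, not what the model is shown semantically.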
Versions the per-table assertion dumps, telemetry dumps, diff reports,
and milestone reports produced during the source-semantic-hardening
rollout. These back up the task completion claims in
openspec/changes/source-semantic-hardening/tasks.md (which is in a
gitignored path) and serve as a reference baseline for future
evaluation runs.
Contents of eval-runs/:
- step2-baseline-single-pass/ # pre-decomposition reference
- step2-staged-zeroshot/ # A→B decomposition, zero-shot
- step3-domain-aware/ # + domain bias / type inventory / vocab hints
- step4-few-shot/ # + healthcare few-shot (post alias-fix)
- step5-stage-c/ # + Stage C value decoding (full pipeline)
- step{2,3,4,5}-diff.json # churn summaries vs prior step
- step{2,3,4,5}-report.json # per-step milestone reports
- end-to-end-diff.json # baseline → full pipeline delta
Scope: the 6-table POC slice (eval/dev_slice_poc.yaml) reflecting
current Databricks ingest. Holdout and full-corpus runs blocked on
ingest of the remaining 27 cBioPortal tables — see §11-bis in
tasks.md.
eval-runs/*.log added to .gitignore (transient runtime output).
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
… parsers
Extends the cBioPortal ingest to cover five new file types, unlocking the remaining dev-slice tables (structural_variant, cna, gene_panel_matrix, resource_definition/patient, clinical_supp_*).
New parsers:
- parse_sv_file: data_sv.txt → structural_variant (position/entrez-gene-id columns typed as BIGINT via sv_column_type helper)
- parse_cna_file: data_cna.txt (gene×sample wide matrix) pivoted to long format with sample_id / hugo_symbol / entrez_gene_id / cna_value. Blank cells become nulls. cna_long_format_rows helper lives in cbioportal_utils.py.
- parse_gene_panel_matrix: data_gene_panel_matrix.txt as-is
- parse_resource_file: data_resource_*.txt (definition and per-patient/sample entries)
Ingest orchestration:
- _should_download now allows data_sv.txt, data_cna.txt, data_gene_panel_matrix.txt, data_resource_*, data_clinical_supp_* via DOWNLOAD_EXACT_FILENAMES / DOWNLOAD_PREFIXES / EXCLUDED_DOWNLOAD_PREFIXES constants in cbioportal_utils.py
- SKIP_FILENAME_PATTERNS narrowed to only truly unsupported matrix files (expression, methylation, log2/linear/armlevel CNA, mrna, rppa)
- _ingest_study_dir wires three new fixed-file parsers (_try_ingest_fixed_files) plus prefix-matched passes for data_resource_* and data_clinical_supp_* (_ingest_prefix_matched_files)
Verified end-to-end against gbm_tcga_pan_can_atlas_2018: DuckDB now holds 12 cbioportal tables including cna (14.4M long-format rows pivoted from ~24k genes × ~600 samples), structural_variant (510 rows), gene_panel_matrix, resource_definition/patient, and clinical_supp_hypoxia.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
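The wide-to-long pivot described for data_cna.txt can be sketched as below. This is an illustrative version, not the project's `cna_long_format_rows`: it assumes the first two header columns are `Hugo_Symbol` and `Entrez_Gene_Id`, that every remaining column is a sample ID, and that non-blank cells are integers.

```python
import csv
import io
from typing import Iterator, Optional


def _to_int(cell: str) -> Optional[int]:
    """Blank cells become None; everything else is assumed to be an integer."""
    cell = cell.strip()
    return int(cell) if cell else None


def cna_wide_to_long(text: str) -> Iterator[dict]:
    """Yield one row per (gene, sample) cell of a tab-separated wide matrix."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    sample_ids = header[2:]  # assumed layout: gene symbol, gene id, then samples
    for row in reader:
        hugo_symbol, entrez_gene_id = row[0], _to_int(row[1])
        for sample_id, cell in zip(sample_ids, row[2:]):
            yield {
                "sample_id": sample_id,
                "hugo_symbol": hugo_symbol,
                "entrez_gene_id": entrez_gene_id,
                "cna_value": _to_int(cell),
            }


wide = (
    "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\n"
    "TP53\t7157\t-1\t\n"
    "EGFR\t1956\t2\t0\n"
)
long_rows = list(cna_wide_to_long(wide))
```

Every cell produces a row, including blank ones, so the long table has genes × samples rows; that matches the reported scale of roughly 24k genes × 600 samples ≈ 14.4M rows.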
Now that the cBioPortal ingest has been extended to cover SV, CNA, gene-panel matrix, resources, and clinical supplements, the dev slice grows from the original 6-table POC (patient, sample, mutation, 3 timelines) to 12 tables sourced from gbm_tcga_pan_can_atlas_2018.
Full A→B→C staged pipeline results on all 12 tables:
- 12/12 B_SUCCESS, 100% raw and critical coverage across every table
- 0 retries, 0 splits, 0 rescues: zero recovery overhead
- 69 Stage C calls → 195 has_decoded_value assertions
- 259 has_property_name assertions (up from 222 on the 6-table slice)
- Avg latency 25.2s / table, under the 60s gate; the peak was 105s on mutation's 114 columns
- Total cost $0.0160 for all 12 tables ($0.0013/table, 77× under the $0.10/table gate)
Spot-checks on the four new table types:
- structural_variant: correct entity "Structural Variant" with grain "one row per structural variant ... per sample"; Stage C correctly decoded in-frame vs frameshift mutation semantics
- cna (long format): 4 columns classified as sample_id / hugo_symbol / entrez_gene_id / cna_value, one Stage C call
- gene_panel_matrix, resource_definition, resource_patient: all identifier-heavy tables classified as expected
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
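The gate arithmetic quoted above can be re-derived from the reported totals. The figures are taken from the commit message; whether the latency gate is meant as a per-table average or a per-table maximum is not stated there, so the sketch checks both.

```python
# Reported figures from the 12-table run.
TOTAL_COST_USD = 0.0160
TABLES = 12
AVG_LATENCY_S = 25.2
PEAK_LATENCY_S = 105.0

# Gates named in the rollout notes.
COST_GATE_USD = 0.10
LATENCY_GATE_S = 60.0

cost_per_table = round(TOTAL_COST_USD / TABLES, 4)  # 0.0013
headroom = COST_GATE_USD / cost_per_table           # about 77x

checks = {
    "cost_under_gate": cost_per_table < COST_GATE_USD,
    "avg_latency_under_gate": AVG_LATENCY_S < LATENCY_GATE_S,
    "peak_latency_under_gate": PEAK_LATENCY_S < LATENCY_GATE_S,
}
```

The cost and average-latency checks pass; the peak-latency check does not, which only matters if the gate is interpreted per table rather than on the average.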
The staged A→B→C pipeline is proven on the 12-table dev slice and becomes the sole L2 path. This commit removes everything the rollout kept around through step 6.
Removed from src/sema/engine/semantic.py:
- PropertyInterpretation and TableInterpretation (old response schemas)
- _PropertyBatchResult (two-pass batch schema)
- build_interpretation_prompt, build_simplified_interpretation_prompt
- build_summary_prompt, build_property_prompt
- _needs_two_pass, _interpret_two_pass
- _interpret_via_llm_client, _interpret_via_raw_llm
- _run_summary_pass, _run_property_pass
- _entity_assertions, _property_assertions, _interpretation_to_assertions
- SemanticEngine.__init__(..., llm=...) raw-LLM legacy kwarg
Removed files:
- src/sema/engine/semantic_utils.py (entire file, all legacy helpers)
- tests/unit/test_two_pass_semantic.py (legacy path tests)
Reshaped:
- SemanticEngine.interpret_table now delegates to interpret_table_staged_with_metrics and returns just assertions: one staged path for every table regardless of width
- pipeline.build_utils._run_semantic_interpretation drops the use_staged branch; always returns (assertions, _StagedOutput)
- pipeline.build._run_pipeline_stages returns (assertions, staged_output) unconditionally
- process_table, _spawn_workers*, and BuildConfig lose the use_staged flag; cli_eval drops --use-staged/--no-use-staged
- Tests updated to mock the staged sequence (StageAResult + StageBBatchResult) instead of TableInterpretation
Test suite: 1004/1004 passing, mypy clean on 94 source files. Test count dropped from 1041 → 1004 (the 37 removed tests all exercised the deprecated legacy path).
Follow-up not addressed here: semantic.py (520) and build_utils.py (508) both exceed the project's 400-line file standard. They were already over (745 and 514 pre-cleanup). Splitting them is a separate refactor. The simplest next step is extracting interpret_table_staged_with_metrics + the Stage A/B/C runners into stage_utils.py, which shaves ~200 lines from semantic.py.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Validates that the Task 11 refactor (legacy L2 code removal, 17-file diff, -1494 LOC) did not alter pipeline behavior.
Results vs pre-cleanup step 5 v2:
- 12/12 tables B_SUCCESS, 100% coverage, zero recovery
- 259 has_property_name (identical to pre-cleanup)
- 12 has_entity_name (identical)
- 140 vs 195 has_decoded_value, 62 vs 69 Stage C calls: stochastic LLM variation well within run-to-run noise
- Cost $0.016 identical; latency 278s vs 302s (8% faster, noise)
- Diff: 23 added / 22 removed; symmetric, indicates zero regression
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Full staged pipeline on the 12-table slice, Neo4j wiped first.
Pipeline:
- 12/12 tables B_SUCCESS @ 100% raw and critical coverage, zero recovery
- 12 entities, 259 properties, 174 decoded values, 81 Stage C calls
- 285s total / 23.8s avg, tokens 73,346 in + 34,614 out
- Cost $0.0159 ($0.0013/table, $0.00006/column), 77× under budget
Neo4j state (3,755 nodes after materialization):
- Catalog/Schema/DataSource: 1 each
- Table: 12 ✓
- Entity: 12 (semantically correct: 'Biospecimen/Sample', 'Copy Number Alteration', 'Somatic Mutation', 'Structural Variant', 'Patient Hypoxia Assessment', 'Patient Status Event', 'Sample Acquisition Event', 'Sample Genomic Profile Availability', 'Treatment Event', etc.)
- Column: 259 ✓
- Property: 259 ✓
- ValueSet: 150 / Term: 290 (from Stage C)
- Alias: 452 / Vocabulary: 143 (from L3)
- Assertion provenance: 2,175
- Edges: HAS_PROPERTY, PROPERTY_ON_COLUMN, ENTITY_ON_TABLE, CLASSIFIED_AS, HAS_VALUE_SET, MEMBER_OF, REFERS_TO (all present)
Diff vs pre-cleanup baseline (step5-stage-c-v2):
- 45 added, 24 removed, 678 changed
- Added: 18 aliases + 27 decoded values (Stage C picked more columns)
- Removed: 14 decoded values + 10 aliases (LLM variation)
- Zero high-value regressions (no property_name / semantic_type / entity_name losses)
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
174fe41 to
a047c78
Summary
Replaces the monolithic single-pass / two-pass L2 semantic interpretation with a staged A→B→C→merge pipeline, adds a domain layer (healthcare first), a few-shot example library, and an evaluation harness.
`source-semantic-hardening` OpenSpec change; steps 6 / holdout bias validated on 12-table POC slice, full 33-table corpus gated on remaining cBioPortal ingest
What's in it
Staged L2 pipeline (§2–4)
- `StageAResult`, `StageBColumnResult`, `StageBBatchResult`, `StageBResult`, `StageCResult`, `StageCBatchResult`, `StageStatus` schemas (`src/sema/models/stages.py`)
- `B_SUCCESS` / `B_PARTIAL` / `B_FAILED` outcomes
- `has_decoded_value`; no `vocabulary_match` emitted from L2
Domain layer (§1, §7)
- `DomainContext`, `DomainCandidate` models + `--domain` CLI flag + config + profiler-based detection (`src/sema/models/domain.py`)
Few-shot library (§8)
Evaluation harness (§5–6)
Cleanup (§11)
Milestone results (12-table POC slice)
See `eval-runs/step6-milestone-summary.md`.
Every removal cluster flagged across the 5-step rollout is root-caused and either design-intended (`vocabulary_match` → L3 per design §2a; `has_decoded_value` restored at step 5) or fixed (`BIOTYPE (STRING)` column-name leak in `46384de`, alias regression in `783266d`). No open systemic regressions; no high-value predicates lost.
What's still open
All blocked on ingesting the remaining ~21 cBioPortal tables (see `§11-bis Pending ingest` in `openspec/changes/source-semantic-hardening/tasks.md`):
Known issue discovered during spot-check
`patient.SUBTYPE=GBM_IDHmut-non-codel` gets decoded by Stage C as `"Glioblastoma, IDH mutant, non-codisplayed (non-codel)"`. "Non-codisplayed" is an LLM hallucination — the correct clinical term is non-codeleted (1p/19q codeletion status in glioma classification). Low-severity single-assertion issue; points to a real gap — Stage C emits a `codebook_lookup_needed` flag but nothing consumes it. Will be filed as a follow-up issue.
Test plan