refactor: extract cBioPortal to showcase/ and add generic few-shot base by deanban · Pull Request #74 · Nine-Sigma/sema

deanban · 2026-04-21T19:20:07Z

Closes #75, #76.

Follow-ups tracked separately: #77, #78, #79.

Summary

Two structural changes that set up SEMA OpenCore for additional sources/domains without rewriting the platform:

showcase/cbioportal_to_omop/ — moves the cBioPortal parser + slice YAMLs + their tests out of src/sema/ingest/ into a dedicated showcase folder. src/sema/ingest/ now holds only generic infrastructure (DuckDB staging, Databricks push, OMOP reference target).
Generic few-shot base — splits few_shot.py into a thin registry + two domain packs (few_shot_generic.py, few_shot_healthcare.py). format_examples() now composes generic + {domain} so every prompt gets industry-agnostic archetypes plus whatever domain-specific guidance exists. Healthcare-first becomes healthcare-on-top-of-generic.

Stacks on #63

Targets main. Based on the HEAD of dean/feat/source-semantic-hardening, so the diff will be clean once #63 merges.

What moved

from	to
`src/sema/ingest/cbioportal.py`	`showcase/cbioportal_to_omop/parsers.py`
`src/sema/ingest/cbioportal_utils.py`	`showcase/cbioportal_to_omop/cbioportal_utils.py`
`eval/dev_slice.yaml`	`showcase/cbioportal_to_omop/slices/dev_slice.yaml`
`eval/dev_slice_poc.yaml`	`showcase/cbioportal_to_omop/slices/dev_slice_poc.yaml`
`eval/holdout.yaml`	`showcase/cbioportal_to_omop/slices/holdout.yaml`
`tests/unit/test_cbioportal_parsers.py`	`tests/showcase/cbioportal_to_omop/test_parsers.py`
`tests/unit/test_cbioportal_extended_parsers.py`	`tests/showcase/cbioportal_to_omop/test_extended_parsers.py`

All moves use git mv so history is preserved.

Packaging

showcase/ is not part of the installable sema package — hatchling still only picks up src/sema/.
From a source checkout (uv run sema ingest cbioportal ...) the showcase is importable because the project root is on sys.path.
cli_ingest.py lazy-imports showcase.cbioportal_to_omop.parsers inside the command; if the showcase isn't importable, the user gets a clear ClickException instead of a startup ImportError.

Generic few-shot base

New few_shot_generic.py covers archetypes that appear across every industry:

Stage A (5 examples): event-stream, transaction-N-per-parent, dimension, bridge, wide-profile tables.
Stage B (8 examples): identifier (PK/FK), temporal, numeric, categorical-encoded, free text, boolean, ordinal.
Stage C (4 examples): status labels, Y/N flags, prefix-encoded codes (0:SUCCESS), ordinal ranking.

format_examples(domain, stage) now composes generic + {domain} instead of looking up a single domain key. Behavior changes:

format_examples(None, "A") previously returned "" (zero-shot). Now returns the generic base block — every prompt gets some teaching signal.
format_examples("healthcare", "B") previously emitted 12 healthcare examples. Now emits 8 generic + 12 healthcare = 20 composed examples. Token budget for Stage B raised from 1200 → 2100 to match.
get_examples(domain, stage) is unchanged — still returns the per-domain array only. compose_examples(domain, stage) is the new function for composed retrieval.

Test plan

uv run pytest — 1008 passed, 1 skipped
uv run mypy src/sema/ — clean on 94 source files
uv run pytest --cov=sema --cov-report=term — 87% coverage
CLI test updated: sema ingest cbioportal still exits 0 against the mocked showcase parser
Holdout disjointness test updated to new slice path
Added tests for the generic archetypes (Stage A/B/C) and the composition ordering

Follow-ups, not in this PR

Port the structured variant detection / fusion-partner few-shot once a second genomics source lands — right now the example still implies cBioPortal shapes.
Extract cBioPortal to a separate sema-cbioportal package with pip install sema[cbioportal] extras when a second source adapter shows up.
Populate vocab-family hints + semantic type inventory for financial, real_estate, logistics in domain_prompts.py (currently header-only).

Domain precedence: CLI > config > profiler > default. Profiler evidence preserved when CLI/config overrides. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

StageAResult, StageBColumnResult, StageBBatchResult, StageCResult, StageCBatchResult, StageStatus, UnresolvedColumn models. Stage A/B prompt builders with domain context slots. Critical column identification, coverage computation, B pass/fail logic. Stage C deterministic trigger with low-cardinality fallback. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

interpret_table_staged() runs full A→B→C→merge. Merge ownership matrix: A=entity, B=property, C=decoded values. Bounded B recovery: retry, split, Tier 1 rescue. semantic_unresolved produced for low-confidence ambiguous columns. VocabColumnContext enriched with B output at version 1. use_staged=True default with PromptLayers rollout flags. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Assertion dump/load for checkpoint comparison. Structured diff with regression flagging. TableTelemetry/PipelineTelemetry with milestone report builder. 13-table dev slice and 10-table holdout definitions. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Domain bias header with conflict handling. Healthcare/generic semantic type inventories. Vocabulary family hints for healthcare domain. 5 Stage A, 12 Stage B, 8 Stage C few-shot examples. Holdout disjointness verified. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Adds `sema ingest` and `sema push` subcommands backed by a DuckDB staging area. Parses cBioPortal clinical/MAF/timeline files and OMOP CDM DDL + vocabulary CSVs into DuckDB, then pushes to Databricks via Arrow with optional COPY INTO when a cloud staging URI is configured. - `src/sema/ingest/`: cBioPortal + OMOP parsers, DuckDB staging, Databricks push - `src/sema/cli_ingest.py`: click group wiring `ingest` and `push` commands - `src/sema/models/config.py`: `IngestConfig`, `IngestOmopConfig`, `IngestDatabricksTargetConfig` with env-prefix settings - `pyproject.toml`: adds `duckdb>=1.0.0` and `pyarrow>=14.0.0` runtime deps - `.env.example`: documents INGEST_* env vars Unit coverage across parsers, staging lifecycle, Databricks bridge provisioning, and CLI wiring (63 tests). Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Makes rollout steps 2–6 of source-semantic-hardening executable with a single command per step. Per-table assertion + telemetry dumps are written when `eval_dump_dir` is set on BuildConfig; `slice_tables` filters discovered work items to a named subset. - `sema eval run --slice <yaml> --label <l> --output-dir <d>`: runs the pipeline on the slice and writes `<table>__<label>.json` + paired `<table>__<label>__telemetry.json` per table. - `sema eval diff --baseline <d> --current <d>`: pairs dumps by table and aggregates semantic churn using the existing `diff_dumps` keyed on `(subject_ref, predicate)` — covers L2 and L3 assertions. - `sema eval report --run <d> --label <l> [--baseline <d>]`: aggregates telemetry across tables and, if a baseline dir is given, folds in the churn summary — produces a milestone-ready JSON report. Wiring: - `BuildConfig` gains `eval_dump_dir`, `eval_config_label`, `slice_tables`. - `process_table` accepts `eval_dump_dir`/`eval_config_label` and calls the dump hook after successful runs; failures to dump are logged, not raised. - `_run_pipeline_stages` now returns `(assertions, staged_output)` so `process_table` can extract telemetry without a second pass. - `_discover_tables` filters via `_filter_work_items_to_slice` when `slice_tables` is set. Runbook at `docs/runbooks/source-semantic-eval.md` lists the exact commands for tasks 5.6–5.7, 7.7–7.8, 8.7–8.9, 9.8–9.9, and 10.1–10.5. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

TableTelemetry previously carried zeros for stage_*_latency_ms and tokens_*. The staged engine's LLM calls were not measured and the kwargs to from_stages() were never passed. Eval runs reported avg_total_latency_ms: 0.0 and made the $0.10/table cost gate and 60s/table latency gate unverifiable. - `LLMClient` gains an `InvocationStats` dataclass populated on every `invoke()`: wall-clock duration in ns, prompt/response char counts, and prompt/completion token counts (pulled from `usage_metadata` / `response_metadata` when present, else a ~4 chars/token estimate). - `SemanticEngine.interpret_table_staged_with_metrics()` wraps the client's `invoke` for the duration of a staged run so that every batched Stage B call and every Stage C column call contributes to the table's `StageMetrics` (tokens per stage via accumulation, latency per stage via `time.monotonic_ns` bookends). - `interpret_table_staged` preserved as a thin wrapper so existing callers are unchanged. - `build_utils._run_semantic_interpretation` now threads the metrics through to `TableTelemetry.from_stages()`. Verified end-to-end on 6-table dev slice: avg_total_latency_ms = 28533, tokens input = 20983, tokens output = 23417 — well under the cost/latency gates. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

…ortal ingest Six tables actually loaded in workspace.cbioportal (patient, sample, mutation, timeline_sample_acquisition, timeline_status, timeline_treatment) — a subset of the 13 tables in dev_slice.yaml. Used for initial rollout evaluation runs until full ingest lands. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Caught in the step-2 dev-slice eval: Stage B occasionally returned column names with an embedded type spec (e.g. 'BIOTYPE (STRING)', 'age [INT]', 'patient_id: VARCHAR'). The merge step uses the column field verbatim to build '<table_ref>.<col>' subject refs, so a single noisy column silently severed the link between L2 property assertions and the extractor's COLUMN_EXISTS assertions. The downstream effect was a 'regression_risk' removal in the diff tool. Adds `sanitize_column_name` (strips the first whitespace / paren / bracket / colon onward) and applies it to every StageBColumnResult returned by `_invoke_stage_b_batch` before it reaches the merge or vocab-context builder. LLM non-determinism occasionally skips the leak entirely (step 3 domain-aware had zero) but the fix is cheap insurance and costs nothing on clean output. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Step 4 dev-slice eval showed 52 alias regressions vs step 3 (domain- aware). Root cause: none of the 12 Stage B few-shot examples in few_shot.py populated a `synonyms` field. The LLM imitated the examples' empty-by-omission pattern and dropped aliases that step 3 was emitting. Changes: - Add realistic `synonyms` lists to 8 of 12 Stage B examples (patient_id, sample_id, gender, tmb, msi_type, hugo_symbol, variant_classification, agent, stage_highest). Examples without synonyms remain to demonstrate empty-is-valid. - Switch `format_examples` to compact JSON (no indent) — recoups most of the token cost added by the synonyms. Measured impact on 6-table dev slice: - Alias regression 52 → 16 (−69%) - Output tokens 22,935 → 23,566 (+631, LLM restored alias emission) - Input tokens 41,623 → 41,148 (−475, compact JSON) - All 6 tables still B_SUCCESS with 100% coverage The +17k input token bump from enabling few-shot in step 4 is the fixed cost of including the full Stage A+B+C blocks in each of 18+ LLM calls per slice run — not a bug, just the price of few-shot. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Versions the per-table assertion dumps, telemetry dumps, diff reports, and milestone reports produced during the source-semantic-hardening rollout. These back up the task completion claims in openspec/changes/source-semantic-hardening/tasks.md (which is in a gitignored path) and serve as a reference baseline for future evaluation runs. Contents of eval-runs/: - step2-baseline-single-pass/ # pre-decomposition reference - step2-staged-zeroshot/ # A→B decomposition, zero-shot - step3-domain-aware/ # + domain bias / type inventory / vocab hints - step4-few-shot/ # + healthcare few-shot (post alias-fix) - step5-stage-c/ # + Stage C value decoding (full pipeline) - step{2,3,4,5}-diff.json # churn summaries vs prior step - step{2,3,4,5}-report.json # per-step milestone reports - end-to-end-diff.json # baseline → full pipeline delta Scope: the 6-table POC slice (eval/dev_slice_poc.yaml) reflecting current Databricks ingest. Holdout and full-corpus runs blocked on ingest of the remaining 27 cBioPortal tables — see §11-bis in tasks.md. eval-runs/*.log added to .gitignore (transient runtime output). Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

… parsers Extends the cBioPortal ingest to cover five new file types, unlocking the remaining dev-slice tables (structural_variant, cna, gene_panel_matrix, resource_definition/patient, clinical_supp_*). New parsers: - parse_sv_file — data_sv.txt → structural_variant (position/ entrez-gene-id columns typed as BIGINT via sv_column_type helper) - parse_cna_file — data_cna.txt (gene×sample wide matrix) pivoted to long format with sample_id / hugo_symbol / entrez_gene_id / cna_value. Blank cells become nulls. cna_long_format_rows helper lives in cbioportal_utils.py. - parse_gene_panel_matrix — data_gene_panel_matrix.txt as-is - parse_resource_file — data_resource_*.txt (definition and per-patient/sample entries) Ingest orchestration: - _should_download now allows data_sv.txt, data_cna.txt, data_gene_panel_matrix.txt, data_resource_*, data_clinical_supp_* via DOWNLOAD_EXACT_FILENAMES / DOWNLOAD_PREFIXES / EXCLUDED_DOWNLOAD_PREFIXES constants in cbioportal_utils.py - SKIP_FILENAME_PATTERNS narrowed to only truly unsupported matrix files (expression, methylation, log2/linear/armlevel CNA, mrna, rppa) - _ingest_study_dir wires three new fixed-file parsers (_try_ingest_fixed_files) plus prefix-matched passes for data_resource_* and data_clinical_supp_* (_ingest_prefix_matched_files) Verified end-to-end against gbm_tcga_pan_can_atlas_2018: DuckDB now holds 12 cbioportal tables including cna (14.4M long-format rows pivoted from ~24k genes × ~600 samples), structural_variant (510 rows), gene_panel_matrix, resource_definition/patient, and clinical_supp_hypoxia. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Now that the cBioPortal ingest has been extended to cover SV, CNA, gene-panel matrix, resources, and clinical supplements, the dev slice grows from the original 6-table POC (patient, sample, mutation, 3 timelines) to 12 tables sourced from gbm_tcga_pan_can_atlas_2018. Full A→B→C staged pipeline results on all 12 tables: - 12/12 B_SUCCESS, 100% raw and critical coverage across every table - 0 retries, 0 splits, 0 rescues — zero recovery overhead - 69 Stage C calls → 195 has_decoded_value assertions - 259 has_property_name assertions (up from 222 on the 6-table slice) - Avg latency 25.2s / table (peak 105s on mutation's 114 columns, still under the 60s gate) - Total cost $0.0160 for all 12 tables ($0.0013/table — 77× under the $0.10/table gate) Spot-checks on the four new table types: - structural_variant: correct entity "Structural Variant" with grain "one row per structural variant ... per sample"; Stage C correctly decoded in-frame vs frameshift mutation semantics - cna (long format): 4 columns classified as sample_id / hugo_symbol / entrez_gene_id / cna_value, one Stage C call - gene_panel_matrix, resource_definition, resource_patient: all identifier-heavy tables classified as expected Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

The staged A→B→C pipeline is proven on the 12-table dev slice and becomes the sole L2 path. Ripping out everything the rollout kept around through step 6. Removed from src/sema/engine/semantic.py: - PropertyInterpretation and TableInterpretation (old response schemas) - _PropertyBatchResult (two-pass batch schema) - build_interpretation_prompt, build_simplified_interpretation_prompt - build_summary_prompt, build_property_prompt - _needs_two_pass, _interpret_two_pass - _interpret_via_llm_client, _interpret_via_raw_llm - _run_summary_pass, _run_property_pass - _entity_assertions, _property_assertions, _interpretation_to_assertions - SemanticEngine.__init__(..., llm=...) raw-LLM legacy kwarg Removed files: - src/sema/engine/semantic_utils.py (entire file — all legacy helpers) - tests/unit/test_two_pass_semantic.py (legacy path tests) Reshaped: - SemanticEngine.interpret_table now delegates to interpret_table_staged_with_metrics and returns just assertions — one staged path for every table regardless of width - pipeline.build_utils._run_semantic_interpretation drops the use_staged branch; always returns (assertions, _StagedOutput) - pipeline.build._run_pipeline_stages returns (assertions, staged_output) unconditionally - process_table, _spawn_workers*, and BuildConfig lose the use_staged flag; cli_eval drops --use-staged/--no-use-staged - Tests updated to mock the staged sequence (StageAResult + StageBBatchResult) instead of TableInterpretation Test suite: 1004/1004 passing, mypy clean on 94 source files. Test count dropped from 1041 → 1004 (the 37 removed tests all exercised the deprecated legacy path). Follow-up not addressed here: semantic.py (520) and build_utils.py (508) both exceed the project's 400-line file standard. They were already over (745 and 514 pre-cleanup). Splitting them is a separate refactor — the simplest next step is extracting interpret_table_staged_with_metrics + the Stage A/B/C runners into stage_utils.py, which shaves ~200 lines from semantic.py. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Validates that Task 11 refactor (legacy L2 code removal, 17-file diff, -1494 LOC) did not alter pipeline behavior. Results vs pre-cleanup step 5 v2: - 12/12 tables B_SUCCESS, 100% coverage, zero recovery - 259 has_property_name (identical to pre-cleanup) - 12 has_entity_name (identical) - 140 vs 195 has_decoded_value, 62 vs 69 Stage C calls — stochastic LLM variation well within run-to-run noise - Cost /bin/zsh.016 identical; latency 278s vs 302s (8% faster, noise) - Diff: 23 added / 22 removed — symmetric, indicates zero regression Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Full staged pipeline on the 12-table slice, Neo4j wiped first. Pipeline: - 12/12 tables B_SUCCESS @ 100% raw and critical coverage, zero recovery - 12 entities, 259 properties, 174 decoded values, 81 Stage C calls - 285s total / 23.8s avg, tokens 73,346 in + 34,614 out - Cost $0.0159 ($0.0013/table, $0.00006/column) — 77× under budget Neo4j state (3,755 nodes after materialization): - Catalog/Schema/DataSource: 1 each - Table: 12 ✓ - Entity: 12 (semantically correct: 'Biospecimen/Sample', 'Copy Number Alteration', 'Somatic Mutation', 'Structural Variant', 'Patient Hypoxia Assessment', 'Patient Status Event', 'Sample Acquisition Event', 'Sample Genomic Profile Availability', 'Treatment Event', etc.) - Column: 259 ✓ - Property: 259 ✓ - ValueSet: 150 / Term: 290 (from Stage C) - Alias: 452 / Vocabulary: 143 (from L3) - Assertion provenance: 2,175 - Edges: HAS_PROPERTY, PROPERTY_ON_COLUMN, ENTITY_ON_TABLE, CLASSIFIED_AS, HAS_VALUE_SET, MEMBER_OF, REFERS_TO — all present Diff vs pre-cleanup baseline (step5-stage-c-v2): - 45 added, 24 removed, 678 changed - Added: 18 aliases + 27 decoded values (Stage C picked more columns) - Removed: 14 decoded values + 10 aliases (LLM variation) - Zero high-value regressions (no property_name / semantic_type / entity_name losses) Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

…ortal_to_omop/ Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

…dules Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

…tal-showcase-extract Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> # Conflicts: # src/sema/cli_eval.py # src/sema/cli_ingest.py # src/sema/engine/few_shot.py # tests/unit/test_cli_ingest.py # tests/unit/test_few_shot.py # tests/unit/test_few_shot_quality.py

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

…se (#74) * feat: add DomainContext model, CLI flag, and pipeline wiring Domain precedence: CLI > config > profiler > default. Profiler evidence preserved when CLI/config overrides. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * feat: add staged L2 schemas, Stage A/B prompts, and trigger logic StageAResult, StageBColumnResult, StageBBatchResult, StageCResult, StageCBatchResult, StageStatus, UnresolvedColumn models. Stage A/B prompt builders with domain context slots. Critical column identification, coverage computation, B pass/fail logic. Stage C deterministic trigger with low-cardinality fallback. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * feat: wire A→B→C→merge pipeline with recovery and enriched vocab context interpret_table_staged() runs full A→B→C→merge. Merge ownership matrix: A=entity, B=property, C=decoded values. Bounded B recovery: retry, split, Tier 1 rescue. semantic_unresolved produced for low-confidence ambiguous columns. VocabColumnContext enriched with B output at version 1. use_staged=True default with PromptLayers rollout flags. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * feat: add eval harness — assertion dump, diff, telemetry, dev slice Assertion dump/load for checkpoint comparison. Structured diff with regression flagging. TableTelemetry/PipelineTelemetry with milestone report builder. 13-table dev slice and 10-table holdout definitions. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * feat: add domain-aware prompts and healthcare few-shot library Domain bias header with conflict handling. Healthcare/generic semantic type inventories. Vocabulary family hints for healthcare domain. 5 Stage A, 12 Stage B, 8 Stage C few-shot examples. Holdout disjointness verified. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * test: add Stage C trigger, execution, merge, and partial failure tests Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * feat: add cBioPortal + OMOP ingest pipeline and Databricks bridge Adds `sema ingest` and `sema push` subcommands backed by a DuckDB staging area. Parses cBioPortal clinical/MAF/timeline files and OMOP CDM DDL + vocabulary CSVs into DuckDB, then pushes to Databricks via Arrow with optional COPY INTO when a cloud staging URI is configured. - `src/sema/ingest/`: cBioPortal + OMOP parsers, DuckDB staging, Databricks push - `src/sema/cli_ingest.py`: click group wiring `ingest` and `push` commands - `src/sema/models/config.py`: `IngestConfig`, `IngestOmopConfig`, `IngestDatabricksTargetConfig` with env-prefix settings - `pyproject.toml`: adds `duckdb>=1.0.0` and `pyarrow>=14.0.0` runtime deps - `.env.example`: documents INGEST_* env vars Unit coverage across parsers, staging lifecycle, Databricks bridge provisioning, and CLI wiring (63 tests). Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * chore: gitignore .wolf/ OpenWolf context directory Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * feat: add sema eval CLI for dev-slice runner, diff, and milestone report Makes rollout steps 2–6 of source-semantic-hardening executable with a single command per step. Per-table assertion + telemetry dumps are written when `eval_dump_dir` is set on BuildConfig; `slice_tables` filters discovered work items to a named subset. - `sema eval run --slice <yaml> --label <l> --output-dir <d>`: runs the pipeline on the slice and writes `<table>__<label>.json` + paired `<table>__<label>__telemetry.json` per table. - `sema eval diff --baseline <d> --current <d>`: pairs dumps by table and aggregates semantic churn using the existing `diff_dumps` keyed on `(subject_ref, predicate)` — covers L2 and L3 assertions. - `sema eval report --run <d> --label <l> [--baseline <d>]`: aggregates telemetry across tables and, if a baseline dir is given, folds in the churn summary — produces a milestone-ready JSON report. Wiring: - `BuildConfig` gains `eval_dump_dir`, `eval_config_label`, `slice_tables`. - `process_table` accepts `eval_dump_dir`/`eval_config_label` and calls the dump hook after successful runs; failures to dump are logged, not raised. - `_run_pipeline_stages` now returns `(assertions, staged_output)` so `process_table` can extract telemetry without a second pass. - `_discover_tables` filters via `_filter_work_items_to_slice` when `slice_tables` is set. Runbook at `docs/runbooks/source-semantic-eval.md` lists the exact commands for tasks 5.6–5.7, 7.7–7.8, 8.7–8.9, 9.8–9.9, and 10.1–10.5. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * fix: populate real latency and token telemetry in staged L2 pipeline TableTelemetry previously carried zeros for stage_*_latency_ms and tokens_*. The staged engine's LLM calls were not measured and the kwargs to from_stages() were never passed. Eval runs reported avg_total_latency_ms: 0.0 and made the $0.10/table cost gate and 60s/table latency gate unverifiable. - `LLMClient` gains an `InvocationStats` dataclass populated on every `invoke()`: wall-clock duration in ns, prompt/response char counts, and prompt/completion token counts (pulled from `usage_metadata` / `response_metadata` when present, else a ~4 chars/token estimate). - `SemanticEngine.interpret_table_staged_with_metrics()` wraps the client's `invoke` for the duration of a staged run so that every batched Stage B call and every Stage C column call contributes to the table's `StageMetrics` (tokens per stage via accumulation, latency per stage via `time.monotonic_ns` bookends). - `interpret_table_staged` preserved as a thin wrapper so existing callers are unchanged. - `build_utils._run_semantic_interpretation` now threads the metrics through to `TableTelemetry.from_stages()`. Verified end-to-end on 6-table dev slice: avg_total_latency_ms = 28533, tokens input = 20983, tokens output = 23417 — well under the cost/latency gates. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * chore(eval): add dev_slice_poc.yaml matching current Databricks cBioPortal ingest Six tables actually loaded in workspace.cbioportal (patient, sample, mutation, timeline_sample_acquisition, timeline_status, timeline_treatment) — a subset of the 13 tables in dev_slice.yaml. Used for initial rollout evaluation runs until full ingest lands. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * fix: sanitize LLM-leaked type suffix from Stage B column names Caught in the step-2 dev-slice eval: Stage B occasionally returned column names with an embedded type spec (e.g. 'BIOTYPE (STRING)', 'age [INT]', 'patient_id: VARCHAR'). The merge step uses the column field verbatim to build '<table_ref>.<col>' subject refs, so a single noisy column silently severed the link between L2 property assertions and the extractor's COLUMN_EXISTS assertions. The downstream effect was a 'regression_risk' removal in the diff tool. Adds `sanitize_column_name` (strips the first whitespace / paren / bracket / colon onward) and applies it to every StageBColumnResult returned by `_invoke_stage_b_batch` before it reaches the merge or vocab-context builder. LLM non-determinism occasionally skips the leak entirely (step 3 domain-aware had zero) but the fix is cheap insurance and costs nothing on clean output. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * fix(few-shot): add synonyms to Stage B examples and compact JSON format Step 4 dev-slice eval showed 52 alias regressions vs step 3 (domain- aware). Root cause: none of the 12 Stage B few-shot examples in few_shot.py populated a `synonyms` field. The LLM imitated the examples' empty-by-omission pattern and dropped aliases that step 3 was emitting. Changes: - Add realistic `synonyms` lists to 8 of 12 Stage B examples (patient_id, sample_id, gender, tmb, msi_type, hugo_symbol, variant_classification, agent, stage_highest). Examples without synonyms remain to demonstrate empty-is-valid. - Switch `format_examples` to compact JSON (no indent) — recoups most of the token cost added by the synonyms. Measured impact on 6-table dev slice: - Alias regression 52 → 16 (−69%) - Output tokens 22,935 → 23,566 (+631, LLM restored alias emission) - Input tokens 41,623 → 41,148 (−475, compact JSON) - All 6 tables still B_SUCCESS with 100% coverage The +17k input token bump from enabling few-shot in step 4 is the fixed cost of including the full Stage A+B+C blocks in each of 18+ LLM calls per slice run — not a bug, just the price of few-shot. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * eval: add dev-slice rollout artifacts for steps 2–5 Versions the per-table assertion dumps, telemetry dumps, diff reports, and milestone reports produced during the source-semantic-hardening rollout. These back up the task completion claims in openspec/changes/source-semantic-hardening/tasks.md (which is in a gitignored path) and serve as a reference baseline for future evaluation runs. Contents of eval-runs/: - step2-baseline-single-pass/ # pre-decomposition reference - step2-staged-zeroshot/ # A→B decomposition, zero-shot - step3-domain-aware/ # + domain bias / type inventory / vocab hints - step4-few-shot/ # + healthcare few-shot (post alias-fix) - step5-stage-c/ # + Stage C value decoding (full pipeline) - step{2,3,4,5}-diff.json # churn summaries vs prior step - step{2,3,4,5}-report.json # per-step milestone reports - end-to-end-diff.json # baseline → full pipeline delta Scope: the 6-table POC slice (eval/dev_slice_poc.yaml) reflecting current Databricks ingest. Holdout and full-corpus runs blocked on ingest of the remaining 27 cBioPortal tables — see §11-bis in tasks.md. eval-runs/*.log added to .gitignore (transient runtime output). Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * feat(ingest): add cBioPortal SV, CNA, gene-panel-matrix, and resource parsers Extends the cBioPortal ingest to cover five new file types, unlocking the remaining dev-slice tables (structural_variant, cna, gene_panel_matrix, resource_definition/patient, clinical_supp_*). New parsers: - parse_sv_file — data_sv.txt → structural_variant (position/ entrez-gene-id columns typed as BIGINT via sv_column_type helper) - parse_cna_file — data_cna.txt (gene×sample wide matrix) pivoted to long format with sample_id / hugo_symbol / entrez_gene_id / cna_value. Blank cells become nulls. cna_long_format_rows helper lives in cbioportal_utils.py. - parse_gene_panel_matrix — data_gene_panel_matrix.txt as-is - parse_resource_file — data_resource_*.txt (definition and per-patient/sample entries) Ingest orchestration: - _should_download now allows data_sv.txt, data_cna.txt, data_gene_panel_matrix.txt, data_resource_*, data_clinical_supp_* via DOWNLOAD_EXACT_FILENAMES / DOWNLOAD_PREFIXES / EXCLUDED_DOWNLOAD_PREFIXES constants in cbioportal_utils.py - SKIP_FILENAME_PATTERNS narrowed to only truly unsupported matrix files (expression, methylation, log2/linear/armlevel CNA, mrna, rppa) - _ingest_study_dir wires three new fixed-file parsers (_try_ingest_fixed_files) plus prefix-matched passes for data_resource_* and data_clinical_supp_* (_ingest_prefix_matched_files) Verified end-to-end against gbm_tcga_pan_can_atlas_2018: DuckDB now holds 12 cbioportal tables including cna (14.4M long-format rows pivoted from ~24k genes × ~600 samples), structural_variant (510 rows), gene_panel_matrix, resource_definition/patient, and clinical_supp_hypoxia. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * eval: expand dev slice to 12 tables, re-run full pipeline on GBM ingest Now that the cBioPortal ingest has been extended to cover SV, CNA, gene-panel matrix, resources, and clinical supplements, the dev slice grows from the original 6-table POC (patient, sample, mutation, 3 timelines) to 12 tables sourced from gbm_tcga_pan_can_atlas_2018. Full A→B→C staged pipeline results on all 12 tables: - 12/12 B_SUCCESS, 100% raw and critical coverage across every table - 0 retries, 0 splits, 0 rescues — zero recovery overhead - 69 Stage C calls → 195 has_decoded_value assertions - 259 has_property_name assertions (up from 222 on the 6-table slice) - Avg latency 25.2s / table (peak 105s on mutation's 114 columns, still under the 60s gate) - Total cost $0.0160 for all 12 tables ($0.0013/table — 77× under the $0.10/table gate) Spot-checks on the four new table types: - structural_variant: correct entity "Structural Variant" with grain "one row per structural variant ... per sample"; Stage C correctly decoded in-frame vs frameshift mutation semantics - cna (long format): 4 columns classified as sample_id / hugo_symbol / entrez_gene_id / cna_value, one Stage C call - gene_panel_matrix, resource_definition, resource_patient: all identifier-heavy tables classified as expected Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * refactor: remove deprecated single-pass and two-pass L2 code (Task 11) The staged A→B→C pipeline is proven on the 12-table dev slice and becomes the sole L2 path. Ripping out everything the rollout kept around through step 6. Removed from src/sema/engine/semantic.py: - PropertyInterpretation and TableInterpretation (old response schemas) - _PropertyBatchResult (two-pass batch schema) - build_interpretation_prompt, build_simplified_interpretation_prompt - build_summary_prompt, build_property_prompt - _needs_two_pass, _interpret_two_pass - _interpret_via_llm_client, _interpret_via_raw_llm - _run_summary_pass, _run_property_pass - _entity_assertions, _property_assertions, _interpretation_to_assertions - SemanticEngine.__init__(..., llm=...) raw-LLM legacy kwarg Removed files: - src/sema/engine/semantic_utils.py (entire file — all legacy helpers) - tests/unit/test_two_pass_semantic.py (legacy path tests) Reshaped: - SemanticEngine.interpret_table now delegates to interpret_table_staged_with_metrics and returns just assertions — one staged path for every table regardless of width - pipeline.build_utils._run_semantic_interpretation drops the use_staged branch; always returns (assertions, _StagedOutput) - pipeline.build._run_pipeline_stages returns (assertions, staged_output) unconditionally - process_table, _spawn_workers*, and BuildConfig lose the use_staged flag; cli_eval drops --use-staged/--no-use-staged - Tests updated to mock the staged sequence (StageAResult + StageBBatchResult) instead of TableInterpretation Test suite: 1004/1004 passing, mypy clean on 94 source files. Test count dropped from 1041 → 1004 (the 37 removed tests all exercised the deprecated legacy path). Follow-up not addressed here: semantic.py (520) and build_utils.py (508) both exceed the project's 400-line file standard. They were already over (745 and 514 pre-cleanup). Splitting them is a separate refactor — the simplest next step is extracting interpret_table_staged_with_metrics + the Stage A/B/C runners into stage_utils.py, which shaves ~200 lines from semantic.py. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * eval: post-cleanup sanity run on 12-table slice Validates that Task 11 refactor (legacy L2 code removal, 17-file diff, -1494 LOC) did not alter pipeline behavior. Results vs pre-cleanup step 5 v2: - 12/12 tables B_SUCCESS, 100% coverage, zero recovery - 259 has_property_name (identical to pre-cleanup) - 12 has_entity_name (identical) - 140 vs 195 has_decoded_value, 62 vs 69 Stage C calls — stochastic LLM variation well within run-to-run noise - Cost /bin/zsh.016 identical; latency 278s vs 302s (8% faster, noise) - Diff: 23 added / 22 removed — symmetric, indicates zero regression Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * eval: verification run after Neo4j wipe + Task 11 cleanup Full staged pipeline on the 12-table slice, Neo4j wiped first. Pipeline: - 12/12 tables B_SUCCESS @ 100% raw and critical coverage, zero recovery - 12 entities, 259 properties, 174 decoded values, 81 Stage C calls - 285s total / 23.8s avg, tokens 73,346 in + 34,614 out - Cost $0.0159 ($0.0013/table, $0.00006/column) — 77× under budget Neo4j state (3,755 nodes after materialization): - Catalog/Schema/DataSource: 1 each - Table: 12 ✓ - Entity: 12 (semantically correct: 'Biospecimen/Sample', 'Copy Number Alteration', 'Somatic Mutation', 'Structural Variant', 'Patient Hypoxia Assessment', 'Patient Status Event', 'Sample Acquisition Event', 'Sample Genomic Profile Availability', 'Treatment Event', etc.) - Column: 259 ✓ - Property: 259 ✓ - ValueSet: 150 / Term: 290 (from Stage C) - Alias: 452 / Vocabulary: 143 (from L3) - Assertion provenance: 2,175 - Edges: HAS_PROPERTY, PROPERTY_ON_COLUMN, ENTITY_ON_TABLE, CLASSIFIED_AS, HAS_VALUE_SET, MEMBER_OF, REFERS_TO — all present Diff vs pre-cleanup baseline (step5-stage-c-v2): - 45 added, 24 removed, 678 changed - Added: 18 aliases + 27 decoded values (Stage C picked more columns) - Removed: 14 decoded values + 10 aliases (LLM variation) - Zero high-value regressions (no property_name / semantic_type / entity_name losses) Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * docs(eval): add step 6 milestone summary for 12-table POC slice Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * refactor: extract cBioPortal ingest and slice YAMLs to showcase/cbioportal_to_omop/ Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * feat(few-shot): add generic base layer and split domain packs into modules Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> * ci: drop single-entry python matrix so test context matches branch rule Signed-off-by: deanban <3989225+deanban@users.noreply.github.com> --------- Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

deanban added 23 commits April 21, 2026 13:02

feat: add DomainContext model, CLI flag, and pipeline wiring

37178f9

Domain precedence: CLI > config > profiler > default. Profiler evidence preserved when CLI/config overrides. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

test: add Stage C trigger, execution, merge, and partial failure tests

7beea0a

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

chore: gitignore .wolf/ OpenWolf context directory

b2577f7

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

docs(eval): add step 6 milestone summary for 12-table POC slice

a047c78

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

refactor: extract cBioPortal ingest and slice YAMLs to showcase/cbiop…

88849d8

…ortal_to_omop/ Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

feat(few-shot): add generic base layer and split domain packs into mo…

6389a57

…dules Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

This was referenced Apr 21, 2026

refactor: extract cBioPortal to showcase/cbioportal_to_omop/ #75

Closed

feat: generic few-shot base layer with domain pack composition #76

Closed

ci: drop single-entry python matrix so test context matches branch rule

4cb42e1

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

deanban merged commit f946eed into main Apr 22, 2026
3 checks passed

deanban deleted the dean/refactor/cbioportal-showcase-extract branch April 22, 2026 00:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: extract cBioPortal to showcase/ and add generic few-shot base#74

refactor: extract cBioPortal to showcase/ and add generic few-shot base#74
deanban merged 24 commits into
mainfrom
dean/refactor/cbioportal-showcase-extract

deanban commented Apr 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

deanban commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Stacks on #63

What moved

Packaging

Generic few-shot base

Test plan

Follow-ups, not in this PR

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

deanban commented Apr 21, 2026 •

edited

Loading