
refactor: extract cBioPortal to showcase/ and add generic few-shot base#74

Merged
deanban merged 24 commits into main from dean/refactor/cbioportal-showcase-extract
Apr 22, 2026

@deanban deanban commented Apr 21, 2026

Closes #75, #76.

Follow-ups tracked separately: #77, #78, #79.


Summary

Two structural changes that set up SEMA OpenCore for additional sources/domains without rewriting the platform:

  1. showcase/cbioportal_to_omop/ — moves the cBioPortal parser + slice YAMLs + their tests out of src/sema/ingest/ into a dedicated showcase folder. src/sema/ingest/ now holds only generic infrastructure (DuckDB staging, Databricks push, OMOP reference target).
  2. Generic few-shot base — splits few_shot.py into a thin registry + two domain packs (few_shot_generic.py, few_shot_healthcare.py). format_examples() now composes generic + {domain} so every prompt gets industry-agnostic archetypes plus whatever domain-specific guidance exists. Healthcare-first becomes healthcare-on-top-of-generic.

Stacks on #63

Targets main. Based on the HEAD of dean/feat/source-semantic-hardening, so the diff will be clean once #63 merges.

What moved

| from | to |
| --- | --- |
| src/sema/ingest/cbioportal.py | showcase/cbioportal_to_omop/parsers.py |
| src/sema/ingest/cbioportal_utils.py | showcase/cbioportal_to_omop/cbioportal_utils.py |
| eval/dev_slice.yaml | showcase/cbioportal_to_omop/slices/dev_slice.yaml |
| eval/dev_slice_poc.yaml | showcase/cbioportal_to_omop/slices/dev_slice_poc.yaml |
| eval/holdout.yaml | showcase/cbioportal_to_omop/slices/holdout.yaml |
| tests/unit/test_cbioportal_parsers.py | tests/showcase/cbioportal_to_omop/test_parsers.py |
| tests/unit/test_cbioportal_extended_parsers.py | tests/showcase/cbioportal_to_omop/test_extended_parsers.py |

All moves use git mv so history is preserved.

Packaging

  • showcase/ is not part of the installable sema package — hatchling still only picks up src/sema/.
  • From a source checkout (uv run sema ingest cbioportal ...) the showcase is importable because the project root is on sys.path.
  • cli_ingest.py lazy-imports showcase.cbioportal_to_omop.parsers inside the command; if the showcase isn't importable, the user gets a clear ClickException instead of a startup ImportError.
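
The lazy-import behavior described in the last bullet could look roughly like the sketch below. The helper name `load_showcase_parsers` and the error wording are assumptions for illustration, not the code in cli_ingest.py; the fallback `ClickException` class exists only so the sketch runs where click is not installed.

```python
import importlib

try:
    from click import ClickException
except ImportError:  # stand-in so the sketch runs without click installed
    class ClickException(Exception):
        def __init__(self, message: str) -> None:
            super().__init__(message)
            self.message = message


def load_showcase_parsers(module_name: str = "showcase.cbioportal_to_omop.parsers"):
    """Import the showcase parser module only when the command actually runs.

    Hypothetical helper: deferring the import keeps `sema` startup working
    even when showcase/ is absent (for example, in an installed wheel).
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Convert the low-level failure into a user-facing CLI error.
        raise ClickException(
            f"{module_name} is not importable ({exc}); "
            "run from a source checkout so the project root is on sys.path."
        ) from exc
```

Calling the helper inside the command body, rather than at module top level, is what turns a missing showcase into a per-command error instead of a startup failure.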

Generic few-shot base

New few_shot_generic.py covers archetypes that appear across every industry:

  • Stage A (5 examples): event-stream, transaction-N-per-parent, dimension, bridge, wide-profile tables.
  • Stage B (8 examples): identifier (PK/FK), temporal, numeric, categorical-encoded, free text, boolean, ordinal.
  • Stage C (4 examples): status labels, Y/N flags, prefix-encoded codes (0:SUCCESS), ordinal ranking.

format_examples(domain, stage) now composes generic + {domain} instead of looking up a single domain key. Behavior changes:

  • format_examples(None, "A") previously returned "" (zero-shot). Now returns the generic base block — every prompt gets some teaching signal.
  • format_examples("healthcare", "B") previously emitted 12 healthcare examples. Now emits 8 generic + 12 healthcare = 20 composed examples. Token budget for Stage B raised from 1200 → 2100 to match.
  • get_examples(domain, stage) is unchanged — still returns the per-domain array only. compose_examples(domain, stage) is the new function for composed retrieval.
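
A minimal sketch of the composition described above, assuming a dict-based registry and a generic-before-domain ordering. The `_PACKS` contents, the example fields, and the handling of `None` in `get_examples` are hypothetical; only the three function names come from this PR.

```python
import json
from typing import Optional

# Hypothetical registry contents; the real packs live in few_shot_generic.py
# and few_shot_healthcare.py and are much larger.
_PACKS: dict[str, dict[str, list[dict]]] = {
    "generic": {"B": [{"column": "created_at", "semantic_type": "temporal"}]},
    "healthcare": {"B": [{"column": "patient_id", "semantic_type": "identifier"}]},
}


def get_examples(domain: Optional[str], stage: str) -> list[dict]:
    """Per-domain lookup only: no generic base is mixed in."""
    if domain is None:
        return []  # assumption: no domain means no per-domain examples
    return list(_PACKS.get(domain, {}).get(stage, []))


def compose_examples(domain: Optional[str], stage: str) -> list[dict]:
    """Generic base first, then any domain-specific examples on top."""
    composed = list(_PACKS["generic"].get(stage, []))
    if domain and domain != "generic":
        composed += get_examples(domain, stage)
    return composed


def format_examples(domain: Optional[str], stage: str) -> str:
    """Render the composed examples as compact JSON, one example per line."""
    return "\n".join(
        json.dumps(ex, separators=(",", ":"))
        for ex in compose_examples(domain, stage)
    )
```

With this shape, `format_examples(None, "B")` yields the generic block rather than an empty string, while `get_examples` keeps returning only the per-domain array.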

Test plan

  • uv run pytest — 1008 passed, 1 skipped
  • uv run mypy src/sema/ — clean on 94 source files
  • uv run pytest --cov=sema --cov-report=term — 87% coverage
  • CLI test updated: sema ingest cbioportal still exits 0 against the mocked showcase parser
  • Holdout disjointness test updated to new slice path
  • Added tests for the generic archetypes (Stage A/B/C) and the composition ordering

Follow-ups, not in this PR

  • Port the structural variant detection / fusion-partner few-shot once a second genomics source lands; right now the example still implies cBioPortal shapes.
  • Extract cBioPortal to a separate sema-cbioportal package with pip install sema[cbioportal] extras when a second source adapter shows up.
  • Populate vocab-family hints + semantic type inventory for financial, real_estate, logistics in domain_prompts.py (currently header-only).

deanban added 23 commits April 21, 2026 13:02
Domain precedence: CLI > config > profiler > default.
Profiler evidence preserved when CLI/config overrides.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
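
The precedence rule in the commit above (CLI > config > profiler > default, with profiler evidence preserved on override) could be sketched as follows. `ResolvedDomain`, `resolve_domain`, and the `"generic"` default are hypothetical names chosen for illustration, not the DomainContext model itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedDomain:
    domain: str
    source: str                        # "cli", "config", "profiler", or "default"
    profiler_evidence: Optional[str]   # kept even when a higher source wins


def resolve_domain(
    cli: Optional[str],
    config: Optional[str],
    profiler: Optional[str],
    default: str = "generic",          # assumed default value
) -> ResolvedDomain:
    """Pick the first non-empty source in precedence order."""
    for source, value in (("cli", cli), ("config", config), ("profiler", profiler)):
        if value:
            return ResolvedDomain(value, source, profiler)
    return ResolvedDomain(default, "default", profiler)
```

Carrying `profiler_evidence` on the result, rather than discarding it when the CLI or config wins, leaves the profiler's guess available for later inspection.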
StageAResult, StageBColumnResult, StageBBatchResult, StageCResult,
StageCBatchResult, StageStatus, UnresolvedColumn models.
Stage A/B prompt builders with domain context slots.
Critical column identification, coverage computation, B pass/fail logic.
Stage C deterministic trigger with low-cardinality fallback.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
interpret_table_staged() runs full A→B→C→merge.
Merge ownership matrix: A=entity, B=property, C=decoded values.
Bounded B recovery: retry, split, Tier 1 rescue.
semantic_unresolved produced for low-confidence ambiguous columns.
VocabColumnContext enriched with B output at version 1.
use_staged=True default with PromptLayers rollout flags.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
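
One way to read the merge ownership matrix in the commit above (A=entity, B=property, C=decoded values) is as a per-stage allow-list of predicates. The sketch below assumes assertions are plain dicts with a `predicate` key and uses predicate names that appear elsewhere in this PR; the real merge logic may differ.

```python
# Which stage is allowed to contribute which predicates (assumed mapping).
OWNERSHIP: dict[str, set[str]] = {
    "A": {"has_entity_name"},
    "B": {"has_property_name"},
    "C": {"has_decoded_value"},
}


def merge_stage_outputs(stage_outputs: dict[str, list[dict]]) -> list[dict]:
    """Keep only the assertions each stage owns, in A, B, C order."""
    merged: list[dict] = []
    for stage in ("A", "B", "C"):
        for assertion in stage_outputs.get(stage, []):
            if assertion["predicate"] in OWNERSHIP[stage]:
                merged.append(assertion)
    return merged
```

Under this reading, a property name emitted by Stage A is dropped at merge time because only Stage B owns that predicate.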
Assertion dump/load for checkpoint comparison.
Structured diff with regression flagging.
TableTelemetry/PipelineTelemetry with milestone report builder.
13-table dev slice and 10-table holdout definitions.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Domain bias header with conflict handling.
Healthcare/generic semantic type inventories.
Vocabulary family hints for healthcare domain.
5 Stage A, 12 Stage B, 8 Stage C few-shot examples.
Holdout disjointness verified.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Adds `sema ingest` and `sema push` subcommands backed by a DuckDB staging
area. Parses cBioPortal clinical/MAF/timeline files and OMOP CDM DDL +
vocabulary CSVs into DuckDB, then pushes to Databricks via Arrow with
optional COPY INTO when a cloud staging URI is configured.

- `src/sema/ingest/`: cBioPortal + OMOP parsers, DuckDB staging, Databricks push
- `src/sema/cli_ingest.py`: click group wiring `ingest` and `push` commands
- `src/sema/models/config.py`: `IngestConfig`, `IngestOmopConfig`,
  `IngestDatabricksTargetConfig` with env-prefix settings
- `pyproject.toml`: adds `duckdb>=1.0.0` and `pyarrow>=14.0.0` runtime deps
- `.env.example`: documents INGEST_* env vars

Unit coverage across parsers, staging lifecycle, Databricks bridge
provisioning, and CLI wiring (63 tests).

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Makes rollout steps 2–6 of source-semantic-hardening executable with a
single command per step. Per-table assertion + telemetry dumps are
written when `eval_dump_dir` is set on BuildConfig; `slice_tables`
filters discovered work items to a named subset.

- `sema eval run --slice <yaml> --label <l> --output-dir <d>`: runs the
  pipeline on the slice and writes `<table>__<label>.json` + paired
  `<table>__<label>__telemetry.json` per table.
- `sema eval diff --baseline <d> --current <d>`: pairs dumps by table
  and aggregates semantic churn using the existing `diff_dumps` keyed
  on `(subject_ref, predicate)` — covers L2 and L3 assertions.
- `sema eval report --run <d> --label <l> [--baseline <d>]`: aggregates
  telemetry across tables and, if a baseline dir is given, folds in the
  churn summary — produces a milestone-ready JSON report.

Wiring:
- `BuildConfig` gains `eval_dump_dir`, `eval_config_label`,
  `slice_tables`.
- `process_table` accepts `eval_dump_dir`/`eval_config_label` and calls
  the dump hook after successful runs; failures to dump are logged, not
  raised.
- `_run_pipeline_stages` now returns `(assertions, staged_output)` so
  `process_table` can extract telemetry without a second pass.
- `_discover_tables` filters via `_filter_work_items_to_slice` when
  `slice_tables` is set.

Runbook at `docs/runbooks/source-semantic-eval.md` lists the exact
commands for tasks 5.6–5.7, 7.7–7.8, 8.7–8.9, 9.8–9.9, and 10.1–10.5.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
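
The diff keyed on `(subject_ref, predicate)` described in the commit above can be illustrated with a small stand-alone function. This is not the existing `diff_dumps`; the function name and the `value` field are assumptions made so the idea is runnable.

```python
def diff_assertion_dumps(baseline: list[dict], current: list[dict]) -> dict[str, list]:
    """Compare two assertion dumps keyed on (subject_ref, predicate)."""

    def index(dump: list[dict]) -> dict[tuple, object]:
        return {(a["subject_ref"], a["predicate"]): a["value"] for a in dump}

    old, new = index(baseline), index(current)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Keying on the pair rather than on the value means a reworded property name shows up as one "changed" entry instead of an unrelated add and remove.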
TableTelemetry previously carried zeros for stage_*_latency_ms and
tokens_*. The staged engine's LLM calls were not measured and the
kwargs to from_stages() were never passed. Eval runs reported
avg_total_latency_ms: 0.0 and made the $0.10/table cost gate and
60s/table latency gate unverifiable.

- `LLMClient` gains an `InvocationStats` dataclass populated on every
  `invoke()`: wall-clock duration in ns, prompt/response char counts,
  and prompt/completion token counts (pulled from `usage_metadata` /
  `response_metadata` when present, else a ~4 chars/token estimate).
- `SemanticEngine.interpret_table_staged_with_metrics()` wraps the
  client's `invoke` for the duration of a staged run so that every
  batched Stage B call and every Stage C column call contributes to
  the table's `StageMetrics` (tokens per stage via accumulation,
  latency per stage via `time.monotonic_ns` bookends).
- `interpret_table_staged` preserved as a thin wrapper so existing
  callers are unchanged.
- `build_utils._run_semantic_interpretation` now threads the metrics
  through to `TableTelemetry.from_stages()`.

Verified end-to-end on 6-table dev slice: avg_total_latency_ms = 28533,
tokens input = 20983, tokens output = 23417 — well under the
cost/latency gates.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
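
A rough sketch of the measurement described in the commit above: `time.monotonic_ns` bookends around the call, with a 4-characters-per-token fallback when the provider reports no usage. The `invoke_with_stats` wrapper, the callable shape of `llm`, and the `input_tokens` / `output_tokens` keys are assumptions; only the `InvocationStats` name and the fallback ratio come from the commit message.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class InvocationStats:
    duration_ns: int
    prompt_chars: int
    response_chars: int
    prompt_tokens: int
    completion_tokens: int


def estimate_tokens(text: str) -> int:
    """Fallback heuristic: roughly 4 characters per token."""
    return len(text) // 4


def invoke_with_stats(
    llm: Callable[[str], Tuple[str, Optional[dict]]], prompt: str
) -> Tuple[str, InvocationStats]:
    """Call `llm` and record latency plus reported or estimated token counts."""
    start = time.monotonic_ns()  # monotonic clock is unaffected by wall-clock jumps
    text, usage = llm(prompt)
    duration_ns = time.monotonic_ns() - start
    usage = usage or {}
    stats = InvocationStats(
        duration_ns=duration_ns,
        prompt_chars=len(prompt),
        response_chars=len(text),
        prompt_tokens=usage.get("input_tokens", estimate_tokens(prompt)),
        completion_tokens=usage.get("output_tokens", estimate_tokens(text)),
    )
    return text, stats
```

Summing these per-call stats by stage is what would let a table-level telemetry record report non-zero latency and token totals.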
…ortal ingest

Six tables actually loaded in workspace.cbioportal (patient, sample,
mutation, timeline_sample_acquisition, timeline_status,
timeline_treatment) — a subset of the 13 tables in dev_slice.yaml.
Used for initial rollout evaluation runs until full ingest lands.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Caught in the step-2 dev-slice eval: Stage B occasionally returned
column names with an embedded type spec (e.g. 'BIOTYPE (STRING)',
'age [INT]', 'patient_id: VARCHAR'). The merge step uses the column
field verbatim to build '<table_ref>.<col>' subject refs, so a single
noisy column silently severed the link between L2 property
assertions and the extractor's COLUMN_EXISTS assertions. The
downstream effect was a 'regression_risk' removal in the diff tool.

Adds `sanitize_column_name` (strips the first whitespace / paren /
bracket / colon onward) and applies it to every StageBColumnResult
returned by `_invoke_stage_b_batch` before it reaches the merge or
vocab-context builder. LLM non-determinism occasionally skips the
leak entirely (step 3 domain-aware had zero) but the fix is cheap
insurance and costs nothing on clean output.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
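
The stripping rule described in the commit above (drop everything from the first whitespace, paren, bracket, or colon onward) fits in a single regular expression. This is a sketch of that described behavior, not the committed `sanitize_column_name`; the leading `strip()` is an added assumption.

```python
import re

# Match from the first whitespace, "(", "[" or ":" through the end of the string.
_TYPE_SUFFIX = re.compile(r"[\s(\[:].*$", re.DOTALL)


def sanitize_column_name(raw: str) -> str:
    """Remove a leaked type spec such as ' (STRING)', ' [INT]' or ': VARCHAR'."""
    return _TYPE_SUFFIX.sub("", raw.strip())
```

Applied to the three leaked forms quoted in the commit message, this returns `BIOTYPE`, `age`, and `patient_id`, and it leaves an already clean name such as `hugo_symbol` untouched.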
Step 4 dev-slice eval showed 52 alias regressions vs step 3 (domain-
aware). Root cause: none of the 12 Stage B few-shot examples in
few_shot.py populated a `synonyms` field. The LLM imitated the
examples' empty-by-omission pattern and dropped aliases that step 3
was emitting.

Changes:
- Add realistic `synonyms` lists to 8 of 12 Stage B examples
  (patient_id, sample_id, gender, tmb, msi_type, hugo_symbol,
  variant_classification, agent, stage_highest). Examples without
  synonyms remain to demonstrate empty-is-valid.
- Switch `format_examples` to compact JSON (no indent) — recoups most
  of the token cost added by the synonyms.

Measured impact on 6-table dev slice:
- Alias regression 52 → 16 (−69%)
- Output tokens 22,935 → 23,566 (+631, LLM restored alias emission)
- Input tokens 41,623 → 41,148 (−475, compact JSON)
- All 6 tables still B_SUCCESS with 100% coverage

The +17k input token bump from enabling few-shot in step 4 is the
fixed cost of including the full Stage A+B+C blocks in each of 18+
LLM calls per slice run — not a bug, just the price of few-shot.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
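
The compact-JSON change in the commit above trades readability for prompt size. The snippet below shows the effect on one made-up Stage B example; the field names and synonym values are illustrative only.

```python
import json

# Hypothetical few-shot example; the real entries have more fields.
example = {
    "column": "patient_id",
    "semantic_type": "identifier",
    "synonyms": ["subject id", "case id"],
}

pretty = json.dumps(example, indent=2)                 # indented, multi-line form
compact = json.dumps(example, separators=(",", ":"))   # no indent, no padding spaces

# Both strings decode to the same object; only the whitespace differs.
saved_chars = len(pretty) - len(compact)
```

Because every few-shot block is repeated in each LLM call, the per-example savings multiply across the run.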
Versions the per-table assertion dumps, telemetry dumps, diff reports,
and milestone reports produced during the source-semantic-hardening
rollout. These back up the task completion claims in
openspec/changes/source-semantic-hardening/tasks.md (which is in a
gitignored path) and serve as a reference baseline for future
evaluation runs.

Contents of eval-runs/:
- step2-baseline-single-pass/  # pre-decomposition reference
- step2-staged-zeroshot/        # A→B decomposition, zero-shot
- step3-domain-aware/           # + domain bias / type inventory / vocab hints
- step4-few-shot/               # + healthcare few-shot (post alias-fix)
- step5-stage-c/                # + Stage C value decoding (full pipeline)
- step{2,3,4,5}-diff.json       # churn summaries vs prior step
- step{2,3,4,5}-report.json     # per-step milestone reports
- end-to-end-diff.json          # baseline → full pipeline delta

Scope: the 6-table POC slice (eval/dev_slice_poc.yaml) reflecting
current Databricks ingest. Holdout and full-corpus runs blocked on
ingest of the remaining 27 cBioPortal tables — see §11-bis in
tasks.md.

eval-runs/*.log added to .gitignore (transient runtime output).

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
… parsers

Extends the cBioPortal ingest to cover five new file types, unlocking
the remaining dev-slice tables (structural_variant, cna,
gene_panel_matrix, resource_definition/patient, clinical_supp_*).

New parsers:
- parse_sv_file — data_sv.txt → structural_variant (position/
  entrez-gene-id columns typed as BIGINT via sv_column_type helper)
- parse_cna_file — data_cna.txt (gene×sample wide matrix) pivoted to
  long format with sample_id / hugo_symbol / entrez_gene_id /
  cna_value. Blank cells become nulls. cna_long_format_rows helper
  lives in cbioportal_utils.py.
- parse_gene_panel_matrix — data_gene_panel_matrix.txt as-is
- parse_resource_file — data_resource_*.txt (definition and
  per-patient/sample entries)

Ingest orchestration:
- _should_download now allows data_sv.txt, data_cna.txt,
  data_gene_panel_matrix.txt, data_resource_*, data_clinical_supp_*
  via DOWNLOAD_EXACT_FILENAMES / DOWNLOAD_PREFIXES /
  EXCLUDED_DOWNLOAD_PREFIXES constants in cbioportal_utils.py
- SKIP_FILENAME_PATTERNS narrowed to only truly unsupported matrix
  files (expression, methylation, log2/linear/armlevel CNA, mrna,
  rppa)
- _ingest_study_dir wires three new fixed-file parsers
  (_try_ingest_fixed_files) plus prefix-matched passes for
  data_resource_* and data_clinical_supp_*
  (_ingest_prefix_matched_files)

Verified end-to-end against gbm_tcga_pan_can_atlas_2018: DuckDB now
holds 12 cbioportal tables including cna (14.4M long-format rows
pivoted from ~24k genes × ~600 samples), structural_variant (510
rows), gene_panel_matrix, resource_definition/patient, and
clinical_supp_hypoxia.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
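
The wide-to-long pivot described for data_cna.txt in the commit above can be sketched with the standard csv module. The function name `cna_long_rows` is hypothetical (the real helper is `cna_long_format_rows`), and the sketch assumes the usual `Hugo_Symbol` / `Entrez_Gene_Id` leading columns and integer cell values.

```python
import csv
import io
from typing import Iterator

_GENE_COLUMNS = ("Hugo_Symbol", "Entrez_Gene_Id")


def cna_long_rows(text: str) -> Iterator[dict]:
    """Yield one row per (gene, sample) cell from a tab-separated wide matrix."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    sample_ids = [c for c in (reader.fieldnames or []) if c not in _GENE_COLUMNS]
    for row in reader:
        for sample_id in sample_ids:
            cell = (row[sample_id] or "").strip()
            yield {
                "sample_id": sample_id,
                "hugo_symbol": row["Hugo_Symbol"],
                "entrez_gene_id": row["Entrez_Gene_Id"],
                # Blank cells become nulls; non-blank cells are assumed to be integers.
                "cna_value": int(cell) if cell else None,
            }
```

Each gene row fans out into one output row per sample, which is how roughly 24k genes by 600 samples becomes about 14.4M long-format rows.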
Now that the cBioPortal ingest has been extended to cover SV, CNA,
gene-panel matrix, resources, and clinical supplements, the dev slice
grows from the original 6-table POC (patient, sample, mutation, 3
timelines) to 12 tables sourced from gbm_tcga_pan_can_atlas_2018.

Full A→B→C staged pipeline results on all 12 tables:

- 12/12 B_SUCCESS, 100% raw and critical coverage across every table
- 0 retries, 0 splits, 0 rescues — zero recovery overhead
- 69 Stage C calls → 195 has_decoded_value assertions
- 259 has_property_name assertions (up from 222 on the 6-table slice)
- Avg latency 25.2s / table, under the 60s gate (the peak was 105s on
  mutation's 114 columns, which exceeds the gate for that one table)
- Total cost $0.0160 for all 12 tables ($0.0013/table — 77× under
  the $0.10/table gate)

Spot-checks on the four new table types:
- structural_variant: correct entity "Structural Variant" with grain
  "one row per structural variant ... per sample"; Stage C correctly
  decoded in-frame vs frameshift mutation semantics
- cna (long format): 4 columns classified as sample_id /
  hugo_symbol / entrez_gene_id / cna_value, one Stage C call
- gene_panel_matrix, resource_definition, resource_patient: all
  identifier-heavy tables classified as expected

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
The staged A→B→C pipeline is proven on the 12-table dev slice and
becomes the sole L2 path. Ripping out everything the rollout kept
around through step 6.

Removed from src/sema/engine/semantic.py:
- PropertyInterpretation and TableInterpretation (old response schemas)
- _PropertyBatchResult (two-pass batch schema)
- build_interpretation_prompt, build_simplified_interpretation_prompt
- build_summary_prompt, build_property_prompt
- _needs_two_pass, _interpret_two_pass
- _interpret_via_llm_client, _interpret_via_raw_llm
- _run_summary_pass, _run_property_pass
- _entity_assertions, _property_assertions,
  _interpretation_to_assertions
- SemanticEngine.__init__(..., llm=...) raw-LLM legacy kwarg

Removed files:
- src/sema/engine/semantic_utils.py (entire file — all legacy helpers)
- tests/unit/test_two_pass_semantic.py (legacy path tests)

Reshaped:
- SemanticEngine.interpret_table now delegates to
  interpret_table_staged_with_metrics and returns just assertions —
  one staged path for every table regardless of width
- pipeline.build_utils._run_semantic_interpretation drops the
  use_staged branch; always returns (assertions, _StagedOutput)
- pipeline.build._run_pipeline_stages returns
  (assertions, staged_output) unconditionally
- process_table, _spawn_workers*, and BuildConfig lose the
  use_staged flag; cli_eval drops --use-staged/--no-use-staged
- Tests updated to mock the staged sequence (StageAResult +
  StageBBatchResult) instead of TableInterpretation

Test suite: 1004/1004 passing, mypy clean on 94 source files.
Test count dropped from 1041 → 1004 (the 37 removed tests all
exercised the deprecated legacy path).

Follow-up not addressed here: semantic.py (520) and build_utils.py
(508) both exceed the project's 400-line file standard. They were
already over (745 and 514 pre-cleanup). Splitting them is a separate
refactor — the simplest next step is extracting
interpret_table_staged_with_metrics + the Stage A/B/C runners into
stage_utils.py, which shaves ~200 lines from semantic.py.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Validates that Task 11 refactor (legacy L2 code removal, 17-file diff,
-1494 LOC) did not alter pipeline behavior.

Results vs pre-cleanup step 5 v2:
- 12/12 tables B_SUCCESS, 100% coverage, zero recovery
- 259 has_property_name (identical to pre-cleanup)
- 12 has_entity_name (identical)
- 140 vs 195 has_decoded_value, 62 vs 69 Stage C calls — stochastic
  LLM variation well within run-to-run noise
- Cost $0.016 identical; latency 278s vs 302s (8% faster, noise)
- Diff: 23 added / 22 removed — symmetric, indicates zero regression

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Full staged pipeline on the 12-table slice, Neo4j wiped first.

Pipeline:
- 12/12 tables B_SUCCESS @ 100% raw and critical coverage, zero recovery
- 12 entities, 259 properties, 174 decoded values, 81 Stage C calls
- 285s total / 23.8s avg, tokens 73,346 in + 34,614 out
- Cost $0.0159 ($0.0013/table, $0.00006/column) — 77× under budget

Neo4j state (3,755 nodes after materialization):
- Catalog/Schema/DataSource: 1 each
- Table: 12 ✓
- Entity: 12 (semantically correct: 'Biospecimen/Sample', 'Copy Number
  Alteration', 'Somatic Mutation', 'Structural Variant', 'Patient Hypoxia
  Assessment', 'Patient Status Event', 'Sample Acquisition Event',
  'Sample Genomic Profile Availability', 'Treatment Event', etc.)
- Column: 259 ✓
- Property: 259 ✓
- ValueSet: 150 / Term: 290 (from Stage C)
- Alias: 452 / Vocabulary: 143 (from L3)
- Assertion provenance: 2,175
- Edges: HAS_PROPERTY, PROPERTY_ON_COLUMN, ENTITY_ON_TABLE,
  CLASSIFIED_AS, HAS_VALUE_SET, MEMBER_OF, REFERS_TO — all present

Diff vs pre-cleanup baseline (step5-stage-c-v2):
- 45 added, 24 removed, 678 changed
- Added: 18 aliases + 27 decoded values (Stage C picked more columns)
- Removed: 14 decoded values + 10 aliases (LLM variation)
- Zero high-value regressions (no property_name / semantic_type /
  entity_name losses)

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
…ortal_to_omop/

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
…dules

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
…tal-showcase-extract

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

# Conflicts:
#	src/sema/cli_eval.py
#	src/sema/cli_ingest.py
#	src/sema/engine/few_shot.py
#	tests/unit/test_cli_ingest.py
#	tests/unit/test_few_shot.py
#	tests/unit/test_few_shot_quality.py
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
@deanban deanban merged commit f946eed into main Apr 22, 2026
3 checks passed
@deanban deanban deleted the dean/refactor/cbioportal-showcase-extract branch April 22, 2026 00:12
deanban added a commit that referenced this pull request Apr 23, 2026
…se (#74)

* feat: add DomainContext model, CLI flag, and pipeline wiring

Domain precedence: CLI > config > profiler > default.
Profiler evidence preserved when CLI/config overrides.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* feat: add staged L2 schemas, Stage A/B prompts, and trigger logic

StageAResult, StageBColumnResult, StageBBatchResult, StageCResult,
StageCBatchResult, StageStatus, UnresolvedColumn models.
Stage A/B prompt builders with domain context slots.
Critical column identification, coverage computation, B pass/fail logic.
Stage C deterministic trigger with low-cardinality fallback.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* feat: wire A→B→C→merge pipeline with recovery and enriched vocab context

interpret_table_staged() runs full A→B→C→merge.
Merge ownership matrix: A=entity, B=property, C=decoded values.
Bounded B recovery: retry, split, Tier 1 rescue.
semantic_unresolved produced for low-confidence ambiguous columns.
VocabColumnContext enriched with B output at version 1.
use_staged=True default with PromptLayers rollout flags.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* feat: add eval harness — assertion dump, diff, telemetry, dev slice

Assertion dump/load for checkpoint comparison.
Structured diff with regression flagging.
TableTelemetry/PipelineTelemetry with milestone report builder.
13-table dev slice and 10-table holdout definitions.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* feat: add domain-aware prompts and healthcare few-shot library

Domain bias header with conflict handling.
Healthcare/generic semantic type inventories.
Vocabulary family hints for healthcare domain.
5 Stage A, 12 Stage B, 8 Stage C few-shot examples.
Holdout disjointness verified.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* test: add Stage C trigger, execution, merge, and partial failure tests

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* feat: add cBioPortal + OMOP ingest pipeline and Databricks bridge

Adds `sema ingest` and `sema push` subcommands backed by a DuckDB staging
area. Parses cBioPortal clinical/MAF/timeline files and OMOP CDM DDL +
vocabulary CSVs into DuckDB, then pushes to Databricks via Arrow with
optional COPY INTO when a cloud staging URI is configured.

- `src/sema/ingest/`: cBioPortal + OMOP parsers, DuckDB staging, Databricks push
- `src/sema/cli_ingest.py`: click group wiring `ingest` and `push` commands
- `src/sema/models/config.py`: `IngestConfig`, `IngestOmopConfig`,
  `IngestDatabricksTargetConfig` with env-prefix settings
- `pyproject.toml`: adds `duckdb>=1.0.0` and `pyarrow>=14.0.0` runtime deps
- `.env.example`: documents INGEST_* env vars

Unit coverage across parsers, staging lifecycle, Databricks bridge
provisioning, and CLI wiring (63 tests).

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* chore: gitignore .wolf/ OpenWolf context directory

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* feat: add sema eval CLI for dev-slice runner, diff, and milestone report

Makes rollout steps 2–6 of source-semantic-hardening executable with a
single command per step. Per-table assertion + telemetry dumps are
written when `eval_dump_dir` is set on BuildConfig; `slice_tables`
filters discovered work items to a named subset.

- `sema eval run --slice <yaml> --label <l> --output-dir <d>`: runs the
  pipeline on the slice and writes `<table>__<label>.json` + paired
  `<table>__<label>__telemetry.json` per table.
- `sema eval diff --baseline <d> --current <d>`: pairs dumps by table
  and aggregates semantic churn using the existing `diff_dumps` keyed
  on `(subject_ref, predicate)` — covers L2 and L3 assertions.
- `sema eval report --run <d> --label <l> [--baseline <d>]`: aggregates
  telemetry across tables and, if a baseline dir is given, folds in the
  churn summary — produces a milestone-ready JSON report.

Wiring:
- `BuildConfig` gains `eval_dump_dir`, `eval_config_label`,
  `slice_tables`.
- `process_table` accepts `eval_dump_dir`/`eval_config_label` and calls
  the dump hook after successful runs; failures to dump are logged, not
  raised.
- `_run_pipeline_stages` now returns `(assertions, staged_output)` so
  `process_table` can extract telemetry without a second pass.
- `_discover_tables` filters via `_filter_work_items_to_slice` when
  `slice_tables` is set.

Runbook at `docs/runbooks/source-semantic-eval.md` lists the exact
commands for tasks 5.6–5.7, 7.7–7.8, 8.7–8.9, 9.8–9.9, and 10.1–10.5.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* fix: populate real latency and token telemetry in staged L2 pipeline

TableTelemetry previously carried zeros for stage_*_latency_ms and
tokens_*. The staged engine's LLM calls were not measured and the
kwargs to from_stages() were never passed. Eval runs reported
avg_total_latency_ms: 0.0 and made the $0.10/table cost gate and
60s/table latency gate unverifiable.

- `LLMClient` gains an `InvocationStats` dataclass populated on every
  `invoke()`: wall-clock duration in ns, prompt/response char counts,
  and prompt/completion token counts (pulled from `usage_metadata` /
  `response_metadata` when present, else a ~4 chars/token estimate).
- `SemanticEngine.interpret_table_staged_with_metrics()` wraps the
  client's `invoke` for the duration of a staged run so that every
  batched Stage B call and every Stage C column call contributes to
  the table's `StageMetrics` (tokens per stage via accumulation,
  latency per stage via `time.monotonic_ns` bookends).
- `interpret_table_staged` preserved as a thin wrapper so existing
  callers are unchanged.
- `build_utils._run_semantic_interpretation` now threads the metrics
  through to `TableTelemetry.from_stages()`.

Verified end-to-end on 6-table dev slice: avg_total_latency_ms = 28533,
tokens input = 20983, tokens output = 23417 — well under the
cost/latency gates.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* chore(eval): add dev_slice_poc.yaml matching current Databricks cBioPortal ingest

Six tables actually loaded in workspace.cbioportal (patient, sample,
mutation, timeline_sample_acquisition, timeline_status,
timeline_treatment) — a subset of the 13 tables in dev_slice.yaml.
Used for initial rollout evaluation runs until full ingest lands.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* fix: sanitize LLM-leaked type suffix from Stage B column names

Caught in the step-2 dev-slice eval: Stage B occasionally returned
column names with an embedded type spec (e.g. 'BIOTYPE (STRING)',
'age [INT]', 'patient_id: VARCHAR'). The merge step uses the column
field verbatim to build '<table_ref>.<col>' subject refs, so a single
noisy column silently severed the link between L2 property
assertions and the extractor's COLUMN_EXISTS assertions. The
downstream effect was a 'regression_risk' removal in the diff tool.

Adds `sanitize_column_name` (strips the first whitespace / paren /
bracket / colon onward) and applies it to every StageBColumnResult
returned by `_invoke_stage_b_batch` before it reaches the merge or
vocab-context builder. LLM non-determinism occasionally skips the
leak entirely (step 3 domain-aware had zero) but the fix is cheap
insurance and costs nothing on clean output.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* fix(few-shot): add synonyms to Stage B examples and compact JSON format

Step 4 dev-slice eval showed 52 alias regressions vs step 3 (domain-
aware). Root cause: none of the 12 Stage B few-shot examples in
few_shot.py populated a `synonyms` field. The LLM imitated the
examples' empty-by-omission pattern and dropped aliases that step 3
was emitting.

Changes:
- Add realistic `synonyms` lists to 8 of 12 Stage B examples
  (patient_id, sample_id, gender, tmb, msi_type, hugo_symbol,
  variant_classification, agent, stage_highest). Examples without
  synonyms remain to demonstrate empty-is-valid.
- Switch `format_examples` to compact JSON (no indent) — recoups most
  of the token cost added by the synonyms.

Measured impact on 6-table dev slice:
- Alias regression 52 → 16 (−69%)
- Output tokens 22,935 → 23,566 (+631, LLM restored alias emission)
- Input tokens 41,623 → 41,148 (−475, compact JSON)
- All 6 tables still B_SUCCESS with 100% coverage

The +17k input token bump from enabling few-shot in step 4 is the
fixed cost of including the full Stage A+B+C blocks in each of 18+
LLM calls per slice run — not a bug, just the price of few-shot.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* eval: add dev-slice rollout artifacts for steps 2–5

Versions the per-table assertion dumps, telemetry dumps, diff reports,
and milestone reports produced during the source-semantic-hardening
rollout. These back up the task completion claims in
openspec/changes/source-semantic-hardening/tasks.md (which is in a
gitignored path) and serve as a reference baseline for future
evaluation runs.

Contents of eval-runs/:
- step2-baseline-single-pass/  # pre-decomposition reference
- step2-staged-zeroshot/        # A→B decomposition, zero-shot
- step3-domain-aware/           # + domain bias / type inventory / vocab hints
- step4-few-shot/               # + healthcare few-shot (post alias-fix)
- step5-stage-c/                # + Stage C value decoding (full pipeline)
- step{2,3,4,5}-diff.json       # churn summaries vs prior step
- step{2,3,4,5}-report.json     # per-step milestone reports
- end-to-end-diff.json          # baseline → full pipeline delta

Scope: the 6-table POC slice (eval/dev_slice_poc.yaml) reflecting
current Databricks ingest. Holdout and full-corpus runs blocked on
ingest of the remaining 27 cBioPortal tables — see §11-bis in
tasks.md.

eval-runs/*.log added to .gitignore (transient runtime output).

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* feat(ingest): add cBioPortal SV, CNA, gene-panel-matrix, and resource parsers

Extends the cBioPortal ingest to cover five new file types, unlocking
the remaining dev-slice tables (structural_variant, cna,
gene_panel_matrix, resource_definition/patient, clinical_supp_*).

New parsers:
- parse_sv_file — data_sv.txt → structural_variant (position/
  entrez-gene-id columns typed as BIGINT via sv_column_type helper)
- parse_cna_file — data_cna.txt (gene×sample wide matrix) pivoted to
  long format with sample_id / hugo_symbol / entrez_gene_id /
  cna_value. Blank cells become nulls. cna_long_format_rows helper
  lives in cbioportal_utils.py.
- parse_gene_panel_matrix — data_gene_panel_matrix.txt as-is
- parse_resource_file — data_resource_*.txt (definition and
  per-patient/sample entries)

Ingest orchestration:
- _should_download now allows data_sv.txt, data_cna.txt,
  data_gene_panel_matrix.txt, data_resource_*, data_clinical_supp_*
  via DOWNLOAD_EXACT_FILENAMES / DOWNLOAD_PREFIXES /
  EXCLUDED_DOWNLOAD_PREFIXES constants in cbioportal_utils.py
- SKIP_FILENAME_PATTERNS narrowed to only truly unsupported matrix
  files (expression, methylation, log2/linear/armlevel CNA, mrna,
  rppa)
- _ingest_study_dir wires three new fixed-file parsers
  (_try_ingest_fixed_files) plus prefix-matched passes for
  data_resource_* and data_clinical_supp_*
  (_ingest_prefix_matched_files)
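
A minimal sketch of the allow-list check described above, assuming
exclusions take precedence over both the exact-name set and the prefix
list. The three arguments stand in for DOWNLOAD_EXACT_FILENAMES,
DOWNLOAD_PREFIXES, and EXCLUDED_DOWNLOAD_PREFIXES. The values passed
in the usage lines are illustrative, and the excluded prefix in
particular is made up.

```python
from __future__ import annotations


def should_download(
    filename: str,
    exact: frozenset[str],
    prefixes: tuple[str, ...],
    excluded_prefixes: tuple[str, ...],
) -> bool:
    """Return True if a study file should be fetched."""
    # Assumption: an excluded prefix always wins over an allowed one.
    if filename.startswith(excluded_prefixes):
        return False
    return filename in exact or filename.startswith(prefixes)


# Illustrative values only; the real constants live in cbioportal_utils.py.
EXACT = frozenset({"data_sv.txt", "data_cna.txt", "data_gene_panel_matrix.txt"})
PREFIXES = ("data_resource_", "data_clinical_supp_")
EXCLUDED = ("data_resource_example_",)  # hypothetical exclusion

print(should_download("data_clinical_supp_hypoxia.txt", EXACT, PREFIXES, EXCLUDED))
print(should_download("data_log2_cna.txt", EXACT, PREFIXES, EXCLUDED))
```

str.startswith accepts a tuple of prefixes, so no explicit loop is
needed.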

Verified end-to-end against gbm_tcga_pan_can_atlas_2018: DuckDB now
holds 12 cbioportal tables including cna (14.4M long-format rows
pivoted from ~24k genes × ~600 samples), structural_variant (510
rows), gene_panel_matrix, resource_definition/patient, and
clinical_supp_hypoxia.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* eval: expand dev slice to 12 tables, re-run full pipeline on GBM ingest

Now that the cBioPortal ingest has been extended to cover SV, CNA,
gene-panel matrix, resources, and clinical supplements, the dev slice
grows from the original 6-table POC (patient, sample, mutation, 3
timelines) to 12 tables sourced from gbm_tcga_pan_can_atlas_2018.

Full A→B→C staged pipeline results on all 12 tables:

- 12/12 B_SUCCESS, 100% raw and critical coverage across every table
- 0 retries, 0 splits, 0 rescues — zero recovery overhead
- 69 Stage C calls → 195 has_decoded_value assertions
- 259 has_property_name assertions (up from 222 on the 6-table slice)
- Avg latency 25.2s / table, under the 60s gate (peak 105s on
  mutation's 114 columns)
- Total cost $0.0160 for all 12 tables ($0.0013/table — 77× under
  the $0.10/table gate)
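
The cost headroom quoted above can be reproduced from the reported
figures. This is only a worked check, assuming the 77× figure was
derived from the per-table cost rounded to four decimals.

```python
total_cost_usd = 0.0160      # reported total for the run
table_count = 12
gate_usd_per_table = 0.10    # budget gate quoted in the message

per_table = round(total_cost_usd / table_count, 4)
headroom = gate_usd_per_table / per_table

print(f"${per_table:.4f}/table, {headroom:.0f}x under the gate")
```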

Spot-checks on the four new table types:
- structural_variant: correct entity "Structural Variant" with grain
  "one row per structural variant ... per sample"; Stage C correctly
  decoded in-frame vs frameshift mutation semantics
- cna (long format): 4 columns classified as sample_id /
  hugo_symbol / entrez_gene_id / cna_value, one Stage C call
- gene_panel_matrix, resource_definition, resource_patient: all
  identifier-heavy tables classified as expected

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* refactor: remove deprecated single-pass and two-pass L2 code (Task 11)

The staged A→B→C pipeline is proven on the 12-table dev slice and
becomes the sole L2 path. This commit rips out everything the rollout
kept around through step 6.

Removed from src/sema/engine/semantic.py:
- PropertyInterpretation and TableInterpretation (old response schemas)
- _PropertyBatchResult (two-pass batch schema)
- build_interpretation_prompt, build_simplified_interpretation_prompt
- build_summary_prompt, build_property_prompt
- _needs_two_pass, _interpret_two_pass
- _interpret_via_llm_client, _interpret_via_raw_llm
- _run_summary_pass, _run_property_pass
- _entity_assertions, _property_assertions,
  _interpretation_to_assertions
- SemanticEngine.__init__(..., llm=...) raw-LLM legacy kwarg

Removed files:
- src/sema/engine/semantic_utils.py (entire file — all legacy helpers)
- tests/unit/test_two_pass_semantic.py (legacy path tests)

Reshaped:
- SemanticEngine.interpret_table now delegates to
  interpret_table_staged_with_metrics and returns just assertions —
  one staged path for every table regardless of width
- pipeline.build_utils._run_semantic_interpretation drops the
  use_staged branch; always returns (assertions, _StagedOutput)
- pipeline.build._run_pipeline_stages returns
  (assertions, staged_output) unconditionally
- process_table, _spawn_workers*, and BuildConfig lose the
  use_staged flag; cli_eval drops --use-staged/--no-use-staged
- Tests updated to mock the staged sequence (StageAResult +
  StageBBatchResult) instead of TableInterpretation
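
The reshaped entry point can be pictured as a thin wrapper. The sketch
below is not the repo's code: the class, the metrics fields, and the
(assertions, metrics) return shape are assumptions inferred from the
(assertions, _StagedOutput) tuple mentioned above, and the staged body
is a placeholder.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StagedOutputSketch:
    """Hypothetical stand-in for the staged pipeline's metrics object."""
    stage_c_calls: int = 0
    retries: int = 0


class SemanticEngineSketch:
    def interpret_table_staged_with_metrics(
        self, table: dict[str, Any]
    ) -> tuple[list[dict[str, Any]], StagedOutputSketch]:
        # Placeholder for the staged run: one assertion per column.
        assertions = [
            {"predicate": "has_property_name", "column": name}
            for name in table["columns"]
        ]
        return assertions, StagedOutputSketch()

    def interpret_table(self, table: dict[str, Any]) -> list[dict[str, Any]]:
        # Single staged path for every table; metrics are discarded here.
        assertions, _metrics = self.interpret_table_staged_with_metrics(table)
        return assertions
```

Callers that need telemetry call the staged method directly; callers
that only need assertions keep using interpret_table.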

Test suite: 1004/1004 passing, mypy clean on 94 source files.
Test count dropped from 1041 to 1004 (the 37 removed tests all
exercised the deprecated legacy path).

Follow-up not addressed here: semantic.py (520 lines) and
build_utils.py (508 lines) both exceed the project's 400-line file
standard. Both were already over before this cleanup (745 and 514
lines). Splitting them is a separate refactor; the simplest next step
is extracting interpret_table_staged_with_metrics plus the Stage A/B/C
runners into stage_utils.py, which would shave ~200 lines from
semantic.py.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* eval: post-cleanup sanity run on 12-table slice

Validates that the Task 11 refactor (legacy L2 code removal, 17-file
diff, -1494 LOC) did not alter pipeline behavior.

Results vs pre-cleanup step 5 v2:
- 12/12 tables B_SUCCESS, 100% coverage, zero recovery
- 259 has_property_name (identical to pre-cleanup)
- 12 has_entity_name (identical)
- 140 vs 195 has_decoded_value, 62 vs 69 Stage C calls — stochastic
  LLM variation well within run-to-run noise
- Cost $0.016 identical; latency 278s vs 302s (8% faster, noise)
- Diff: 23 added / 22 removed — symmetric, indicates zero regression

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* eval: verification run after Neo4j wipe + Task 11 cleanup

Full staged pipeline on the 12-table slice, Neo4j wiped first.

Pipeline:
- 12/12 tables B_SUCCESS @ 100% raw and critical coverage, zero recovery
- 12 entities, 259 properties, 174 decoded values, 81 Stage C calls
- 285s total / 23.8s avg, tokens 73,346 in + 34,614 out
- Cost $0.0159 ($0.0013/table, $0.00006/column) — 77× under budget

Neo4j state (3,755 nodes after materialization):
- Catalog/Schema/DataSource: 1 each
- Table: 12 ✓
- Entity: 12 (semantically correct: 'Biospecimen/Sample', 'Copy Number
  Alteration', 'Somatic Mutation', 'Structural Variant', 'Patient Hypoxia
  Assessment', 'Patient Status Event', 'Sample Acquisition Event',
  'Sample Genomic Profile Availability', 'Treatment Event', etc.)
- Column: 259 ✓
- Property: 259 ✓
- ValueSet: 150 / Term: 290 (from Stage C)
- Alias: 452 / Vocabulary: 143 (from L3)
- Assertion provenance: 2,175
- Edges: HAS_PROPERTY, PROPERTY_ON_COLUMN, ENTITY_ON_TABLE,
  CLASSIFIED_AS, HAS_VALUE_SET, MEMBER_OF, REFERS_TO — all present
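
As a quick consistency check, the per-label counts listed above sum to
the reported node total. The label names used as dict keys are
assumptions (in particular "Assertion" for the provenance nodes); only
the counts come from this message.

```python
node_counts = {
    "Catalog": 1,
    "Schema": 1,
    "DataSource": 1,
    "Table": 12,
    "Entity": 12,
    "Column": 259,
    "Property": 259,
    "ValueSet": 150,
    "Term": 290,
    "Alias": 452,
    "Vocabulary": 143,
    "Assertion": 2175,  # assertion provenance nodes
}

total_nodes = sum(node_counts.values())
print(total_nodes)  # 3755
```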

Diff vs pre-cleanup baseline (step5-stage-c-v2):
- 45 added, 24 removed, 678 changed
- Added: 18 aliases + 27 decoded values (Stage C picked more columns)
- Removed: 14 decoded values + 10 aliases (LLM variation)
- Zero high-value regressions (no property_name / semantic_type /
  entity_name losses)
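
An added / removed / changed summary of this kind can be computed with
plain set operations. The sketch below is an assumption about the
shape of the data, not the repo's diff tool: it treats each dump as a
dict keyed by a hypothetical (subject, predicate) pair.

```python
from typing import Any, Hashable


def diff_assertions(
    before: dict[Hashable, Any], after: dict[Hashable, Any]
) -> dict[str, list]:
    """Classify assertion keys as added, removed, or changed."""
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = sorted(
        key for key in before.keys() & after.keys() if before[key] != after[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


baseline = {
    ("patient.sex", "has_property_name"): "Sex",
    ("patient.sex", "has_decoded_value"): "M=Male",
}
rerun = {
    ("patient.sex", "has_property_name"): "Patient Sex",
    ("patient.age", "has_property_name"): "Age",
}
print(diff_assertions(baseline, rerun))
```

Filtering the removed list by predicate (for example property_name,
semantic_type, entity_name) is one way to spot the high-value
regressions mentioned above.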

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* docs(eval): add step 6 milestone summary for 12-table POC slice

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* refactor: extract cBioPortal ingest and slice YAMLs to showcase/cbioportal_to_omop/

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* feat(few-shot): add generic base layer and split domain packs into modules

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

* ci: drop single-entry python matrix so test context matches branch rule

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>

---------

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>