feat: expand healthcare eval coverage #85

Merged: deanban merged 11 commits into main from dean/explore/expand-healthcare-eval-coverage on May 5, 2026

deanban commented May 5, 2026

Summary

Multi-commit branch expanding the healthcare evaluation surface for sema. Highlights:

  • Multi-study namespacing (cbioportal_<study_id> schemas, _sema_study_registry, scoped-delete via source_schema on MERGE keys, registry-driven push discovery).
  • FK / join detector with three tiers (data-verified 0.95, cardinality-consistent 0.80, structural-only 0.70), wired into the build path with --enable-fk-detection / --materialize-structural-fk flags.
  • UC Volume staging + size-based COPY INTO routing, which cut the msk_chord CNA push from ~10–20 h to ~24 min.
  • cBioPortal column-comment recovery: new sema ingest recover-comments command plus a _rename_schema patch so the DuckDB rename emulator preserves comments end-to-end. GBM was recovered live (61 column comments + 12 table comments restored, 0 failed; matches the legacy 24% baseline).
  • Rate-limit-aware LLM backoff — 429 / REQUEST_LIMIT_EXCEEDED errors get a longer base delay (10s) and bigger multiplier (3×) than generic transients, and do not trip the circuit breaker.
  • MSK CHORD parsers (case-insensitive header dedupe), timeline regex hyphen fix, healthcare few-shot expansion.
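
One way to read the three FK detector tiers above is as a cascade that returns the strongest tier the available evidence supports. The sketch below assumes that reading. The confidence values come from the summary, while `ColumnPairEvidence`, `classify_fk`, and the decision rules are hypothetical and are not taken from sema's code.

```python
from dataclasses import dataclass
from typing import Optional

# Confidence per tier, as stated in the PR summary.
TIER_CONFIDENCE = {
    "data_verified": 0.95,
    "cardinality_consistent": 0.80,
    "structural_only": 0.70,
}


@dataclass(frozen=True)
class ColumnPairEvidence:
    """Hypothetical evidence for one candidate child -> parent join."""
    names_and_types_match: bool       # structural signal (naming, dtype)
    parent_is_unique: Optional[bool]  # e.g. from a warehouse profile, if known
    child_distinct: Optional[int]     # distinct values in the child column
    parent_distinct: Optional[int]    # distinct values in the parent column
    orphan_rows: Optional[int]        # child rows without a parent match, if checked


def classify_fk(ev: ColumnPairEvidence) -> Optional[tuple[str, float]]:
    """Return (tier, confidence) for the strongest supported tier, or None."""
    if not ev.names_and_types_match:
        return None  # no structural match, so not a candidate at all
    if ev.parent_is_unique and ev.orphan_rows == 0:
        tier = "data_verified"  # every child value was found in the parent
    elif (
        ev.parent_is_unique
        and ev.child_distinct is not None
        and ev.parent_distinct is not None
        and ev.child_distinct <= ev.parent_distinct
    ):
        tier = "cardinality_consistent"  # profile counts do not contradict an FK
    else:
        tier = "structural_only"  # names and types match, nothing else known
    return tier, TIER_CONFIDENCE[tier]
```

Keeping the tier name next to the score lets a caller filter on either one, for example to treat structural-only candidates differently from data-verified ones.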

Commits

  • feat: study-namespacing infra + MSK CHORD parsers + healthcare few-shots
  • feat(engine): tier-based FK detection with warehouse profile lookup
  • feat(graph): multi-study namespacing with source_schema scoped delete
  • feat(pipeline): wire FK detection + namespaced schema discovery + cli_utils extract
  • feat(cbioportal): namespaced ingest, timeline regex hyphen fix, healthcare slices
  • feat(push): UC Volume staging + size-based COPY INTO routing
  • feat(ingest): build_alter_column_comment_sql primitive
  • feat(ingest): comment_recovery primitive (data-free)
  • feat(cli): sema ingest recover-comments command
  • fix(migrate): preserve column comments through schema rename
  • feat(llm): rate-limit-aware backoff and circuit-breaker classification

Verification

  • 1251 unit tests passing (1 skipped, 38 deselected).
  • mypy strict clean across src/sema/.
  • Coverage 89% (gate ≥85%).
  • GBM recovery executed live; SQL audit shows column comments restored for patient 38/38, sample 19/19, clinical_supp_hypoxia 4/4 (61 total). L1 connector spot-check confirms restored descriptions are read.
  • MSK CHORD dry-run confirms 50 columns already at parity (0 ALTERs needed).

Test plan

  • CI runs uv run pytest -m unit + mypy on the PR
  • Reviewer dry-runs sema ingest recover-comments --study <study> --dry-run against their workspace if they want to validate the recovery shape
  • Reviewer verifies migration patch by deleting + re-running scripts/migrate_cbioportal_to_namespaced.py against a staging DuckDB

Closes

deanban added 11 commits May 5, 2026 00:49
Phase 1 of expand-healthcare-eval-coverage:

- ingest: sanitize_schema_name with sha256 truncation suffix; StudyRegistry
  with fail-fast collision detection on differing original_study_ids
- ingest: registry-driven push discovery; Bridge default unions registered
  schemas with known shared schemas; --discover-all-schemas opt-in;
  --schemas / target_schemas treated as allowlist filter, not replacement
- parsers: parse_segmented_cna for .seg files; parse_lab_timeline types
  VALUE as DOUBLE; gene_panel_matrix pivots wide→long; _ingest_study_dir
  dispatches .seg files and detects lab timelines via header inspection
- few-shot: 4 new Stage A entities (lab timeline, procedure, performance
  status, segmented CNA); 9 new Stage B examples (lab value/units, ECOG/
  Karnofsky, PD-L1/MMR/Gleason, procedure code); 5 new Stage C decodings;
  split into stage_{a,b,c} modules to keep files under 400 lines
- eval: msk_chord_dev (12 tables) and msk_chord_holdout (9 tables)
  slices, contamination_map listing few-shot source tables, scripts/
  check_slice_contamination.py + CI guard ensuring holdout disjointness
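
The schema-name sanitizing and fail-fast collision check from the first bullets can be sketched as follows. This is a minimal illustration, not the actual implementation: the length budget, suffix length, character policy, and the in-memory `StudyRegistry` are assumptions, and only the `cbioportal_` prefix, the sha256 suffix idea, and the names `sanitize_schema_name` / `StudyRegistry` come from the PR text.

```python
import hashlib
import re

MAX_LEN = 64     # assumed identifier budget, not taken from the PR
HASH_CHARS = 8   # assumed length of the sha256 suffix
PREFIX = "cbioportal_"


def sanitize_schema_name(study_id: str, max_len: int = MAX_LEN) -> str:
    """Lowercase, replace unsafe characters, and hash-suffix over-long names."""
    cleaned = re.sub(r"[^a-z0-9_]", "_", study_id.lower())
    name = PREFIX + cleaned
    if len(name) <= max_len:
        return name
    # Truncate, then append a digest of the ORIGINAL id so that two long ids
    # sharing a prefix still map to different schema names.
    digest = hashlib.sha256(study_id.encode("utf-8")).hexdigest()[:HASH_CHARS]
    keep = max_len - HASH_CHARS - 1
    return f"{name[:keep]}_{digest}"


class StudyRegistry:
    """In-memory stand-in: maps schema name -> original study id."""

    def __init__(self) -> None:
        self._by_schema: dict[str, str] = {}

    def register(self, study_id: str) -> str:
        schema = sanitize_schema_name(study_id)
        existing = self._by_schema.get(schema)
        if existing is not None and existing != study_id:
            # Fail fast: two different studies would share one schema.
            raise ValueError(
                f"schema {schema!r} is registered for {existing!r}, not {study_id!r}"
            )
        self._by_schema[schema] = study_id
        return schema
```

Registering the same study twice is a no-op here, while two distinct ids that sanitize to the same schema (for example `Demo-Study` and `demo_study`) raise immediately instead of silently sharing tables.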

Sections 5, 6, 8, 10, 11 done. Sections 2/3/4 (graph layer source_schema
stamping + MERGE-key reconciliation), 12 (FK detector), 7/9/13/14
(legacy migration + live ingest + eval runs) deferred to follow-ups.

1116 unit tests passing, mypy clean, coverage 87.7%.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
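
The holdout-disjointness guard mentioned in the eval bullet of this commit reduces to a set intersection. The sketch below is an assumption about its shape, not the contents of scripts/check_slice_contamination.py: the function name, the case-insensitive comparison, and the choice to also compare against the dev slice are all hypothetical.

```python
from typing import Iterable


def _normalize(names: Iterable[str]) -> set[str]:
    # Compare table names case-insensitively and ignore stray whitespace.
    return {n.strip().lower() for n in names}


def find_contamination(
    holdout: Iterable[str],
    fewshot_sources: Iterable[str],
    dev: Iterable[str] = (),
) -> list[str]:
    """Return holdout tables that also appear in few-shot sources or the dev slice."""
    seen_elsewhere = _normalize(fewshot_sources) | _normalize(dev)
    return sorted(_normalize(holdout) & seen_elsewhere)
```

A CI guard could then fail the build whenever the returned list is non-empty.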
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
…_utils extract

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
…hcare slices

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Add DDL helpers for re-applying column and table comments to existing
Databricks tables, alongside identifier validation that rejects
semicolons, backticks, and control characters in catalog/schema/table/
column names.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
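
A minimal sketch of the DDL helpers described in the commit above. The name `build_alter_column_comment_sql` comes from the commit title, but its signature, the helper names, and the escaping policy are assumptions. The emitted statements use the `ALTER TABLE … ALTER COLUMN … COMMENT` and `COMMENT ON TABLE … IS` forms from Databricks SQL.

```python
import re

# Characters the commit message says are rejected: semicolons, backticks,
# and control characters (C0 range plus DEL).
_FORBIDDEN = re.compile(r"[;`\x00-\x1f\x7f]")


def validate_identifier(name: str) -> str:
    """Return the name unchanged, or raise if it is empty or unsafe."""
    if not name or _FORBIDDEN.search(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def _quote(name: str) -> str:
    # Backtick-quote after validation, so a backtick can never appear inside.
    return f"`{validate_identifier(name)}`"


def _literal(text: str) -> str:
    # Assumed escaping: backslashes first, then single quotes.
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_alter_column_comment_sql(
    catalog: str, schema: str, table: str, column: str, comment: str
) -> str:
    target = ".".join(_quote(part) for part in (catalog, schema, table))
    return f"ALTER TABLE {target} ALTER COLUMN {_quote(column)} COMMENT {_literal(comment)}"


def build_table_comment_sql(catalog: str, schema: str, table: str, comment: str) -> str:
    # Hypothetical companion helper for table-level comments.
    target = ".".join(_quote(part) for part in (catalog, schema, table))
    return f"COMMENT ON TABLE {target} IS {_literal(comment)}"
```

Validating before quoting means an identifier such as ``a`b`` is rejected outright rather than escaped, which keeps the quoting logic trivial.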
Add a typed planner/executor that diffs parser-extracted comments
against live Databricks state and emits ALTER TABLE statements without
touching data. Includes a registry-aware context resolver with distinct
errors for unregistered-study vs. registered-but-missing-cache, plus a
cBioPortal-specific extractor that walks the local source cache.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
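
The planner half of the commit above, diffing parser-extracted comments against live state without touching data, might look roughly like this. All names (`CommentChange`, `plan_comment_recovery`) and the `force` semantics are assumptions made for illustration. In this sketch a differing, non-empty live comment is only overwritten when `force` is set.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Key: (table, column); column is None for a table-level comment.
CommentKey = tuple[str, Optional[str]]


@dataclass(frozen=True)
class CommentChange:
    table: str
    column: Optional[str]
    comment: str


def plan_comment_recovery(
    parsed: Mapping[CommentKey, str],
    live: Mapping[CommentKey, Optional[str]],
    force: bool = False,
) -> list[CommentChange]:
    """Return the comment changes needed to bring live state to parity."""
    plan = []
    for (table, column), wanted in parsed.items():
        current = (live.get((table, column)) or "").strip()
        if current == wanted.strip():
            continue  # already at parity, nothing to emit
        if current and not force:
            continue  # keep a differing live comment unless forced
        plan.append(CommentChange(table, column, wanted))
    # Stable order: table-level comment first, then columns by name.
    return sorted(plan, key=lambda c: (c.table, c.column or ""))
```

An empty plan corresponds to the "0 ALTERs needed" outcome of a dry run. An executor would turn each `CommentChange` into one DDL statement.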
Wire the recovery primitive into a CLI subcommand mirroring sema ingest
cbioportal ergonomics. Supports --dry-run, --force, --json, and a full
override set (--source-cache / --target-catalog / --target-schema) for
registry-bypass mode.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
DuckDB's CREATE TABLE … AS SELECT * rebuild loses comments. Thread the
cBioPortal parser back into _rename_schema via an optional
comment_source callable; the default resolves IngestConfig.cache_dir /
study_id and re-applies parsed column and table comments after each
table copy. Cache-absent path logs a WARN with a pointer to the
recover-comments command.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
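
The rename fix above can be pictured as a statement generator that re-applies comments after each table copy. This sketch only builds SQL strings and does not reproduce `_rename_schema`: the function name, the shape of the `comment_source` callable, and the warning text are assumptions. The `COMMENT ON TABLE` / `COMMENT ON COLUMN` statements follow DuckDB's syntax.

```python
import logging
from typing import Callable, Mapping, Optional, Sequence

log = logging.getLogger("rename_sketch")

# Assumed shape: comment_source(table) returns {None: table comment,
# "COL": column comment, ...}, or None when no source cache is available.
CommentSource = Callable[[str], Optional[Mapping[Optional[str], str]]]


def _lit(text: str) -> str:
    # Standard SQL escaping: double any single quote.
    return "'" + text.replace("'", "''") + "'"


def rename_statements(
    old_schema: str,
    new_schema: str,
    tables: Sequence[str],
    comment_source: Optional[CommentSource] = None,
) -> list[str]:
    """Build the copy statements, followed by comment re-application per table."""
    stmts = [f'CREATE SCHEMA IF NOT EXISTS "{new_schema}"']
    for table in tables:
        target = f'"{new_schema}"."{table}"'
        stmts.append(f'CREATE TABLE {target} AS SELECT * FROM "{old_schema}"."{table}"')
        comments = comment_source(table) if comment_source else None
        if comments is None:
            log.warning("no cached comments for %s; run recover-comments afterwards", table)
            continue
        for column, text in comments.items():
            if column is None:
                stmts.append(f"COMMENT ON TABLE {target} IS {_lit(text)}")
            else:
                stmts.append(f'COMMENT ON COLUMN {target}."{column}" IS {_lit(text)}')
    return stmts
```

Passing the comment lookup as a callable keeps the rename logic independent of where comments come from, which is what makes the cache-absent path a warning rather than an error.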
Rate limit errors (429 / REQUEST_LIMIT_EXCEEDED) now use a longer base
delay (10s) and larger multiplier (3x) than generic transient retries,
and do not trip the circuit breaker — rate limiting is a quota signal,
not a service-health failure.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
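
The retry policy in the commit above can be sketched as two backoff profiles plus a breaker that ignores rate-limit errors. The rate-limit base delay (10s) and multiplier (3x) are from the commit message. The generic profile (1s, 2x), the delay cap, the breaker threshold, and all names are assumptions, and the substring match on the error message is deliberately crude.

```python
from dataclasses import dataclass

RATE_LIMIT_MARKERS = ("429", "REQUEST_LIMIT_EXCEEDED")


def is_rate_limited(message: str) -> bool:
    """Crude classification by substring; a real client would inspect status codes."""
    return any(marker in message for marker in RATE_LIMIT_MARKERS)


def backoff_delay(attempt: int, rate_limited: bool, max_delay: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), without jitter."""
    # Rate-limit values are from the commit; the generic ones are assumed.
    base, multiplier = (10.0, 3.0) if rate_limited else (1.0, 2.0)
    return min(base * multiplier ** attempt, max_delay)


@dataclass
class CircuitBreaker:
    threshold: int = 5  # assumed number of failures before opening
    failures: int = 0

    def record_failure(self, message: str) -> None:
        # Rate limiting is a quota signal, not a health failure: do not count it.
        if not is_rate_limited(message):
            self.failures += 1

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold
```

With these values, rate-limited retries wait 10s, 30s, then 90s, while generic transients wait 1s, 2s, then 4s. A production version would normally add jitter to avoid synchronized retries.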
deanban merged commit f3b4378 into main May 5, 2026
3 checks passed
deanban deleted the dean/explore/expand-healthcare-eval-coverage branch May 5, 2026 04:57