feat: expand healthcare eval coverage #85
Merged
Phase 1 of expand-healthcare-eval-coverage:
- ingest: sanitize_schema_name with sha256 truncation suffix; StudyRegistry
with fail-fast collision detection on differing original_study_ids
- ingest: registry-driven push discovery; Bridge default unions registered
schemas with known shared schemas; --discover-all-schemas opt-in;
--schemas / target_schemas treated as allowlist filter, not replacement
- parsers: parse_segmented_cna for .seg files; parse_lab_timeline types
VALUE as DOUBLE; gene_panel_matrix pivots wide→long; _ingest_study_dir
dispatches .seg files and detects lab timelines via header inspection
- few-shot: 4 new Stage A entities (lab timeline, procedure, performance
status, segmented CNA); 9 new Stage B examples (lab value/units, ECOG/
Karnofsky, PD-L1/MMR/Gleason, procedure code); 5 new Stage C decodings;
split into stage_{a,b,c} modules to keep files under 400 lines
- eval: msk_chord_dev (12 tables) and msk_chord_holdout (9 tables)
slices, contamination_map listing few-shot source tables, scripts/
check_slice_contamination.py + CI guard ensuring holdout disjointness
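The holdout-disjointness guard in the last bullet can be sketched as a set intersection. This is only an illustration, not the contents of scripts/check_slice_contamination.py: the function name, the shape of the contamination map (few-shot example id mapped to its source tables), and the table names are all assumptions.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping


def find_contaminated_tables(
    holdout_tables: Iterable[str],
    contamination_map: Mapping[str, Iterable[str]],
) -> dict[str, list[str]]:
    """Return, per few-shot example, the holdout tables it was drawn from.

    An empty result means the holdout slice is disjoint from every
    few-shot source table. Comparison is case-insensitive.
    """
    holdout = {name.lower() for name in holdout_tables}
    overlaps: dict[str, list[str]] = {}
    for example_id, source_tables in contamination_map.items():
        hits = sorted(holdout & {name.lower() for name in source_tables})
        if hits:
            overlaps[example_id] = hits
    return overlaps


if __name__ == "__main__":
    # Hypothetical table and example names, for illustration only.
    holdout = ["timeline_labs", "clinical_sample"]
    few_shot_sources = {
        "stage_b_lab_units": ["timeline_labs"],
        "stage_a_procedure": ["timeline_surgery"],
    }
    for example_id, tables in find_contaminated_tables(holdout, few_shot_sources).items():
        print(f"{example_id} overlaps holdout tables: {', '.join(tables)}")
```

A CI guard built on this would exit non-zero whenever the returned mapping is non-empty.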
Sections 5, 6, 8, 10, 11 done. Sections 2/3/4 (graph layer source_schema
stamping + MERGE-key reconciliation), 12 (FK detector), 7/9/13/14
(legacy migration + live ingest + eval runs) deferred to follow-ups.
1116 unit tests passing, mypy clean, coverage 87.7%.
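For readers unfamiliar with the sha256 truncation suffix and fail-fast collision detection mentioned in the first ingest bullet, a minimal sketch follows. It is not the code merged here: the 64-character limit, the 8-character suffix, the cbioportal_ prefix default, and the SchemaCollisionError name are assumptions chosen for illustration.

```python
from __future__ import annotations

import hashlib
import re

MAX_SCHEMA_LEN = 64  # assumed limit, not taken from the PR
HASH_LEN = 8         # assumed suffix length


def sanitize_schema_name(study_id: str, prefix: str = "cbioportal_") -> str:
    """Lowercase, replace unsafe characters, and truncate with a hash suffix."""
    cleaned = re.sub(r"[^a-z0-9_]+", "_", study_id.lower()).strip("_")
    candidate = f"{prefix}{cleaned}"
    if len(candidate) <= MAX_SCHEMA_LEN:
        return candidate
    # Hash the ORIGINAL id so two long ids sharing a prefix stay distinct.
    digest = hashlib.sha256(study_id.encode("utf-8")).hexdigest()[:HASH_LEN]
    keep = MAX_SCHEMA_LEN - HASH_LEN - 1
    return f"{candidate[:keep]}_{digest}"


class SchemaCollisionError(ValueError):
    """Two different original study ids map to the same schema name."""


class StudyRegistry:
    """Toy in-memory registry; the real one presumably persists its state."""

    def __init__(self) -> None:
        self._by_schema: dict[str, str] = {}

    def register(self, original_study_id: str) -> str:
        schema = sanitize_schema_name(original_study_id)
        existing = self._by_schema.setdefault(schema, original_study_id)
        if existing != original_study_id:
            # Fail fast instead of silently merging two studies.
            raise SchemaCollisionError(
                f"{original_study_id!r} and {existing!r} both map to {schema!r}"
            )
        return schema
```

Registering the same study twice is idempotent in this sketch; only a differing original id for an already-claimed schema raises.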
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
…_utils extract Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
…hcare slices Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Add DDL helpers for re-applying column and table comments to existing Databricks tables, alongside identifier validation that rejects semicolons, backticks, and control characters in catalog/schema/table/column names. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
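As a rough illustration of the validation rule described in this commit (reject semicolons, backticks, and control characters), the sketch below validates each identifier before quoting it into a comment statement. The function names are hypothetical, and the string-literal escaping is an assumption about how the comment text might be made safe, not a description of the merged helpers.

```python
import unicodedata

FORBIDDEN_CHARS = {";", "`"}


def validate_identifier(name: str, kind: str = "identifier") -> str:
    """Return name unchanged, or raise ValueError if it is unsafe to quote."""
    if not name:
        raise ValueError(f"empty {kind}")
    for ch in name:
        # Unicode category "Cc" covers control characters such as \n and \x00.
        if ch in FORBIDDEN_CHARS or unicodedata.category(ch) == "Cc":
            raise ValueError(f"invalid character {ch!r} in {kind} {name!r}")
    return name


def column_comment_sql(
    catalog: str, schema: str, table: str, column: str, comment: str
) -> str:
    """Build an ALTER TABLE ... ALTER COLUMN ... COMMENT statement."""
    parts = [
        validate_identifier(catalog, "catalog"),
        validate_identifier(schema, "schema"),
        validate_identifier(table, "table"),
    ]
    qualified = ".".join(f"`{part}`" for part in parts)
    col = validate_identifier(column, "column")
    # Assumed escaping: backslash-escape backslashes and single quotes.
    escaped = comment.replace("\\", "\\\\").replace("'", "\\'")
    return f"ALTER TABLE {qualified} ALTER COLUMN `{col}` COMMENT '{escaped}'"
```

Rejecting backticks outright, rather than escaping them, keeps the quoting logic trivial: a validated name can always be wrapped in backticks as-is.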
Add a typed planner/executor that diffs parser-extracted comments against live Databricks state and emits ALTER TABLE statements without touching data. Includes a registry-aware context resolver with distinct errors for unregistered-study vs. registered-but-missing-cache, plus a cBioPortal-specific extractor that walks the local source cache. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
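The diffing step described in this commit could be pictured roughly as below. This is a sketch under stated assumptions, not the merged planner: the CommentChange and CommentPlan types, the (table, column) key shape, and the choice to skip targets absent from the live state are all hypothetical. Rendering and executing the resulting statements is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# (table, column); column is None for a table-level comment.
Target = tuple[str, Optional[str]]


@dataclass(frozen=True)
class CommentChange:
    table: str
    column: Optional[str]
    old: Optional[str]
    new: str


@dataclass
class CommentPlan:
    changes: list[CommentChange] = field(default_factory=list)
    skipped: list[Target] = field(default_factory=list)  # not present live


def plan_comment_changes(
    parsed: dict[Target, str],
    live: dict[Target, Optional[str]],
) -> CommentPlan:
    """Diff parser-extracted comments against live comments.

    Only metadata is planned; nothing here reads or writes row data.
    """
    plan = CommentPlan()
    for target in sorted(parsed, key=lambda t: (t[0], t[1] or "")):
        wanted = parsed[target]
        if target not in live:
            plan.skipped.append(target)
            continue
        current = live[target]
        if (current or "") != wanted:
            plan.changes.append(CommentChange(target[0], target[1], current, wanted))
    return plan
```

Separating planning from execution in this way is what makes a dry run cheap: the plan can be printed or serialized without issuing any statement.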
Wire the recovery primitive into a CLI subcommand mirroring sema ingest cbioportal ergonomics. Supports --dry-run, --force, --json, and full override set (--source-cache / --target-catalog / --target-schema) for registry-bypass mode. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
DuckDB's CREATE TABLE … AS SELECT * rebuild loses comments. Thread the cBioPortal parser back into _rename_schema via an optional comment_source callable; the default resolves IngestConfig.cache_dir / study_id and re-applies parsed column and table comments after each table copy. Cache-absent path logs a WARN with a pointer to the recover-comments command. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
Rate limit errors (429 / REQUEST_LIMIT_EXCEEDED) now use a longer base delay (10s) and larger multiplier (3x) than generic transient retries, and do not trip the circuit breaker — rate limiting is a quota signal, not a service-health failure. Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
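The retry policy in this commit can be illustrated with a small sketch. The 10s base delay and 3x multiplier for rate limits come from the commit message; the generic base and multiplier, the breaker threshold, and all names below are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

RATE_LIMIT_BASE_S = 10.0      # from the commit message
RATE_LIMIT_MULTIPLIER = 3.0   # from the commit message
GENERIC_BASE_S = 1.0          # assumed value
GENERIC_MULTIPLIER = 2.0      # assumed value


def is_rate_limited(status_code: int | None, message: str) -> bool:
    """Treat HTTP 429 or a REQUEST_LIMIT_EXCEEDED message as a quota signal."""
    return status_code == 429 or "REQUEST_LIMIT_EXCEEDED" in message


def backoff_delay(attempt: int, rate_limited: bool) -> float:
    """Exponential delay in seconds for a zero-based retry attempt."""
    if rate_limited:
        return RATE_LIMIT_BASE_S * RATE_LIMIT_MULTIPLIER ** attempt
    return GENERIC_BASE_S * GENERIC_MULTIPLIER ** attempt


@dataclass
class CircuitBreaker:
    threshold: int = 5  # assumed value
    failures: int = 0

    def record_failure(self, rate_limited: bool) -> None:
        # Rate limiting says nothing about service health, so it is not counted.
        if not rate_limited:
            self.failures += 1

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold
```

With these numbers, three consecutive rate-limited attempts wait 10s, 30s, and 90s, and the breaker stays closed however many of them occur.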
Summary
Multi-commit branch expanding the healthcare evaluation surface for sema. Highlights:
- cbioportal_<study_id> schemas, _sema_study_registry, scoped delete via source_schema on MERGE keys, registry-driven push discovery.
- --enable-fk-detection / --materialize-structural-fk flags.
- sema ingest recover-comments command + _rename_schema patch so the DuckDB rename emulator preserves comments end-to-end. GBM was recovered live (61 column comments + 12 table comments restored, 0 failed; matches legacy 24% baseline).
- REQUEST_LIMIT_EXCEEDED errors get a longer base delay (10s) and bigger multiplier (3×) than generic transients, and do not trip the circuit breaker.
Commits
Verification
src/sema/.
Test plan
- uv run pytest -m unit + mypy on the PR
- sema ingest recover-comments --study <study> --dry-run against their workspace if they want to validate the recovery shape
- scripts/migrate_cbioportal_to_namespaced.py against a staging DuckDB
Closes