feat(build): stage-b failure containment by deanban · Pull Request #96 · Nine-Sigma/sema

deanban · 2026-05-06T01:10:28Z

Summary

Stage-B failure containment work — six sections that change how the build pipeline classifies, recovers from, and persists evidence about failures during semantic interpretation:

Circuit breaker — service-health-only classification (§1). The breaker only trips on transport / 5xx failures; rate limits, JSON parse errors, Pydantic validation errors, and universal-parser ValueErrors no longer trigger cascade-skips of healthy tables.
Resume gating (§2). --resume now preserves prior assertions instead of running the schema-wipe loop. Verified on the populated graph: 37/37 tables marked skipped, assertions 6962→6962.
Engine-local LLM-attempt buffer (§3). Opt-in (gated on eval_dump_dir), reset per table, populated for both successful and failed batch invocations.
StageBFailureError carries staged context (§4). New typed exception subclass of LLMStageError carrying stage_a, stage_b, metrics, and llm_attempts so the failure-artifact writer reads everything off the exception — no engine reference required from the worker.
Forensic artifact persistence (§5). <table>__<label>__failure.json written on every failure path (Stage A failure, B_FAILED, circuit-open). Captures every LLMClient.invoke call across stages including failed batches; classification taxonomy aligned with the breaker.
Metadata-availability policy + B_PARTIAL outcome (§5b). New metadata_tier classifier (rich / sparse / name_only) computed before Stage A so the tier is on every failure artifact. Tier-keyed B_PARTIAL admission floor — rich tables stay at 0.75; sparse / name_only get a lowered floor (default 0.60) so partial commits are admitted instead of failing a coverage-floor check the source can't meet.
Run-level quality budget (§6). Two triggers: stage_b_failure_rate (default 30%) and run_non_contributing_rate (default 40%). Resume-skipped tables count toward the graph-contributing denominator but not the non-contributing trigger. --no-quality-budget disables both ceilings; QualityBudgetExceeded exits with code 7.
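
A minimal sketch of the two-trigger budget check from §6, for readers who want the shape without opening quality_budget.py. Everything below is illustrative: `RunCounts`, its field names, and `exit_code_for` are hypothetical. The description does not spell out the exact denominators, so the sketch assumes the graph-contributing denominator is succeeded + Stage B failed + resume-skipped and the run-marginal denominator is the tables actually processed in this run. The strict `>` comparison is also an assumption.

```python
from dataclasses import dataclass

EXIT_QUALITY_BUDGET = 7  # exit code named in the PR description


class QualityBudgetExceeded(Exception):
    """Single exception type; `trigger` says which ceiling was crossed."""

    def __init__(self, trigger: str, rate: float, ceiling: float) -> None:
        super().__init__(f"{trigger}: {rate:.1%} is over the {ceiling:.1%} ceiling")
        self.trigger = trigger
        self.rate = rate
        self.ceiling = ceiling


@dataclass(frozen=True)
class RunCounts:
    # Hypothetical tally; the real report shape is not shown in this PR.
    succeeded: int               # committed tables, B_PARTIAL included
    stage_b_failed: int          # tables that ended B_FAILED
    other_non_contributing: int  # e.g. Stage A failures, circuit-open skips
    resume_skipped: int          # already committed by a prior run


def check_quality_budget(
    counts: RunCounts,
    stage_b_ceiling: float = 0.30,
    non_contributing_ceiling: float = 0.40,
    enabled: bool = True,
) -> None:
    """Stateless check; raises on the first trigger in a fixed order."""
    if not enabled:  # mirrors --no-quality-budget
        return

    # Assumed graph-contributing denominator: resume-skipped tables count here.
    graph_total = counts.succeeded + counts.stage_b_failed + counts.resume_skipped
    if graph_total:
        rate = counts.stage_b_failed / graph_total
        if rate > stage_b_ceiling:
            raise QualityBudgetExceeded("stage_b_failure_rate", rate, stage_b_ceiling)

    # Assumed run-marginal denominator: only tables processed in this run.
    non_contributing = counts.stage_b_failed + counts.other_non_contributing
    run_total = counts.succeeded + non_contributing
    if run_total:
        rate = non_contributing / run_total
        if rate > non_contributing_ceiling:
            raise QualityBudgetExceeded(
                "run_non_contributing_rate", rate, non_contributing_ceiling
            )


def exit_code_for(counts: RunCounts, enabled: bool = True) -> int:
    """Map a budget abort to exit code 7, anything else to 0."""
    try:
        check_quality_budget(counts, enabled=enabled)
    except QualityBudgetExceeded:
        return EXIT_QUALITY_BUDGET
    return 0
```

Under these assumptions a fully resumed run (every table skipped) has an empty run-marginal denominator and zero Stage B failures, so neither trigger can fire, which matches the resume smoke result reported under Verification.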

Verification

uv run pytest -q — 1351 passed / 1 skipped / 38 deselected
uv run mypy src/sema/ — clean across 112 files
Coverage 89% (quality_budget.py 100%) — gate ≥85%
Live resume smoke against the populated graph (.runs/build_resume_smoke_20260505_163145.log): 37/37 skipped, no circuit-breaker skips on healthy LLM, no quality-budget abort, assertions preserved
Deliberate-failure smoke for the forensic artifact: step_errors[].exception_type discriminates JSONDecodeError vs ValidationError within the content_failure classification; llm_attempts carries failed-batch prompt text; both StageBFailureError (→ semantic_coverage) and plain LLMStageError (→ content_failure) paths verified
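
For reviewers, the classification that the breaker and the artifact writer share can be pictured roughly as below. This is a stdlib-only sketch, not the code in this PR: `HTTPStatusError`, `classify_failure`, `step_error`, and `trips_breaker` are hypothetical names, the two stage-error classes are empty stand-ins for the real ones, and Pydantic is deliberately not imported. The `circuit_open` label is omitted because it describes a skipped table rather than a raised exception.

```python
import json


class HTTPStatusError(Exception):
    """Hypothetical stand-in for a provider error exposing an HTTP status."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


class LLMStageError(Exception):
    """Empty stand-in; the real class also carries llm_attempts."""


class StageBFailureError(LLMStageError):
    """Empty stand-in; the real class carries stage_a, stage_b, metrics."""


def classify_failure(exc: BaseException) -> str:
    """Return one label from the taxonomy listed in the PR description."""
    if isinstance(exc, HTTPStatusError):
        if exc.status_code == 429:
            return "rate_limit"
        if 500 <= exc.status_code < 600:
            return "service_health"
        return "unknown"
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return "service_health"  # transport-level failures
    # Check the subclass first so it is not swallowed by LLMStageError.
    if isinstance(exc, StageBFailureError):
        return "semantic_coverage"
    # json.JSONDecodeError is a ValueError subclass, so it lands here too.
    if isinstance(exc, (LLMStageError, ValueError)):
        return "content_failure"
    return "unknown"


def step_error(exc: BaseException) -> dict:
    """Keep the concrete type so sub-types stay distinguishable."""
    return {
        "exception_type": type(exc).__name__,
        "classification": classify_failure(exc),
    }


def trips_breaker(exc: BaseException) -> bool:
    """Only service-health failures count against the circuit breaker."""
    return classify_failure(exc) == "service_health"


if __name__ == "__main__":
    try:
        json.loads("{not json")
    except json.JSONDecodeError as err:
        print(step_error(err), trips_breaker(err))
```

Recording `exception_type` next to the coarse label is what lets one `content_failure` bucket still distinguish a JSON parse error from a validation error.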

Related issues

Related to #72 (full 33-table cBioPortal corpus sign-off, blocked on ingest). Today's verification exercised the full 37-table corpus across cbioportal_gbm_tcga_pan_can_atlas_2018 + cbioportal_msk_chord_2024; bias check + sign-off still owed under that issue.
Related to #81 (Databricks GPT-5.x endpoints unblock + re-run eval Measurement A). Confirms the AI Gateway and 54mini endpoint are reachable from sema's provider=custom code path; Measurement A re-run still owed under that issue.

Follow-ups surfaced by this PR

#91: regenerate .wolf/anatomy.md to cover src/ and tests/ trees (scanner scope misconfiguration)
#92: optionally split this PR's bundled commit into 7 conventional commits (review-time decision)
#93: investigate sparse-tier dominance on cBioPortal source contract (24/26 tables landed sparse)
#94: rate-limit L3 column LLM calls to stay under gateway RPM cap (26+ 429s observed at 1 worker)
#95: implement self-verification stage in L2 hybrid prompting (vision coverage at 3/5 hybrid techniques)

Test plan

uv run pytest -q green locally
uv run mypy src/sema/ clean locally
Coverage ≥85% locally
Resume smoke (assertion preservation, no cascade skips, no premature quality-budget abort)
Deliberate-failure smoke (artifact completeness + step_errors sub-type discrimination)
CI green on this PR

Restricts the service circuit breaker to service-health failures only, gates the resume wipe correctly, persists forensic artifacts on every failed table, classifies metadata-availability tiers with a B_PARTIAL admission floor for sparse/name_only tables, and adds a run-level quality budget that aborts on graph-health or run-reliability storms.

- circuit-breaker: opt-in service-health classification (5xx, transport, timeout); content failures, rate limits, and unknowns no longer trip.
- resume: schema-wipe loop gated on `not config.resume`. Repaired the latent _try_resume materialize call to pass source_schema so resume actually engages without Neo4j MERGE rejecting null source_schema.
- diagnostics: SemanticEngine carries an opt-in LLMAttempt buffer populated around every invoke (stage_a, per-batch stage_b including failed batches, stage_c). LLMStageError gains llm_attempts; new StageBFailureError(LLMStageError) carries stage_a/stage_b/metrics so the failure-artifact writer reads off the exception, not the engine.
- failure-diagnostics: dump_table_failure_artifact writes *__failure.json with prompts, prompt hashes, raw responses, step errors, unresolved columns, retry/split/rescue counters, failure_classification (service_health|rate_limit|content_failure|semantic_coverage|circuit_open|unknown), and metadata_tier. Best-effort: artifact write errors log WARN and never block the build.
- metadata-availability-policy: pure source-agnostic classifier (rich|sparse|name_only) on L1 evidence shape only. determine_b_status becomes tier-keyed; rich keeps the 0.75 floor, sparse/name_only admit B_PARTIAL at 0.60 (configurable). B_PARTIAL counts as succeeded for the graph-health budget; per-tier counts on the report.
- run-quality-budget: stateless check post-_collect_results, two triggers (stage_b_failure_rate over graph-contributing denominator; run_non_contributing_rate over run-marginal denominator). Single QualityBudgetExceeded with trigger discriminator; stable ordering. CLI gains --no-quality-budget; exit code 7 distinguishes budget abort from other failures.

Smoke-tested against cbioportal_gbm_tcga_pan_can_atlas_2018 with --resume: 10 tables skipped via resume, mutation committed B_PARTIAL at raw=0.67 (sparse tier), resource_definition failed below partial floor, complete failure artifact written with stage_a output, six captured llm_attempts, prompt hashes, and unresolved-column tiers.

Tests: 1351 unit pass; mypy clean; coverage 89%.

Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
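
To make the tier-keyed admission floor concrete, here is a small sketch using the two defaults quoted in the commit message (0.75 for rich, 0.60 for sparse and name_only). It is not the body of determine_b_status: `partial_floor` and `admits_partial` are hypothetical helpers, and treating the floor as inclusive (`>=`) is an assumption.

```python
from typing import Literal

MetadataTier = Literal["rich", "sparse", "name_only"]

RICH_FLOOR = 0.75     # rich tables keep the existing floor
LOWERED_FLOOR = 0.60  # default for sparse / name_only; configurable per the PR


def partial_floor(tier: MetadataTier, lowered_floor: float = LOWERED_FLOOR) -> float:
    """Pick the B_PARTIAL admission floor for a metadata tier."""
    return RICH_FLOOR if tier == "rich" else lowered_floor


def admits_partial(
    raw_coverage: float,
    tier: MetadataTier,
    lowered_floor: float = LOWERED_FLOOR,
) -> bool:
    """True when coverage is high enough to commit the table as B_PARTIAL.

    Assumes an inclusive floor; the real comparison may differ.
    """
    return raw_coverage >= partial_floor(tier, lowered_floor)


if __name__ == "__main__":
    # The smoke-test table: raw coverage 0.67 on the sparse tier.
    print(admits_partial(0.67, "sparse"))  # admitted under the lowered floor
    print(admits_partial(0.67, "rich"))    # the same coverage fails the rich floor
```

The second call shows why the tier matters: the same 0.67 coverage that is admitted for a sparse table would fall below the floor for a rich one.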

deanban merged commit 335ed95 into main May 6, 2026
3 checks passed

deanban deleted the dean/feat/stage-b-failure-containment branch May 6, 2026 01:15

deanban merged 1 commit into main from dean/feat/stage-b-failure-containment
