
feat(build): stage-b failure containment#96

Merged
deanban merged 1 commit into main from dean/feat/stage-b-failure-containment
May 6, 2026

@deanban deanban commented May 6, 2026

Summary

Stage-B failure containment work: six numbered sections (plus §5b) that change how the build pipeline classifies, recovers from, and persists evidence about failures during semantic interpretation:

  • Circuit breaker — service-health-only classification (§1). The breaker only trips on transport / 5xx failures; rate limits, JSON parse errors, Pydantic validation errors, and universal-parser ValueErrors no longer trigger cascade-skips of healthy tables.
  • Resume gating (§2). --resume now preserves prior assertions instead of running the schema-wipe loop. Verified on the populated graph: 37/37 tables marked skipped, assertions 6962→6962.
  • Engine-local LLM-attempt buffer (§3). Opt-in (gated on eval_dump_dir), reset per table, populated for both successful and failed batch invocations.
  • StageBFailureError carries staged context (§4). New typed exception subclass of LLMStageError carrying stage_a, stage_b, metrics, and llm_attempts so the failure-artifact writer reads everything off the exception — no engine reference required from the worker.
  • Forensic artifact persistence (§5). <table>__<label>__failure.json written on every failure path (Stage A failure, B_FAILED, circuit-open). Captures every LLMClient.invoke call across stages including failed batches; classification taxonomy aligned with the breaker.
  • Metadata-availability policy + B_PARTIAL outcome (§5b). New metadata_tier classifier (rich / sparse / name_only) computed before Stage A so the tier is on every failure artifact. Tier-keyed B_PARTIAL admission floor — rich tables stay at 0.75; sparse / name_only get a lowered floor (default 0.60) so partial commits are admitted instead of failing a coverage-floor check the source can't meet.
  • Run-level quality budget (§6). Two triggers: stage_b_failure_rate (default 30%) and run_non_contributing_rate (default 40%). Resume-skipped tables count toward the graph-contributing denominator but not the non-contributing trigger. --no-quality-budget disables both ceilings; QualityBudgetExceeded exits with code 7.
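
To make the §1 behavior concrete, here is a minimal sketch of a service-health-only breaker. It is an illustration under assumptions, not the code in this PR: `HTTPStatusError`, `classify_failure`, `CircuitBreaker`, the consecutive-failure threshold, and the use of HTTP 429 to detect rate limits are all hypothetical. Only the classification labels come from the PR text.

```python
from enum import Enum


class FailureClass(str, Enum):
    # Labels taken from the PR's failure_classification taxonomy.
    SERVICE_HEALTH = "service_health"
    RATE_LIMIT = "rate_limit"
    CONTENT_FAILURE = "content_failure"
    UNKNOWN = "unknown"


class HTTPStatusError(Exception):
    """Hypothetical stand-in for a provider error exposing a status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def classify_failure(exc: BaseException) -> FailureClass:
    """Map an exception to a failure class (hypothetical helper)."""
    if isinstance(exc, HTTPStatusError):
        if exc.status_code == 429:  # assumption: rate limits surface as 429
            return FailureClass.RATE_LIMIT
        if 500 <= exc.status_code < 600:
            return FailureClass.SERVICE_HEALTH
        return FailureClass.UNKNOWN
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return FailureClass.SERVICE_HEALTH  # transport-level failures
    if isinstance(exc, ValueError):
        # json.JSONDecodeError is a ValueError subclass, so parse errors land here.
        return FailureClass.CONTENT_FAILURE
    return FailureClass.UNKNOWN


class CircuitBreaker:
    """Opens only after `threshold` consecutive service-health failures."""

    def __init__(self, threshold: int = 3) -> None:  # threshold is an assumption
        self.threshold = threshold
        self.consecutive = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive >= self.threshold

    def record_success(self) -> None:
        self.consecutive = 0

    def record_failure(self, exc: BaseException) -> None:
        # Rate limits, content failures, and unknowns never move the counter.
        if classify_failure(exc) is FailureClass.SERVICE_HEALTH:
            self.consecutive += 1


breaker = CircuitBreaker(threshold=2)
breaker.record_failure(HTTPStatusError(429))       # rate limit: ignored
breaker.record_failure(ValueError("bad JSON"))     # content failure: ignored
breaker.record_failure(HTTPStatusError(503))       # counts toward the breaker
breaker.record_failure(TimeoutError("timed out"))  # counts; breaker opens
print(breaker.is_open)  # True
```

The point of the design is that a table whose LLM output fails to parse says nothing about whether the service is healthy, so it should not cause later tables to be skipped.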

Verification

  • uv run pytest -q — 1351 passed / 1 skipped / 38 deselected
  • uv run mypy src/sema/ — clean across 112 files
  • Coverage 89% (quality_budget.py 100%) — gate ≥85%
  • Live resume smoke against the populated graph (.runs/build_resume_smoke_20260505_163145.log): 37/37 skipped, no circuit-breaker skips on healthy LLM, no quality-budget abort, assertions preserved
  • Deliberate-failure smoke for the forensic artifact: step_errors[].exception_type discriminates JSONDecodeError vs ValidationError within the content_failure classification; llm_attempts carries failed-batch prompt text; both StageBFailureError (→ semantic_coverage) and plain LLMStageError (→ content_failure) paths verified
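
The two exception paths verified above can be sketched as follows. This is a hedged illustration: the class names `LLMStageError` and `StageBFailureError` and the attributes `stage_a`, `stage_b`, `metrics`, and `llm_attempts` come from the PR text, but the constructor signatures, the `LLMAttempt` fields, the `build_failure_artifact` helper, and the artifact key layout are assumptions made for the example.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class LLMAttempt:
    # Hypothetical record shape for one captured invoke call.
    stage: str
    prompt: str
    raw_response: str | None = None
    error: str | None = None


class LLMStageError(Exception):
    def __init__(self, message: str,
                 llm_attempts: list[LLMAttempt] | None = None) -> None:
        super().__init__(message)
        self.llm_attempts = list(llm_attempts or [])


class StageBFailureError(LLMStageError):
    """Carries staged context so the artifact writer needs no engine reference."""

    def __init__(self, message: str, *, stage_a: Any, stage_b: Any,
                 metrics: dict[str, Any],
                 llm_attempts: list[LLMAttempt] | None = None) -> None:
        super().__init__(message, llm_attempts)
        self.stage_a = stage_a
        self.stage_b = stage_b
        self.metrics = metrics


def build_failure_artifact(table: str, exc: LLMStageError,
                           metadata_tier: str) -> dict[str, Any]:
    """Read everything off the exception (hypothetical helper and key names)."""
    is_stage_b = isinstance(exc, StageBFailureError)
    return {
        "table": table,
        "metadata_tier": metadata_tier,
        # Mapping stated in the PR: StageBFailureError -> semantic_coverage,
        # plain LLMStageError -> content_failure.
        "failure_classification":
            "semantic_coverage" if is_stage_b else "content_failure",
        "exception_type": type(exc).__name__,
        "stage_a": getattr(exc, "stage_a", None),
        "stage_b": getattr(exc, "stage_b", None),
        "metrics": getattr(exc, "metrics", None),
        "llm_attempts": [asdict(a) for a in exc.llm_attempts],
    }


attempt = LLMAttempt(stage="stage_b", prompt="classify columns ...",
                     error="JSONDecodeError")
exc = StageBFailureError("coverage below floor", stage_a={"entity": "sample"},
                         stage_b=None, metrics={"retries": 1},
                         llm_attempts=[attempt])
print(json.dumps(build_failure_artifact("mutation", exc, "sparse"), indent=2))
```

Because the failed-batch attempts travel on the exception, a worker that only catches the error can still persist the prompts that led to it.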

Related issues

Follow-ups surfaced by this PR

Test plan

  • uv run pytest -q green locally
  • uv run mypy src/sema/ clean locally
  • Coverage ≥85% locally
  • Resume smoke (assertion preservation, no cascade skips, no premature quality-budget abort)
  • Deliberate-failure smoke (artifact completeness + step_errors sub-type discrimination)
  • CI green on this PR

Restricts the service circuit breaker to service-health failures only,
gates the resume wipe correctly, persists forensic artifacts on every
failed table, classifies metadata-availability tiers with a B_PARTIAL
admission floor for sparse/name_only tables, and adds a run-level
quality budget that aborts on graph-health or run-reliability storms.

- circuit-breaker: opt-in service-health classification (5xx, transport,
  timeout); content failures, rate limits, and unknowns no longer trip.
- resume: schema-wipe loop gated on `not config.resume`. Repaired the
  latent _try_resume materialize call to pass source_schema so resume
  actually engages without Neo4j MERGE rejecting null source_schema.
- diagnostics: SemanticEngine carries an opt-in LLMAttempt buffer
  populated around every invoke (stage_a, per-batch stage_b including
  failed batches, stage_c). LLMStageError gains llm_attempts; new
  StageBFailureError(LLMStageError) carries stage_a/stage_b/metrics so
  the failure-artifact writer reads off the exception, not the engine.
- failure-diagnostics: dump_table_failure_artifact writes
  *__failure.json with prompts, prompt hashes, raw responses, step
  errors, unresolved columns, retry/split/rescue counters,
  failure_classification (service_health|rate_limit|content_failure|
  semantic_coverage|circuit_open|unknown), and metadata_tier. Best-
  effort: artifact write errors log WARN and never block the build.
- metadata-availability-policy: pure source-agnostic classifier
  (rich|sparse|name_only) on L1 evidence shape only. determine_b_status
  becomes tier-keyed; rich keeps the 0.75 floor, sparse/name_only admit
  B_PARTIAL at 0.60 (configurable). B_PARTIAL counts as succeeded for
  the graph-health budget; per-tier counts on the report.
- run-quality-budget: stateless check post-_collect_results, two
  triggers (stage_b_failure_rate over graph-contributing denominator;
  run_non_contributing_rate over run-marginal denominator). Single
  QualityBudgetExceeded with trigger discriminator; stable ordering.
  CLI gains --no-quality-budget; exit code 7 distinguishes budget
  abort from other failures.
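
A minimal sketch of the run-quality-budget check described in the last item, under stated assumptions. The trigger names, the default ceilings (30% and 40%), the `QualityBudgetExceeded` name, and exit code 7 come from this PR. The `RunCounts` fields, the exact composition of each denominator, the strict `>` comparison, and the order in which triggers are checked are guesses made for illustration.

```python
from dataclasses import dataclass

EXIT_QUALITY_BUDGET = 7  # exit code named in the PR


class QualityBudgetExceeded(Exception):
    def __init__(self, trigger: str, rate: float, ceiling: float) -> None:
        super().__init__(f"{trigger}: {rate:.0%} exceeds ceiling {ceiling:.0%}")
        self.trigger = trigger  # discriminator for the caller
        self.rate = rate
        self.ceiling = ceiling


@dataclass(frozen=True)
class RunCounts:
    # Hypothetical tally; B_PARTIAL tables are counted under `succeeded`.
    succeeded: int
    resume_skipped: int
    stage_b_failed: int
    other_failed: int


def check_quality_budget(counts: RunCounts, *,
                         stage_b_ceiling: float = 0.30,
                         non_contributing_ceiling: float = 0.40,
                         enabled: bool = True) -> None:
    """Stateless check; raises on the first trigger that exceeds its ceiling."""
    if not enabled:  # corresponds to --no-quality-budget
        return
    # Assumption: resume-skipped tables count toward graph health only.
    graph_den = counts.succeeded + counts.resume_skipped + counts.stage_b_failed
    # Assumption: the run-marginal denominator is tables attempted this run.
    run_den = counts.succeeded + counts.stage_b_failed + counts.other_failed
    non_contributing = counts.stage_b_failed + counts.other_failed
    checks = [  # fixed order keeps the reported trigger stable
        ("stage_b_failure_rate", counts.stage_b_failed, graph_den,
         stage_b_ceiling),
        ("run_non_contributing_rate", non_contributing, run_den,
         non_contributing_ceiling),
    ]
    for trigger, numerator, denominator, ceiling in checks:
        if denominator and numerator / denominator > ceiling:
            raise QualityBudgetExceeded(trigger, numerator / denominator, ceiling)


def exit_code_for(counts: RunCounts) -> int:
    """Map the check onto a process exit code (hypothetical wrapper)."""
    try:
        check_quality_budget(counts)
    except QualityBudgetExceeded:
        return EXIT_QUALITY_BUDGET
    return 0
```

Under these assumptions a fully resumed run (every table skipped) has a zero run-marginal denominator and cannot trip the second trigger, which is consistent with the resume smoke reporting no quality-budget abort.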

Smoke-tested against cbioportal_gbm_tcga_pan_can_atlas_2018 with
--resume: 10 tables skipped via resume, mutation committed B_PARTIAL
at raw=0.67 (sparse tier), resource_definition failed below partial
floor, complete failure artifact written with stage_a output, six
captured llm_attempts, prompt hashes, and unresolved-column tiers.
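
The `mutation` outcome above follows from the tier-keyed floor. The sketch below reproduces only that admission decision. The tier names and the 0.75 and 0.60 floors come from this commit message; the function names `partial_floor` and `admits_partial` and the inclusive `>=` comparison are assumptions, and the sketch does not model whatever distinguishes a full success from B_PARTIAL.

```python
RICH_FLOOR = 0.75          # floor for rich tables, per the commit message
LOW_METADATA_FLOOR = 0.60  # default floor for sparse / name_only tables


def partial_floor(tier: str,
                  low_metadata_floor: float = LOW_METADATA_FLOOR) -> float:
    """Return the B_PARTIAL admission floor for a metadata tier."""
    if tier == "rich":
        return RICH_FLOOR
    if tier in ("sparse", "name_only"):
        return low_metadata_floor
    raise ValueError(f"unknown metadata tier: {tier!r}")


def admits_partial(tier: str, raw_coverage: float) -> bool:
    """True when coverage meets the tier's floor (inclusive, by assumption)."""
    return raw_coverage >= partial_floor(tier)


print(admits_partial("sparse", 0.67))  # True: 0.67 clears the 0.60 floor
print(admits_partial("rich", 0.67))    # False: 0.67 is below the 0.75 floor
```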

Tests: 1351 unit pass; mypy clean; coverage 89%.
Signed-off-by: deanban <3989225+deanban@users.noreply.github.com>
@deanban deanban merged commit 335ed95 into main May 6, 2026
3 checks passed
@deanban deanban deleted the dean/feat/stage-b-failure-containment branch May 6, 2026 01:15
