feat(observability): Layer-2 on_error hooks, Layer-3 status() introspection #32
Merged
Builds on the Layer-1 structured exception hierarchy (PR #31). Three deliverables in this commit (items 1-3), plus test updates and source wiring:

1) nullrun.observability package
- error_hooks.py: global hook registry with thread-safe register / unregister / dispatch. Multiple hooks fire in registration order. Hook exceptions are caught and logged at DEBUG, so a misbehaving hook cannot break the SDK. The has_hooks() short-circuit keeps the hot path zero-cost when nothing is registered.
- status.py: NullRunStatus dataclass (frozen) + RecentError ring buffer (capacity 10) + WorkflowState enum. State derivation covers four headline buckets: ok / degraded / offline / misconfigured. Per-instance state queries never mutate the runtime.
- observability.py is renamed into the package (__init__.py keeps the previous public surface).

2) nullrun public API additions
- on_error(hook): Layer 2 entry point. Documented as 'give the user a chance' to observe every structured failure before it propagates. Skipped for WorkflowKilledInterrupt (a BaseException subclass; kill is a signal, not an error).
- status(): Layer 3 entry point. Returns a frozen NullRunStatus snapshot. Raises NullRunConfigError (NR-C004) if no runtime has been init()'d. Never lazily creates a runtime as a side effect (pinned by test_status_never_lazily_creates_runtime).
- Both are added to __all__ so they appear in dir(nullrun) for discoverability.

3) Docs: docs/errors/
- 15 per-code pages (NR-A001..A003, B001..B005, C001/C003, L001, R001, T001, W002/W003) plus README index. Each page documents the error_code, the trigger conditions, the user_action, and the retryable hint.
- docs/integration-baseline-2026-06-19.md: pinned baseline for the next integration run.

4) Test updates
- test_error_hooks.py: registry + dispatch + bypass tests (killed interrupt does not fire; one bad hook does not prevent later hooks; unregister is idempotent).
- test_status.py: no-runtime / with-runtime / state derivation / recent-errors ring buffer.
- test_integration_contract.py: track_event setdefault race pinned against the locked helper.
- test_dead_code_removed.py::test_dir_size_unchanged: now keys off nullrun.__all__ (the source of truth for the curated surface) so the curated-surface contract is pinned without hardcoding the symbol count.

5) Source wiring
- runtime.py: _emit_sdk_error / _emit_for_transport_error wire the new error_hooks.emit_error into the two SDK failure paths. The status() builder reads runtime state and feeds the recent-errors ring buffer.
- transport.py: failed batches emit NullRunBackendError (retryable=True) through the new path so retries surface the correlation_id in the ErrorContext.
- decorators.py: @protect catches the structured NullRunBlockedException family and emits with stage='tool' so a hook can attribute the failure to the right gate.

Verified locally on Windows / Python 3.14.2:
- pytest: 926 passed, 13 skipped
- ruff check: clean on src/ and tests/
- mypy src/: clean on 26 source files
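
The hook registry semantics described for error_hooks.py (registration order, exception isolation, idempotent unregister, has_hooks() short-circuit) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the SDK's actual code: the token-keyed dict, the logger name, and the free-form context argument are all hypothetical choices made for the example.

```python
import itertools
import logging
import threading
from typing import Any, Callable, Dict

logger = logging.getLogger("hook_sketch")  # hypothetical logger name

Hook = Callable[[BaseException, Any], None]

_lock = threading.Lock()
_tokens = itertools.count()
_hooks: Dict[int, Hook] = {}  # dicts keep insertion order, i.e. registration order


def register(hook: Hook) -> Callable[[], None]:
    """Add a hook and return a callable that removes it again."""
    with _lock:
        token = next(_tokens)
        _hooks[token] = hook

    def unregister() -> None:
        # Idempotent: a second call finds nothing to remove and does nothing.
        with _lock:
            _hooks.pop(token, None)

    return unregister


def has_hooks() -> bool:
    """Cheap check so callers can skip all dispatch work when nothing is registered."""
    return bool(_hooks)


def emit_error(exc: BaseException, context: Any = None) -> None:
    """Call every hook in registration order; a failing hook never propagates."""
    if not has_hooks():
        return
    with _lock:
        snapshot = list(_hooks.values())  # dispatch outside the lock
    for hook in snapshot:
        try:
            hook(exc, context)
        except Exception:
            logger.debug("error hook raised", exc_info=True)
```

Dispatching over a snapshot taken under the lock means a hook may unregister itself (or another hook) mid-dispatch without corrupting iteration. The WorkflowKilledInterrupt bypass is left out of this sketch.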
maltsev-dev added a commit that referenced this pull request, Jun 24, 2026:
Bump version 0.6.0 → 0.6.1. This release lands all three layers
of the 'give the user a chance' design on top of the 0.6.0 P0
hardening pass:
* Layer 1 — structured exception hierarchy. Every public SDK
exception inherits from NullRunError and carries
error_code / user_action / retryable / docs_url / cause.
Five new typed classes (NullRunConfigError, NullRunAuthError,
NullRunBackendError, NullRunBudgetError, NullRunToolBlockedError)
are subclasses of the existing user-facing classes, so every
'except' clause from 0.6.0 keeps matching.
* Layer 2 — nullrun.on_error() global error hook. Fires for
every structured NullRunError before the exception
propagates. Skipped for WorkflowKilledInterrupt (BaseException
subclass — kill is a signal, not an error). Multiple hooks
fire in registration order; hook exceptions are caught and
logged at DEBUG. has_hooks() short-circuit keeps the hot
path zero-cost when no hook is registered.
* Layer 3 — nullrun.status() introspection. Synchronous,
thread-safe, side-effect-free snapshot of runtime state.
Returns a frozen NullRunStatus dataclass with one of four
headline states (ok / degraded / offline / misconfigured).
Raises NullRunConfigError (NR-C004) if no runtime has been
init()'d — never lazily creates a runtime as a side effect.
Per-code docs in docs/errors/ (15 pages + README index).
New tests pin the hierarchy, the hook semantics, the snapshot
fields, and the recent-errors ring buffer.
TestPyPI: the previous 0.6.0 (uploaded 2026-06-23, before
#31 and #32 landed) is yanked separately so the new 0.6.1
wheel can be uploaded. The yank is a TestPyPI-side action;
it does not change the source tree.
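
The Layer 1 back-compat claim (new typed classes subclass existing user-facing classes, so old except clauses keep matching) can be illustrated with a small sketch. The constructor signature, the choice of NullRunBlockedException as the parent of NullRunToolBlockedError, and the NR-X000 code are assumptions made for the example; the release notes only name the five fields and the class names.

```python
from typing import Optional


class NullRunError(Exception):
    """Base class carrying the structured fields named in the release notes."""

    def __init__(
        self,
        message: str,
        *,
        error_code: str,
        user_action: str,
        retryable: bool = False,
        docs_url: Optional[str] = None,
        cause: Optional[BaseException] = None,
    ) -> None:
        super().__init__(message)
        self.error_code = error_code
        self.user_action = user_action
        self.retryable = retryable
        self.docs_url = docs_url
        self.cause = cause


class NullRunBlockedException(NullRunError):
    """Stand-in for a pre-existing user-facing class (parentage assumed)."""


class NullRunToolBlockedError(NullRunBlockedException):
    """New typed class; still caught by `except NullRunBlockedException`."""


class WorkflowKilledInterrupt(BaseException):
    """Kill signal: outside the Exception tree, so `except Exception` skips it."""


def call_tool() -> None:
    raise NullRunToolBlockedError(
        "tool call denied by policy",
        error_code="NR-X000",  # placeholder code, not a real NR-* code
        user_action="Review the tool policy for this workflow.",
    )


def old_style_caller() -> str:
    # A 0.6.0-era except clause that names only the older class.
    try:
        call_tool()
    except NullRunBlockedException as exc:
        return exc.error_code
    return "no error"
```

Because the new class sits below the old one in the hierarchy, callers can migrate to the narrower type at their own pace.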
Summary
Builds on the Layer-1 structured exception hierarchy from #31. Three deliverables: a nullrun.observability package, two new nullrun.* public API entry points, and per-error-code docs.

What lands in this PR
1. nullrun.observability package (was observability.py)
- __init__.py: re-exports the names of the old single-file surface.
- error_hooks.py: global hook registry (thread-safe register / unregister / dispatch). The has_hooks() short-circuit keeps the hot path zero-cost when no hook is registered.
- status.py: NullRunStatus (frozen dataclass) + RecentError ring buffer (capacity 10) + WorkflowState enum. State derivation covers four headline buckets: ok / degraded / offline / misconfigured.

2. New public API on nullrun
- nullrun.on_error(hook): Layer 2 entry point. Fires for every structured NullRunError before the exception propagates, so the call stack is still live. Skipped for WorkflowKilledInterrupt (a BaseException subclass; kill is a signal, not an error). Returns an idempotent unregister callable.
- nullrun.status(): Layer 3 entry point. Synchronous, thread-safe, side-effect-free snapshot of runtime state. Raises NullRunConfigError (NR-C004) if no runtime has been init()'d. Never lazily creates a runtime as a side effect (pinned by test_status_never_lazily_creates_runtime).
- Both are added to __all__ so they show up in dir(nullrun) and tab-completion.
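
A rough sketch of how a frozen snapshot, a capacity-10 ring buffer, and a no-lazy-init status() might fit together. Everything here is illustrative: the class names (HeadlineState, StatusSnapshot, Runtime, ConfigError), the init() signature, and especially the state derivation rules are assumptions, since the PR names the four states but does not spell out how they are derived.

```python
import enum
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional, Tuple


class HeadlineState(enum.Enum):
    OK = "ok"
    DEGRADED = "degraded"
    OFFLINE = "offline"
    MISCONFIGURED = "misconfigured"


class ConfigError(Exception):
    """Stand-in for NullRunConfigError."""


@dataclass(frozen=True)
class RecentError:
    error_code: str
    message: str


@dataclass(frozen=True)
class StatusSnapshot:
    state: HeadlineState
    recent_errors: Tuple[RecentError, ...]


class Runtime:
    """Hypothetical runtime holding only what the snapshot needs."""

    def __init__(self, api_key: str, backend_reachable: bool = True) -> None:
        self.api_key = api_key
        self.backend_reachable = backend_reachable
        # deque(maxlen=10) drops the oldest entry once the buffer is full.
        self.recent: Deque[RecentError] = deque(maxlen=10)

    def record_error(self, error_code: str, message: str) -> None:
        self.recent.append(RecentError(error_code, message))


_runtime: Optional[Runtime] = None


def init(api_key: str, backend_reachable: bool = True) -> Runtime:
    global _runtime
    _runtime = Runtime(api_key, backend_reachable)
    return _runtime


def status() -> StatusSnapshot:
    """Read-only snapshot; raises instead of creating a runtime on demand."""
    rt = _runtime
    if rt is None:
        raise ConfigError("NR-C004: no runtime has been initialised")
    if not rt.api_key:  # assumed derivation rules, in priority order
        state = HeadlineState.MISCONFIGURED
    elif not rt.backend_reachable:
        state = HeadlineState.OFFLINE
    elif rt.recent:
        state = HeadlineState.DEGRADED
    else:
        state = HeadlineState.OK
    return StatusSnapshot(state=state, recent_errors=tuple(rt.recent))
```

Copying the deque into a tuple at snapshot time keeps the returned object immutable even if the runtime records more errors afterwards.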
3. Docs (docs/errors/)
15 per-code pages (NR-A001..A003, NR-B001..B005, NR-C001/C003, NR-L001, NR-R001, NR-T001, NR-W002/W003) plus a README.md index. Each page documents:
- the error_code
- the trigger conditions
- user_action (what to do next)
- the retryable hint

Plus docs/integration-baseline-2026-06-19.md, pinning the baseline for the next integration run.
4. Source wiring
- runtime.py: _emit_sdk_error / _emit_for_transport_error wire error_hooks.emit_error into the two SDK failure paths. The status() builder reads runtime state and feeds the recent-errors ring buffer.
- transport.py: failed batches emit NullRunBackendError (retryable=True) through the new path so the hook sees the correlation_id from the gateway in ErrorContext.
- decorators.py: @protect catches the structured NullRunBlockedException family and emits with stage='tool' so a hook can attribute the failure to the right gate.
- __init__.py: wires the new entry points, registers the curated symbols in __all__, and keeps the PEP 562 lazy export table backward-compatible.
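
The decorators.py wiring (catch the blocked-exception family, emit with stage='tool', then let the exception propagate) might look something like the sketch below. The names SketchContext, BlockedError, and register are hypothetical stand-ins, and the context carries only the two fields needed for the example; the real ErrorContext is not shown in this PR description.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class SketchContext:
    """Hypothetical, trimmed-down stand-in for ErrorContext."""

    stage: str
    function: str


class BlockedError(Exception):
    """Stand-in for the NullRunBlockedException family."""


_hooks: List[Callable[[BaseException, SketchContext], None]] = []


def register(hook: Callable[[BaseException, SketchContext], None]) -> None:
    _hooks.append(hook)


def emit_error(exc: BaseException, ctx: SketchContext) -> None:
    for hook in list(_hooks):
        try:
            hook(exc, ctx)
        except Exception:
            pass  # a failing hook must not mask the original error


def protect(func: Callable[..., Any]) -> Callable[..., Any]:
    """Report blocked calls to the hooks, then re-raise unchanged."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return func(*args, **kwargs)
        except BlockedError as exc:
            emit_error(exc, SketchContext(stage="tool", function=func.__name__))
            raise

    return wrapper


@protect
def delete_file(path: str) -> None:
    raise BlockedError("deleting %s is not allowed" % path)
```

The hook runs inside the except block, before the bare raise, which matches the stated goal of observing the failure while the call stack is still live.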
Tests
- tests/test_error_hooks.py: emit_error (fires with both args, swallows hook exceptions, one-bad-hook-isolated, unregister-mid-dispatch is safe), ErrorContext validation, the WorkflowKilledInterrupt and WorkflowKilledException bypass rules, and that the global nullrun.on_error shim is wired through.
- tests/test_status.py: no-runtime case raises NR-C004, with-runtime snapshot is frozen / equality-stable, key prefix is truncated to 10 chars, state derivation (ok / degraded / misconfigured), recent-errors ring buffer (capacity 10, fed by _emit_sdk_error).
- tests/test_integration_contract.py: track_event setdefault race pinned against the locked helper.
- tests/test_dead_code_removed.py: test_dir_size_unchanged rewritten to key off nullrun.__all__ (source of truth) instead of a hardcoded symbol count. The curated-surface contract is still pinned (no rogue globals leak into dir()), but legitimate additions to the curated surface no longer break the test.

Back-compat
- from nullrun.observability import X (the old single-file surface) keeps working: the package's __init__ re-exports the same names.
- Existing except clauses from #31 (feat(exceptions): Layer-1 structured exception hierarchy with NR-* error codes) still match.
- The nullrun/__init__.py lazy-export table (_LAZY_EXPORTS) keeps every previously-importable name reachable; the new on_error / status / exception classes are additions, not replacements.
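
For readers unfamiliar with PEP 562, a lazy-export table works through a module-level __getattr__ that resolves names on first access. The sketch below builds a throwaway module object to show the mechanism; the table contents, the make_lazy_module helper, and the module name are invented for illustration and do not reflect what _LAZY_EXPORTS actually contains.

```python
import importlib
import types
from typing import Any, Dict, Tuple

# Hypothetical table: public name -> (module to import, attribute to fetch).
_LAZY_EXPORTS: Dict[str, Tuple[str, str]] = {
    "OrderedDict": ("collections", "OrderedDict"),
    "dumps": ("json", "dumps"),
}


def make_lazy_module(name: str, exports: Dict[str, Tuple[str, str]]) -> types.ModuleType:
    """Build a module whose listed names are imported only on first access."""
    mod = types.ModuleType(name)
    mod.__all__ = sorted(exports)

    def __getattr__(attr: str) -> Any:
        # Called only when normal attribute lookup on the module fails (PEP 562).
        try:
            module_name, symbol = exports[attr]
        except KeyError:
            raise AttributeError(
                "module %r has no attribute %r" % (name, attr)
            ) from None
        value = getattr(importlib.import_module(module_name), symbol)
        setattr(mod, attr, value)  # cache so later lookups bypass __getattr__
        return value

    mod.__getattr__ = __getattr__
    return mod


pkg = make_lazy_module("sketch_pkg", _LAZY_EXPORTS)
```

Adding an entry to such a table never removes an old one, which is why new names can land without breaking any previously importable name.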
CI status (local verification on Windows / Python 3.14.2)
- pytest: 926 passed, 13 skipped
- ruff check src/ tests/: clean
- mypy src/: clean on 26 source files

Stacking
This PR targets feat/layer-1-exception-hierarchy (not master) so the diff is the observability work alone. #31 should merge first; this branch will rebase cleanly after that.

Diff
26 files changed, 2236 insertions(+), 39 deletions(-).