feat(observability): Layer-2 on_error hooks, Layer-3 status() introspection #32

Merged
maltsev-dev merged 1 commit into master from feat/observability-layer-2-3 on Jun 24, 2026
@maltsev-dev

Summary

Builds on the Layer-1 structured exception hierarchy from
#31. Three deliverables: a nullrun.observability package,
two new nullrun.* public API entry points, and per-error-code
docs.

What lands in this PR

1. nullrun.observability package (was observability.py)

| File | Responsibility |
| --- | --- |
| __init__.py | Public re-exports; previous single-file surface preserved. |
| error_hooks.py | Global hook registry: thread-safe register / unregister / dispatch. Multiple hooks fire in registration order. Hook exceptions are caught and logged at DEBUG. The has_hooks() short-circuit keeps the hot path zero-cost when no hook is registered. |
| status.py | NullRunStatus (frozen dataclass) + RecentError ring buffer (capacity 10) + WorkflowState enum. State derivation covers four headline buckets: ok / degraded / offline / misconfigured. |
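
The registry behaviour described for error_hooks.py can be sketched as below. This is a minimal stand-alone illustration, not the code in this PR: the names register, emit_error and has_hooks follow the description, while the (error, context) hook signature, the token-keyed dict and the logger name are assumptions made for the example.

```python
import itertools
import logging
import threading
from typing import Any, Callable

logger = logging.getLogger("error_hooks_sketch")  # hypothetical logger name

# Assumed hook shape: called with the error and a context object.
Hook = Callable[[BaseException, Any], None]

_lock = threading.Lock()
_tokens = itertools.count()
_hooks: dict[int, Hook] = {}  # dicts keep insertion order, so dispatch order is stable


def register(hook: Hook) -> Callable[[], None]:
    """Add a hook and return an idempotent unregister callable."""
    with _lock:
        token = next(_tokens)
        _hooks[token] = hook

    def unregister() -> None:
        with _lock:
            # pop() with a default makes a second call a no-op.
            _hooks.pop(token, None)

    return unregister


def has_hooks() -> bool:
    """Cheap check so callers can skip all dispatch work when idle."""
    return bool(_hooks)


def emit_error(error: BaseException, context: Any) -> None:
    """Call every hook in registration order; a failing hook is logged, not raised."""
    if not has_hooks():
        return
    with _lock:
        snapshot = list(_hooks.values())  # copy, then dispatch outside the lock
    for hook in snapshot:
        try:
            hook(error, context)
        except Exception:
            logger.debug("error hook %r raised", hook, exc_info=True)
```

Dispatching over a copy taken under the lock is one way to make unregister-during-dispatch safe: a hook that removes itself or another hook changes the registry, not the list currently being iterated.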

2. New public API on nullrun

  • nullrun.on_error(hook) — Layer 2 entry point. Fires for every
    structured NullRunError before the exception propagates so the
    call stack is still live. Skipped for WorkflowKilledInterrupt
    (BaseException subclass — kill is a signal, not an error). Returns
    an idempotent unregister callable.
  • nullrun.status() — Layer 3 entry point. Synchronous,
    thread-safe, side-effect-free snapshot of runtime state. Raises
    NullRunConfigError (NR-C004) if no runtime has been
    init()'d. Never lazily creates a runtime as a side effect (pinned
    by test_status_never_lazily_creates_runtime).
  • Both are added to __all__ so they show up in dir(nullrun) and
    tab-completion.
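
The status() contract above (frozen snapshot, no lazy runtime creation, a ten-entry recent-errors buffer) can be illustrated with a small stand-alone sketch. Everything here is a stand-in: ConfigError, State, Status and Runtime are hypothetical names rather than the SDK classes, and the rule that derives the state is invented purely to have something to return.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ConfigError(Exception):
    """Stand-in for NullRunConfigError; not the SDK class."""


class State(Enum):
    OK = "ok"
    DEGRADED = "degraded"


@dataclass(frozen=True)
class Status:
    state: State
    recent_errors: tuple[str, ...]


class Runtime:
    def __init__(self) -> None:
        # Ring buffer: appending an 11th entry silently drops the oldest one.
        self.recent_errors: deque[str] = deque(maxlen=10)


_runtime: Optional[Runtime] = None


def init() -> Runtime:
    global _runtime
    _runtime = Runtime()
    return _runtime


def status() -> Status:
    # Read the module-level slot directly; never create a runtime here.
    runtime = _runtime
    if runtime is None:
        raise ConfigError("NR-C004: call init() before status()")
    errors = tuple(runtime.recent_errors)  # copy so the snapshot cannot change later
    state = State.DEGRADED if errors else State.OK  # illustrative rule only
    return Status(state=state, recent_errors=errors)
```

Copying the deque into a tuple is what keeps the snapshot stable: later errors land in the runtime's buffer, not in a Status object that a caller already holds.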

3. Docs (docs/errors/)

15 per-code pages (NR-A001..A003, NR-B001..B005, NR-C001/C003,
NR-L001, NR-R001, NR-T001, NR-W002/W003) plus a README.md
index. Each page documents:

  • the trigger conditions
  • the user_action (what to do next)
  • the retryable hint
  • a small reproducer / fix snippet

Plus docs/integration-baseline-2026-06-19.md pinning the next
integration run baseline.

4. Source wiring

  • runtime.py: _emit_sdk_error / _emit_for_transport_error
    wire error_hooks.emit_error into the two SDK failure paths. The
    status() builder reads runtime state and feeds the recent-errors
    ring buffer.
  • transport.py: failed batches emit NullRunBackendError
    (retryable=True) through the new path so the hook sees the
    correlation_id from the gateway in ErrorContext.
  • decorators.py: @protect catches the structured
    NullRunBlockedException family and emits with stage='tool' so
    a hook can attribute the failure to the right gate.
  • __init__.py: wires the new entry points, registers the
    curated symbols in __all__, and keeps the PEP-562 lazy export
    table backward-compatible.
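
The emit-then-propagate pattern described for decorators.py might look roughly like the following. It is a sketch under assumptions, not the SDK source: BlockedError stands in for the NullRunBlockedException family, Context stands in for ErrorContext (only stage and correlation_id are taken from the description), and emit_error here just records its arguments.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable, Optional


class BlockedError(Exception):
    """Stand-in for the NullRunBlockedException family."""


@dataclass(frozen=True)
class Context:
    """Stand-in for ErrorContext; field names follow the PR text."""
    stage: str
    correlation_id: Optional[str] = None


seen: list[tuple[BaseException, Context]] = []


def emit_error(error: BaseException, context: Context) -> None:
    # Stand-in for the hook dispatcher; here it only records the call.
    seen.append((error, context))


def protect(func: Callable[..., Any]) -> Callable[..., Any]:
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return func(*args, **kwargs)
        except BlockedError as exc:
            # Emit first, while the failing call is still being unwound,
            # then re-raise the same exception object unchanged.
            emit_error(exc, Context(stage="tool"))
            raise

    return wrapper
```

The bare raise matters: the caller's except clauses see the original exception and traceback, so adding the hook call does not change what user code catches.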

Tests

| File | Pins |
| --- | --- |
| tests/test_error_hooks.py | Registry basics (register returns unregister, idempotent unregister), emit_error (fires with both args, swallows hook exceptions, one-bad-hook-isolated, unregister-mid-dispatch is safe), ErrorContext validation, the WorkflowKilledInterrupt and WorkflowKilledException bypass rules, and that the global nullrun.on_error shim is wired through. |
| tests/test_status.py | No-runtime raises NR-C004, with-runtime snapshot is frozen / equality-stable, key prefix is truncated to 10 chars, state derivation (ok / degraded / misconfigured), recent-errors ring buffer (capacity 10, fed by _emit_sdk_error). |
| tests/test_integration_contract.py | track_event setdefault race pinned against the locked helper. |
| tests/test_dead_code_removed.py | test_dir_size_unchanged rewritten to key off nullrun.__all__ (source of truth) instead of a hardcoded symbol count. The curated-surface contract is still pinned (no rogue globals leak into dir()), but legitimate additions to the curated surface no longer break the test. |
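
The idea behind keying the surface test off __all__ can be shown with a short helper. This is an illustration of the approach only; rogue_public_names is a hypothetical name and the real test may be structured differently.

```python
import types


def rogue_public_names(module: types.ModuleType) -> set[str]:
    """Return public names visible in dir(module) that __all__ does not list."""
    curated = set(getattr(module, "__all__", ()))
    visible = {name for name in dir(module) if not name.startswith("_")}
    return visible - curated


# Usage against a throwaway module object (not the real package).
demo = types.ModuleType("demo")
demo.__all__ = ["on_error", "status"]
demo.on_error = lambda hook: None
demo.status = lambda: None
assert rogue_public_names(demo) == set()

# A stray global is reported; extending __all__ makes it legitimate again.
demo.helper = 1
assert rogue_public_names(demo) == {"helper"}
demo.__all__.append("helper")
assert rogue_public_names(demo) == set()
```

Compared with asserting a fixed count, a set difference fails only on names that were not declared, and the failure message names the offender.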

Back-compat

  • from nullrun.observability import X (the old single-file surface)
    keeps working — the package's __init__ re-exports the same names.
  • All Layer-1 except clauses from #31 (feat(exceptions): Layer-1 structured exception hierarchy with NR-* error codes) still match.
  • The nullrun/__init__.py lazy-export table (_LAZY_EXPORTS) keeps
    every previously-importable name reachable — the new
    on_error / status / exception classes are additions, not
    replacements.

CI status (local verification on Windows / Python 3.14.2)

| Step | Result |
| --- | --- |
| pytest | 926 passed, 13 skipped (0:09:52) |
| ruff check src/ tests/ | All checks passed |
| mypy src/ | No issues found in 26 source files |

Stacking

This PR targets feat/layer-1-exception-hierarchy (not master) so
the diff is the observability work alone. #31 should merge first;
this branch will rebase cleanly after that.

Diff

26 files changed, 2236 insertions(+), 39 deletions(-).

Base automatically changed from feat/layer-1-exception-hierarchy to master June 24, 2026 09:29
…ection

Builds on the Layer-1 structured exception hierarchy (PR #31).
Three deliverables in this commit:

1) nullrun.observability package
   - error_hooks.py: global hook registry with thread-safe
     register / unregister / dispatch. Multiple hooks fire in
     registration order. Hook exceptions are caught and logged
     at DEBUG — a misbehaving hook cannot break the SDK.
     has_hooks() short-circuit keeps the hot path zero-cost
     when nothing is registered.
   - status.py: NullRunStatus dataclass (frozen) + RecentError
     ring buffer (capacity 10) + WorkflowState enum. State
     derivation covers four headline buckets: ok / degraded /
     offline / misconfigured. Per-instance state queries never
     mutate the runtime.
   - observability.py is renamed into the package (__init__.py
     keeps the previous public surface).

2) nullrun public API additions
   - on_error(hook) — Layer 2 entry point. Documented as
     'give the user a chance' to observe every structured
     failure before it propagates. Skipped for
     WorkflowKilledInterrupt (BaseException subclass) — kill
     is a signal, not an error.
   - status() — Layer 3 entry point. Returns a frozen
     NullRunStatus snapshot. Raises NullRunConfigError (NR-C004)
     if no runtime has been init()'d. Never lazily creates a
     runtime as a side effect (pinned by
     test_status_never_lazily_creates_runtime).
   - Both are added to __all__ so they appear in dir(nullrun)
     for discoverability.

3) Docs: docs/errors/
   - 15 per-code pages (NR-A001..A003, B001..B005, C001/C003,
     L001, R001, T001, W002/W003) plus README index. Each page
     documents the error_code, the trigger conditions, the
     user_action, and the retryable hint.
   - docs/integration-baseline-2026-06-19.md — pinned baseline
     for the next integration run.

4) Test updates
   - test_error_hooks.py — registry + dispatch + bypass tests
     (killed interrupt does not fire; one bad hook does not
     prevent later hooks; unregister is idempotent).
   - test_status.py — no-runtime / with-runtime / state
     derivation / recent-errors ring buffer.
   - test_integration_contract.py — track_event setdefault
     race pinned against the locked helper.
   - test_dead_code_removed.py::test_dir_size_unchanged —
     now keys off nullrun.__all__ (the source of truth for the
     curated surface) so the curated-surface contract is
     pinned without hardcoding the symbol count.

5) Source wiring
   - runtime.py — _emit_sdk_error / _emit_for_transport_error
     wire the new error_hooks.emit_error into the two SDK
     failure paths. status() builder reads runtime state and
     feeds the recent-errors ring buffer.
   - transport.py — failed batches emit
     NullRunBackendError (retryable=True) through the new path
     so retries surface the correlation_id in the
     ErrorContext.
   - decorators.py — @protect catches the structured
     NullRunBlockedException family and emits with stage='tool'
     so a hook can attribute the failure to the right gate.

Verified locally on Windows / Python 3.14.2:
  pytest        926 passed, 13 skipped
  ruff check    clean on src/ and tests/
  mypy src/     clean on 26 source files
@maltsev-dev force-pushed the feat/observability-layer-2-3 branch from 5617ab1 to 001b6c9 on June 24, 2026 09:31
@codecov

codecov Bot commented Jun 24, 2026

Codecov Report

❌ Patch coverage is 79.35943% with 58 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/nullrun/runtime.py | 64.76% | 24 Missing and 13 partials ⚠️ |
| src/nullrun/observability/status.py | 80.24% | 7 Missing and 9 partials ⚠️ |
| src/nullrun/transport.py | 64.28% | 4 Missing and 1 partial ⚠️ |


@maltsev-dev merged commit f9cc345 into master on Jun 24, 2026
5 checks passed
@maltsev-dev deleted the feat/observability-layer-2-3 branch on June 24, 2026 09:55
maltsev-dev added a commit that referenced this pull request Jun 24, 2026
Bump version 0.6.0 → 0.6.1. This release lands all three layers
of the 'give the user a chance' design on top of the 0.6.0 P0
hardening pass:

  * Layer 1 — structured exception hierarchy. Every public SDK
    exception inherits from NullRunError and carries
    error_code / user_action / retryable / docs_url / cause.
    Five new typed classes (NullRunConfigError, NullRunAuthError,
    NullRunBackendError, NullRunBudgetError, NullRunToolBlockedError)
    are subclasses of the existing user-facing classes, so every
    'except' clause from 0.6.0 keeps matching.

  * Layer 2 — nullrun.on_error() global error hook. Fires for
    every structured NullRunError before the exception
    propagates. Skipped for WorkflowKilledInterrupt (BaseException
    subclass — kill is a signal, not an error). Multiple hooks
    fire in registration order; hook exceptions are caught and
    logged at DEBUG. has_hooks() short-circuit keeps the hot
    path zero-cost when no hook is registered.

  * Layer 3 — nullrun.status() introspection. Synchronous,
    thread-safe, side-effect-free snapshot of runtime state.
    Returns a frozen NullRunStatus dataclass with one of four
    headline states (ok / degraded / offline / misconfigured).
    Raises NullRunConfigError (NR-C004) if no runtime has been
    init()'d — never lazily creates a runtime as a side effect.

Per-code docs in docs/errors/ (15 pages + README index).
New tests pin the hierarchy, the hook semantics, the snapshot
fields, and the recent-errors ring buffer.

TestPyPI: the previous 0.6.0 (uploaded 2026-06-23, before
#31 and #32 landed) is yanked separately so the new 0.6.1
wheel can be uploaded. The yank is a TestPyPI-side action;
it does not change the source tree.