fix(ws): verify HMAC on signed_payload, dispatch from trusted source#2

Merged
maltsev-dev merged 3 commits into master from fix/ws-byte-mismatch-verify-signed-payload
Jun 18, 2026
@maltsev-dev

Problem

The byte-mismatch bug: server SignedWsMessage::new signed
serde_json::to_string(&message) (the inner WsMessage) but SDK
verify_hmac_signature hashed the full wire bytes including
signature/timestamp/api_key_id. Bytes never matched -> all
HMAC-signed WS messages were silently dropped at
transport_websocket.py:313.

Net effect: the entire control plane (KILL/PAUSE via dashboard)
was silently down for every Phase 139+ key.
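
The mismatch can be reproduced in a few lines. This is only a sketch: the secret, the field values, and the envelope layout are illustrative assumptions, not the actual server or SDK code.

```python
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # hypothetical key, for illustration only

# What the server signs: the serialized inner message only.
inner = json.dumps({"type": "state_change", "state": "Killed"}).encode()
signature = hmac.new(SECRET, inner, hashlib.sha256).hexdigest()

# What travels on the wire: inner fields flattened next to the envelope fields.
wire = json.dumps({
    "type": "state_change",
    "state": "Killed",
    "signature": signature,
    "timestamp": 1750000000,
    "api_key_id": "key_123",
}).encode()

# Pre-fix SDK behaviour: hash the full wire bytes and compare.
recomputed = hmac.new(SECRET, wire, hashlib.sha256).hexdigest()
signatures_match = hmac.compare_digest(signature, recomputed)
```

Because the two digests are computed over different byte strings, `signatures_match` is false for every message, which is why nothing was ever dispatched.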

A second security issue lurks behind the obvious fix: the
envelope uses #[serde(flatten)] for the inner WsMessage, so
data["state"], data["workflow_id"], etc. are accessible
both on the outer body and inside signed_payload. If the
SDK had simply fixed the verify path to use bytes.fromhex(signed_payload)
and kept dispatching from the outer body, an attacker who
captured a benign state="Normal" message could splice a
forged state="Killed" into the outer body, the signature
would still verify (it's over the inner bytes), and the
dispatcher would happily raise WorkflowKilledInterrupt. The
unit test test_replayed_signed_payload_with_spliced_body_is_rejected
catches exactly this; the fix is to dispatch from the trusted
(parsed signed_payload) source, never the outer body, for any
state-change decision.

Fix

Three changes in src/nullrun/transport_websocket.py:

  1. _handle_message: when the message carries a signed_payload
    field, parse it once into a trusted dict and verify the
    signature against bytes.fromhex(signed_payload). The
    pre-FIX-C legacy fallback (verify against full wire bytes)
    is kept only for servers that do not yet ship the
    signed_payload field, so this commit is backwards
    compatible with the historical message shape.

  2. _handle_state_change_with_ack: takes the new trusted
    parameter; when present, reads state / workflow_id /
    version / message_id from trusted, not from the
    outer body. The signature over the inner bytes now binds
    those values semantically — a captured pair cannot be
    re-targeted at a different workflow state.

  3. ACKNOWLEDGED_STATES flipped from {"killed", "paused"} to
    {"Killed", "Paused"} (PascalCase). The server's
    WsWorkflowState enum emits PascalCase, so the lowercase
    set was dead — the ACK path was always no-op, and the
    server's pending_acks queue grew without ever draining.
    See commit 73f3197 for the standalone S-2 commit (kept
    separate so the security fix is reviewable on its own).
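
A minimal sketch of the verify-then-dispatch-from-trusted idea follows. The helper names (`sign`, `trusted_fields`), the secret, and the envelope layout are assumptions made for illustration; the sketch also treats a missing `signed_payload` as unverified instead of implementing the legacy fallback.

```python
import hashlib
import hmac
import json
from typing import Optional

SECRET = b"demo-secret"  # hypothetical shared key


def sign(inner: dict) -> dict:
    """Build an envelope with the assumed layout: flattened body + signed bytes."""
    payload = json.dumps(inner).encode()
    return {
        **inner,  # flattened copy: untrusted on the receiving side
        "signed_payload": payload.hex(),
        "signature": hmac.new(SECRET, payload, hashlib.sha256).hexdigest(),
    }


def trusted_fields(envelope: dict) -> Optional[dict]:
    """Return the parsed signed_payload if the signature verifies, else None."""
    try:
        payload = bytes.fromhex(envelope["signed_payload"])
    except (KeyError, TypeError, ValueError):
        return None  # missing or non-hex field: unverified in this sketch
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    received = str(envelope.get("signature", "")).encode()
    if not hmac.compare_digest(expected, received):
        return None
    return json.loads(payload)


captured = sign({"state": "Normal", "workflow_id": "wf-1"})
spliced = {**captured, "state": "Killed"}  # attacker edits the outer body only
trusted = trusted_fields(spliced)          # dispatcher must read from this dict
```

Here `spliced["state"]` is `"Killed"` while `trusted["state"]` is still `"Normal"`, so a dispatcher that reads only from `trusted` cannot be escalated by editing the outer body.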

Tests

tests/test_ws_signed_payload.py (9 tests, all green):

  • test_compute_and_verify_hmac_round_trip — round-trip +
    wrong-secret + wrong-payload rejection
  • test_verify_hmac_signature_rejects_expired_timestamp:
    timestamp window enforced
  • test_hex_round_trip_preserves_signed_bytes — the
    signed_payload hex field decodes to exactly the bytes the
    signature was computed over
  • test_state_change_with_signed_payload_is_dispatched — a
    full SignedWsMessage envelope is accepted, the
    on_state_change callback fires, and an ACK is sent
  • test_tampered_signed_payload_is_rejected — flipping a single
    hex nibble drops the message
  • test_pre_fix_legacy_envelope_without_signed_payload_is_rejected:
    the historical (pre-fix) server shape is still rejected
    rather than silently accepted
  • test_malformed_signed_payload_does_not_crash — non-hex
    signed_payload falls through to the legacy verify path
    and is rejected by the signature check, no exception
  • test_replayed_signed_payload_with_spliced_body_is_rejected:
    the security property: an attacker who captured a benign
    signed pair can re-trigger its captured state but cannot
    escalate to a forged one
  • test_acknowledged_states_use_pascalcase — pins the S-2
    contract

Full SDK suite: 443 passed, 13 skipped, 0 failed.

Risk

  • Backwards compatible: servers that do not ship
    signed_payload still go through the legacy verify path
    (their messages were already failing verification, so
    nothing regresses).
  • The dispatcher now reads from the trusted source. If the
    server is upgraded to a build that does ship
    signed_payload but keeps a bug that makes the inner bytes
    diverge from the outer body, the dispatcher will see
    different values than the dashboard. The unit tests pin
    the no-divergence assumption via the round-trip test.

Counterpart

The backend side of this fix lives in
nullrunio/nullrun fix/ws-byte-mismatch-signed-payload
branch — the new signed_payload field. It must be merged
together with this PR; otherwise the SDK will fail signature
verification on every message this backend sends.

The SDK's import chain (nullrun.__init__ -> nullrun.decorators ->
nullrun.instrumentation.langgraph -> 'from langchain_core.callbacks
import BaseCallbackHandler') runs at pytest *collection* time, not at a
specific test. With CI installing [dev] only, every test in the suite
errored on collection with:

    ModuleNotFoundError: No module named 'langchain_core'

This is the same class of bug that 'nullrun[langgraph]' exists to
prevent for end users, except the dev install never benefited from
the extras indirection.

Fix: add 'langchain-core>=0.3,<1.0' to the [dev] extras. The
heavier 'langgraph' / 'langchain' extras pull in stacks the unit
tests don't use; the bare core is the smallest dep that makes the
import chain resolve and unblocks test collection on every
supported Python (3.10 / 3.11 / 3.12) on every PR.

Validation: locally on Python 3.14.2 (which is outside the
3.10/3.11/3.12 matrix that CI tests), 'pip install -e .[dev]'
followed by 'pytest tests/' runs 443/443 + 9/9 new byte-mismatch
unit tests, no collection error. CI will re-confirm on the 3.10 /
3.11 / 3.12 matrix.

Counterpart of NULLRUN fix(ws-control) (commit 5e2f65b). The
backend now embeds the exact bytes that were HMAC-signed in a
separate signed_payload field. The SDK:

  1. Verifies the signature against bytes.fromhex(signed_payload),
     falling back to the legacy wire-bytes path only when the
     field is absent (pre-FIX-C servers).
  2. Dispatches state changes from the parsed signed_payload
     bytes, not from the outer envelope body. This closes a
     security hole: an attacker who captured a (signed_payload,
     signature) pair from a benign 'state=Normal' event could
     otherwise splice a forged 'state=Killed' into the outer body
     and the signature would still verify, because the signature
     covers only the signed_payload bytes. Reading dispatch state
     from the trusted source keeps the captured signature
     semantically bound to its captured body.

Tests in test_ws_signed_payload.py cover:
  - round-trip, wrong-secret, tampered-payload rejection
  - malformed signed_payload does not crash
  - replay-with-spliced-body: signature still verifies, but the
    dispatched state is the captured one (not the forged one) -
    the attack is harmless
  - replays where the attacker also rewrites signed_payload are
    rejected via signature mismatch

Note: the two ACK tests are still failing because
ACKNOWLEDGED_STATES is still lowercase. That is fixed separately
by S-2 in the same release - kept as a separate commit so the
byte-mismatch/security fix is reviewable on its own.

The server's WsWorkflowState enum (NULLRUN/backend/src/proxy/http/
ws_control.rs) emits 'Killed' / 'Paused' (PascalCase). The SDK was
comparing against {'killed', 'paused'} (lowercase), so the ACK path
was dead and the server's pending-ack queue grew without ever
being drained.

This unblocks the two remaining failing tests in
test_ws_signed_payload.py:
  - test_state_change_with_signed_payload_is_dispatched (now sends
    the ACK that the server expects)
  - test_acknowledged_states_use_pascalcase (now matches server
    casing)

With byte-mismatch FIX-C in place (commits 5e2f65b + 105fb80), the
KILL/PAUSE path now works end-to-end:
  1. server signs the inner message and embeds the bytes in
     signed_payload
  2. server sends the envelope (flattened WsMessage + signature +
     timestamp + api_key_id + signed_payload)
  3. SDK verifies signature against bytes.fromhex(signed_payload)
  4. SDK dispatches from the trusted source (parsed signed_payload),
     so a captured (signed_payload, signature) pair can only
     re-trigger its captured state, never a forged one
  5. SDK sends ACK on Killed/Paused, draining server's pending-acks

maltsev-dev force-pushed the fix/ws-byte-mismatch-verify-signed-payload branch from 73f3197 to e4f66b2 June 18, 2026 09:09
maltsev-dev merged commit 6573235 into master Jun 18, 2026
0 of 4 checks passed
maltsev-dev added a commit that referenced this pull request Jun 20, 2026
…erage reporter (#26)

* fix: P0 security/stability hardening bundle

Closes the P0/P1/P2/P3 issues from the security review (plan §10/§11.4).

Security / PCI-DSS / GDPR

- P0-1: Mask positional PII in `_enforce_sensitive_tool` by introspecting
  the wrapped function's signature and applying `SENSITIVE_ARG_KEYS` to
  positional params. Pre-fix, `charge("4111-…-1111", 50)` forwarded the
  PAN into `/execute` and the audit log.
- P0-6 / P3-3: `_safe_repr` now redacts BEFORE truncating. The pre-fix
  order truncated first, so `details={…}` past position 50 leaked
  verbatim. `_safe_repr` is now the single source of truth for the
  redact-then-truncate flow.

Cost-audit / reliability

- P0-3: Bounded chunked reads on the sync + async httpx transports
  (`MAX_RESPONSE_BYTES`, default 16 MiB, `NULLRUN_MAX_RESPONSE_BYTES`
  env override). Above the cap, tracking is skipped and
  `_coverage_streaming_skipped` is incremented. Replaces the
  `response.read()` / `await response.aread()` unbounded buffer that
  held entire LLM streaming bodies in memory.
- P0-4: `_do_flush_locked` re-queue on CB OPEN now drops the NEWEST
  non-critical events instead of the oldest. The oldest events
  (incident start, billing-period start) are exactly what a billing
  investigator needs; losing them silently broke monthly rollups.
  Control-plane events (`state_change`, `kill_received`,
  `policy_invalidated`, `key_rotated`) are preserved unconditionally
  so the dashboard KILL switch lands even under sustained backend
  outage.
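
The drop-newest policy can be sketched as follows. The function name, the event shape (`{"type": ..., "id": ...}`), and the capacity handling are assumptions; only the four control-plane event names come from the message above.

```python
from collections import deque

CRITICAL_TYPES = {"state_change", "kill_received", "policy_invalidated", "key_rotated"}


def requeue_drop_newest(buffer: deque, failed_batch: list, capacity: int) -> None:
    """Put a failed batch back in front of the buffer, then trim to capacity.

    When over capacity, the newest non-critical events are dropped first.
    Critical events are kept even if that leaves the buffer over capacity.
    """
    merged = list(failed_batch) + list(buffer)  # oldest first
    overflow = len(merged) - capacity
    kept_reversed = []
    for event in reversed(merged):  # walk from newest to oldest
        if overflow > 0 and event["type"] not in CRITICAL_TYPES:
            overflow -= 1  # drop the newest remaining non-critical event
            continue
        kept_reversed.append(event)
    buffer.clear()
    buffer.extend(reversed(kept_reversed))


pending = deque([
    {"type": "track", "id": 3},
    {"type": "state_change", "id": 4},
    {"type": "track", "id": 5},
])
requeue_drop_newest(pending, [{"type": "track", "id": 1}, {"type": "track", "id": 2}], capacity=3)
kept_ids = [event["id"] for event in pending]
```

In this run the two oldest events and the `state_change` survive, and the two newest `track` events are dropped.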

Identity

- S-8 / P2-4: `agent()` now emits `str(uuid.uuid4())` (with dashes).
  Pre-fix the format was `f"agent-{uuid.uuid4().hex}"` — 32 hex chars,
  no dashes — and backend UUID-typed columns dropped these to NULL
  on insert. User-supplied names are still preserved verbatim.
- §7.2 #16: `workflow()` context manager now resets `span_id` (not
  only `workflow_id` / `trace_id`) so nested `with span()` blocks
  don't leave the inner span_id visible inside the workflow scope.
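
The difference between the two ID formats is easy to check with the standard library. The helper names here are hypothetical; the two format expressions are the ones quoted above.

```python
import uuid


def new_agent_id() -> str:
    """Post-fix format: canonical 36-character UUID string with dashes."""
    return str(uuid.uuid4())


def is_canonical_uuid(value: str) -> bool:
    """True only for the dashed, lowercase form that round-trips via uuid.UUID."""
    try:
        return str(uuid.UUID(value)) == value
    except ValueError:
        return False


legacy_id = f"agent-{uuid.uuid4().hex}"  # pre-fix format quoted in the message
fresh_id = new_agent_id()
```

A strictly UUID-typed column can accept `fresh_id`, whereas `legacy_id` is not parseable as a UUID at all.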

Resource leaks

- S-9: `_active_runs` on `NullRunCallback` is now an `OrderedDict`
  capped at 4096 with FIFO eviction. Pre-fix the dict grew
  unbounded when `on_chain_end` did not fire (some LangChain
  versions short-circuit the end hook on chain-body errors).
- S-10: WebSocket reconnect loop is now capped at 10 consecutive
  failures, then falls back to HTTP-poll. Pre-fix the loop ran
  forever when the backend was permanently down, leaking the
  WS thread.
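
A capped `OrderedDict` with FIFO eviction, as described for `_active_runs`, might look like the sketch below. The class and method names are assumptions, not the callback's real interface.

```python
from collections import OrderedDict


class BoundedRuns:
    """Tracks in-flight runs, evicting the oldest entry once the cap is hit."""

    def __init__(self, cap: int = 4096) -> None:
        self._cap = cap
        self._runs: OrderedDict = OrderedDict()

    def start(self, run_id: str, info: object) -> None:
        self._runs[run_id] = info
        while len(self._runs) > self._cap:
            self._runs.popitem(last=False)  # evict the oldest insertion (FIFO)

    def end(self, run_id: str) -> object:
        # Returns None if the run was already evicted or never started.
        return self._runs.pop(run_id, None)

    def __len__(self) -> int:
        return len(self._runs)


runs = BoundedRuns(cap=2)
for name in ("a", "b", "c"):
    runs.start(name, {"started": name})
```

Even if the end hook never fires, memory stays bounded by the cap; the trade-off is that a very old run may be evicted before its end hook arrives.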

Transport

- §7.2 #6: Separate `hmac_verify_expired_total` counter so SRE can
  distinguish clock-skew (NTP drift) from forged packets. Mirrored
  in both the HTTP and WebSocket verify paths.
- §7.2 #35: `CircuitBreaker.call` now dispatches the OPEN→HALF_OPEN
  jitter through `_maybe_apply_open_jitter_sync` /
  `_maybe_apply_open_jitter_async`. Pre-fix the jitter used
  `time.sleep` before dispatching to async, which blocked the
  caller's event loop on every transition.
- P2-1: `_coverage_seen` now bumps in the httpx path (sync + async).
  Pre-fix the counter was only bumped by the `requests` transport,
  so the dashboard's coverage view was empty for the dominant
  OpenAI / Anthropic / Gemini / Mistral / Cohere traffic.
- P2-3: `is_sensitive_tool` match is case-insensitive. Pre-fix
  `"stripe.charge"` did not match `"Stripe.Charge"`, bypassing the
  sensitive gate.

Concurrency

- §7.2 #39: New `_tools_lock` guards every mutation of
  `_strict_mode_tools` / `_sensitive_tools`. Same lock guards the
  coverage-counter bump+prune sequence (§7.2 #33) so two threads
  can't both observe the dict at length 4095 and both grow it to
  4097 before either prune lands.
- §7.2 #47: New `_langchain_lock` / `_langgraph_lock` guard the
  patch sequences end-to-end. Pre-fix two threads racing through
  `auto_instrument` could both pass the early `_x_patched` check
  and double-wrap `BaseCallbackManager` / `Pregel`.
- §7.2 #33: `_COVERAGE_CAP` (4096) bounds the per-host coverage
  dicts.
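
The double-wrap race and its lock-based fix can be illustrated with a stand-in class. `Target`, `patch_once`, and `call_log` are hypothetical names; the real patch targets are the ones named above.

```python
import threading

_patch_lock = threading.Lock()
_patched = False
call_log = []


class Target:
    """Stand-in for a third-party class that gets instrumented."""

    def run(self) -> str:
        return "result"


def patch_once() -> None:
    """Wrap Target.run exactly once, even when called from many threads."""
    global _patched
    with _patch_lock:  # the check and the wrap happen under the same lock
        if _patched:
            return
        original = Target.run

        def wrapped(self):
            call_log.append("tracked")
            return original(self)

        Target.run = wrapped
        _patched = True


threads = [threading.Thread(target=patch_once) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
result = Target().run()
```

Without the lock, two threads could both see `_patched` as false and stack two wrappers, which would record every call twice.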

Webhook delivery

- P3-2: Exponential backoff (0.5s, 1s, 2s, 4s, 8s, 16s, 30s cap)
  replaces the previous linear schedule. Linear didn't back off
  fast enough under sustained outage — each KILL/PAUSE spawned
  its own delivery thread, producing 1000+ spinning threads
  hammering the dead endpoint.
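
The schedule listed above corresponds to doubling from 0.5s with a 30s ceiling. A minimal sketch, with a hypothetical function name:

```python
def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


schedule = [backoff_delay(n) for n in range(7)]
```

The seventh value would be 32s uncapped, so it is clamped to 30s.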

WAL crash-recovery

- P1-5b: Atomic WAL writes (tmp + `fsync` + `os.replace`), 64 MiB
  rotation with `os.replace(wal, wal.1)`, replay drains both
  `wal.1` and `wal`. New `NULLRUN_WAL_PATH` / `NULLRUN_WAL_MAX_BYTES`
  env overrides for containers with `readOnlyRootFilesystem: true`.
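
The tmp + `fsync` + `os.replace` pattern, plus rotation and two-segment replay, is sketched below. The function names and the JSON-lines record format are assumptions; the message does not say how the SDK lays out WAL records.

```python
import json
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to a temp file in the same directory, fsync, then rename."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".wal-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def rotate_if_needed(path: str, max_bytes: int) -> None:
    """Move the current segment to `<path>.1` once it reaches max_bytes."""
    if os.path.exists(path) and os.path.getsize(path) >= max_bytes:
        os.replace(path, path + ".1")


def replay(path: str) -> list:
    """Read the rotated segment first, then the current one."""
    events = []
    for candidate in (path + ".1", path):
        if os.path.exists(candidate):
            with open(candidate, "rb") as handle:
                events.extend(json.loads(line) for line in handle if line.strip())
    return events
```

A reader never observes a half-written file with this pattern: it sees either the previous contents or the complete new contents.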

Tests

8 new regression test files (57 tests total):
  test_agent_id_uuid.py, test_args_pii_masked.py,
  test_streaming_oom_cap.py, test_lru_active_runs.py,
  test_reconnect_cap.py, test_coverage_seen_httpx.py,
  test_webhook_backoff.py, test_redact.py

`test_buffer_invariants.py` extended with drop-newest +
critical-event preservation cases. `test_release_polish.py`
updated to pin the 5s cap on both the sync and async jitter
helpers (post §7.2 #35 split).

Full incident write-ups in CHANGELOG.md under the same P0/S/P tags.

* fix: address ruff lint findings from CI

Three CI lint failures on `ruff check src/` — fixes only, no
behavioural changes:

- **B905** (`src/nullrun/decorators.py:162`): `zip(bound_params,
  args)` now passes `strict=False` explicitly. Pre-fix the two
  iterables can be different lengths — `bound_params` is sliced to
  `[: len(args)]` but the function may have fewer positional
  parameters than args provided (e.g. *args-style callables), in
  which case the trailing loop below handles the excess. `strict=`
  was implicit and triggered B905. Now explicit so the intent is
  documented in code.

- **I001** (`src/nullrun/instrumentation/auto.py:1146`): the late
  `import os as _os` was moved to the top-of-file import block as
  `import os` (alphabetical order: hashlib, json, logging, os,
  threading). The `_os` alias was only there to avoid shadowing —
  there is no top-level `os` in scope, so the plain name is fine.
  Call site updated to use `os.environ.get(...)`.

- **S108** (`src/nullrun/transport.py:632`): replaced the
  hardcoded `/tmp/nullrun.wal` with
  `os.path.join(tempfile.gettempdir(), "nullrun.wal")`. The
  hardcoded `/tmp` flagged S108 (insecure / non-portable temp
  path) and would have broken the SDK on Windows out of the box.
  `gettempdir()` returns the OS-appropriate temp dir
  (`/tmp` on Linux, `/var/folders/...` on macOS, `%TEMP%` on
  Windows). `NULLRUN_WAL_PATH` env override still wins, so
  containers with `readOnlyRootFilesystem: true` are unaffected.
  Added `import tempfile` to the top-of-file imports.
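
The resulting path resolution can be sketched as a small helper. `resolve_wal_path` and its `environ` parameter are hypothetical names; the env variable and the `tempfile.gettempdir()` default are taken from the message.

```python
import os
import tempfile
from typing import Mapping, Optional


def resolve_wal_path(environ: Optional[Mapping[str, str]] = None) -> str:
    """Prefer NULLRUN_WAL_PATH; otherwise use the OS-appropriate temp dir."""
    env = os.environ if environ is None else environ
    override = env.get("NULLRUN_WAL_PATH")
    if override:
        return override
    return os.path.join(tempfile.gettempdir(), "nullrun.wal")
```

Passing a mapping makes the helper easy to test without mutating the process environment.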

Verified:
  - `ruff check src/` → All checks passed!
  - `mypy src/` → Success: no issues found in 23 source files
  - `pytest` → 493 passed, 13 skipped (CI default, no `-W error`)

* chore(release): bump to 0.5.2

- Promote [Unreleased] to [0.5.2] — 2026-06-19; merge the two
  [Unreleased] sections that had drifted during Sprint 2.5 +
  Phase 0 development so release tooling scanning for the
  [Unreleased] anchor picks up the complete change set exactly
  once.
- Add PEP 561 marker (py.typed) — the package ships inline type
  annotations; the marker tells mypy / pyright / pylance to honour
  them.
- runtime.py (S-4): case-insensitive state compare in
  check_control_plane. Defensive against any backend casing drift
  beyond the current PascalCase (handlers.rs:9258). Pinned by
  tests/test_state_compare_case_insensitive.py (10 cases covering
  PascalCase / UPPERCASE / lowercase / mixed-case).

Working-notes file docs/integration-baseline-2026-06-19.md is
deliberately left untracked, matching the analyze.md pattern from
d74712e.

* test: bump coverage 70.92% → 84.52% with branch coverage

Lifts the SDK's Codecov score from 70.92 % to 84.52 % (+13.6 pp) by
adding 347 new tests across 10 files that exercise previously-untested
branches in the auto-instrumentation patches, runtime gates, transport
fallback modes, circuit breaker Redis path, and the @Protect decorator
fail-CLOSED contract.

pyproject.toml
  - Enable branch coverage so error / fallback paths count.
  - Raise fail_under from 70 → 82 (enforced in CI via `coverage run -m
    pytest && coverage report`).
  - Add precision=2 and skip_empty=true to keep the report readable.

New tests (all 817 pass locally, all 4 CI jobs green):

  tests/test_autogen_patch.py          — 13 tests
  tests/test_crewai_patch.py           — 15 tests
  tests/test_llama_index_patch.py      — 13 tests
  tests/test_langgraph_callback.py     — 38 tests
  tests/test_auto_requests.py          — 24 tests
  tests/test_runtime_branches.py       — 43 tests
  tests/test_transport_branches.py     — 44 tests
  tests/test_circuit_breaker_branches.py — 31 tests
  tests/test_protect_branches.py       — 43 tests
  tests/test_actions_context_init.py   — 50 tests

Per-file coverage deltas:

  instrumentation/autogen.py        21.33 → 93.41 %
  instrumentation/crewai.py         22.97 → 90.82 %
  instrumentation/llama_index.py    28.30 → 100.00 %
  instrumentation/langgraph.py      23.75 → 93.69 %
  instrumentation/auto_requests.py  33.72 → 99.09 %
  breaker/circuit_breaker.py        59.76 → 90.21 %
  transport.py                     82.57 → 84.79 %
  transport_websocket.py           68.70 → 64.10 % (msg-type branches
                                                  still need live ws
                                                  round-trip tests)
  decorators.py                    83.33 → 95.49 %
  runtime.py                       80.14 → 83.24 %
  context.py                       82.76 → 100.00 %
  actions.py                       92.12 → 96.89 %
  breaker/exceptions.py             98.51 → 97.26 %

All 4 CI jobs pass locally (pytest, ruff check, mypy, coverage).

Working-notes file docs/integration-baseline-2026-06-19.md is
deliberately left untracked, matching the analyze.md pattern from
d74712e.

* feat(security): make @sensitive registration fail-CLOSED (ADR-008)

Sensitive-tool registration is part of the security boundary. The
old behaviour caught any exception from _get_or_create_runtime(),
logged it at DEBUG, and returned the original function unchanged —
which meant the wrapped body would later execute without ever being
added to the runtime's sensitive-tool set, completely bypassing the
pre-execution gate under partial initialization (e.g. transient
NullRunAuthenticationError on import).

Replace the silent logger.debug(...) with raise RuntimeError(...)
chained from the original exception. The decorator is the registration
point, not the call site, so raising at decoration time is the correct
signal: the import / module-load fails loudly, the body never gets a
chance to run untracked, and the caller can still inspect the root
cause via __cause__.

The two pre-existing tests pinned the old (silent / wrong-type) contract;
update them to assert the new RuntimeError wrapping:
  - test_sensitive_raises_on_missing_api_key now expects RuntimeError
    whose __cause__ is the original NullRunAuthenticationError.
  - test_sensitive_runtime_init_failure_is_silent is renamed to
    ..._raises and asserts the same __cause__ chaining when a
    _get_or_create_runtime mock raises.
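
The fail-CLOSED shape of the decorator might look like this sketch. `make_sensitive`, `RegistrationBackendError`, and the set-based registry are stand-ins invented for illustration; only the `RuntimeError` wrapping and `__cause__` chaining come from the message.

```python
import functools


class RegistrationBackendError(Exception):
    """Stand-in for the SDK's real initialization errors."""


def make_sensitive(get_runtime):
    """Build a decorator that registers a tool or fails at decoration time."""

    def sensitive(func):
        try:
            registry = get_runtime()
            registry.add(func.__name__)
        except Exception as exc:
            # Fail closed: never hand back an unregistered function.
            raise RuntimeError(
                f"cannot register sensitive tool {func.__name__!r}"
            ) from exc

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)

        return wrapper

    return sensitive


registry = set()


@make_sensitive(lambda: registry)
def charge(amount: int) -> int:
    return amount
```

If `get_runtime` raises, the module that applies the decorator fails to import, and the root cause stays available on `__cause__`.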

* fix(transport): retry /track/batch on 5xx and align auth-verify path (P0 #2, P0 #5)

P0 #2 — _send_batch_with_retry_info used to do a single
self._client.post(...) + raise_for_status(). A transient backend 5xx
raised out of the flush path; the in-memory buffer was cleared at the
call site and every event in the batch was permanently lost. Wrap the
post() in _retry_with_backoff (max 3 attempts, exponential backoff +
jitter, capped at 10s) so a single 500 no longer drops the whole batch.
429 is retried (helper honors Retry-After when present); other 4xx
errors are returned as-is — those are real client bugs and must not
be retried (e.g. a 401 just wastes the user's budget).

P0 #5 — contract drift: this file's auth-verify call site used
/auth/verify, while the corresponding call in runtime.py:599 already
used /api/v1/auth/verify. Align the rotation call site to /api/v1/auth/verify
so the contract-drift-guard CI catches any future divergence.

Update tests/test_transport.py::test_retry_on_500 to assert the new
contract (third attempt succeeds → call_count == 3, event id in
accepted_event_ids) instead of expecting an immediate exception.
Add tests/test_track_batch_retry.py with full regression coverage:
single 5xx → success, three consecutive 5xx → BreakerTransportError,
429 with Retry-After → honored before next attempt.
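
The retry policy described for P0 #2 can be sketched independently of any HTTP client. `post_with_retry`, the `(status, retry_after)` tuple returned by `send`, and the use of `RuntimeError` in place of the SDK's `BreakerTransportError` are assumptions for illustration.

```python
import random
import time
from typing import Callable, Optional, Tuple


def post_with_retry(
    send: Callable[[], Tuple[int, Optional[float]]],
    max_attempts: int = 3,
    base: float = 0.5,
    cap: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry on 5xx and 429; return any other status to the caller as-is."""
    status = 0
    for attempt in range(max_attempts):
        status, retry_after = send()
        retryable = status == 429 or 500 <= status < 600
        if not retryable:
            return status  # success, or a 4xx that retrying cannot fix
        if attempt + 1 < max_attempts:
            delay = min(cap, base * 2 ** attempt + random.uniform(0, base))
            if status == 429 and retry_after is not None:
                delay = min(cap, retry_after)  # honor the server's hint
            sleep(delay)
    raise RuntimeError(f"giving up after {max_attempts} attempts, last status {status}")
```

Injecting `sleep` keeps tests fast: a list's `append` can record the delays instead of actually waiting.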

* feat(runtime): emit background coverage_report every 60s

The SDK has tracked per-host seen / tracked / streaming_skipped counters
since 0.4.x (bump_coverage_counter, get_coverage_stats), but there was
no path to ship them to the backend — the counters only ever existed
in process memory. This commit adds a daemon thread that emits a
coverage_report track event every 60 seconds so the backend can build
the per-host coverage dashboard.

* NullRunRuntime.track_coverage() — returns a track-result dict when
  there is something to report, or None on cold start (no counters
  bumped yet) so the backend doesn't get an empty row per minute.
* start_coverage_reporter() / stop_coverage_reporter() — idempotent
  lifecycle, daemon thread, sleeps in 0.5s slices for responsive
  shutdown, emits once on entry so short-lived processes (CI, batch
  jobs) still leave a row.
* nullrun.init() wires start_coverage_reporter() in; the reporter is
  a no-op while the process is still cold, so re-init is safe.

New tests/test_coverage_report.py pins the contract: cold start → None,
post-traffic → track-result dict with type=coverage_report and the three
counter dicts, start is idempotent, stop joins cleanly.
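
The reporter lifecycle can be sketched with a daemon thread and a stop event. The `CoverageReporter` class, its constructor parameters, and the use of `Event.wait` for the sliced sleep are assumptions; the real methods live on the runtime and are not shown here.

```python
import threading
import time


class CoverageReporter:
    """Emit a report on entry and then once per interval until stopped."""

    def __init__(self, collect, emit, interval=60.0, slice_seconds=0.5):
        self._collect = collect      # returns a report dict, or None on cold start
        self._emit = emit
        self._interval = interval
        self._slice = slice_seconds
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._thread = None

    def start(self) -> None:
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return  # idempotent: already running
            self._stop.clear()
            self._thread = threading.Thread(target=self._loop, daemon=True)
            self._thread.start()

    def stop(self) -> None:
        with self._lock:
            thread, self._thread = self._thread, None
        self._stop.set()
        if thread is not None:
            thread.join()

    def _emit_once(self) -> None:
        report = self._collect()
        if report is not None:  # skip empty rows while the process is cold
            self._emit(report)

    def _loop(self) -> None:
        self._emit_once()  # short-lived processes still leave one row
        while not self._stop.is_set():
            deadline = time.monotonic() + self._interval
            while time.monotonic() < deadline:
                if self._stop.wait(self._slice):  # wake early on stop
                    return
            self._emit_once()
```

Waiting in short slices means `stop()` returns within roughly one slice rather than blocking for the full interval.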

* chore(breaker): add __main__ shim so 'python -m nullrun.breaker' exits cleanly

Historically the SDK shipped a 'python -m nullrun.breaker' entry point
for in-container health probes and ad-hoc debugging. The nullrun.breaker
subpackage is the circuit-breaker + policy-exceptions surface — it is
not a runnable command. Without this shim, containerized deployments
that scripted 'python -m nullrun.breaker' as a no-op smoke check would
fail with 'No module named nullrun.breaker.__main__'.

This module makes that invocation exit cleanly (return 0) and print a
short pointer to nullrun-doctor (nullrun.toolbox.diagnostics) for
real runtime checks.

* chore: gitignore audit.md (project-local working notes, sibling of analyze.md)

* test: re-align @sensitive test with fail-CLOSED contract after master merge

The auto-merge of master into this branch (commit 7875210) resolved
tests/test_protect_branches.py by taking master's side of the conflict,
leaving the old test_sensitive_runtime_init_failure_is_silent in place.
That test asserts @sensitive does NOT raise — but the production
change in commit 58263a1 (this branch) makes @sensitive raise
RuntimeError (fail-CLOSED, ADR-008). Result: CI ran the old assertion
against the new production code and failed.

Restore the renamed and re-asserted version of the test from commit
58263a1 — test_sensitive_runtime_init_failure_raises — so the test
asserts the new contract: RuntimeError is raised and __cause__ chains
the original exception.

runtime.py was resolved correctly by the auto-merge (both sides kept:
the new track_coverage / start_coverage_reporter / stop_coverage_reporter
/ _coverage_reporter_loop methods AND the existing bump_coverage_counter
are all present), so no changes there.
maltsev-dev added a commit that referenced this pull request Jun 21, 2026
* fix: P0 security/stability hardening bundle

Closes the P0/P1/P2/P3 issues from the security review (plan §10/§11.4).

Security / PCI-DSS / GDPR

- P0-1: Mask positional PII in `_enforce_sensitive_tool` by introspecting
  the wrapped function's signature and applying `SENSITIVE_ARG_KEYS` to
  positional params. Pre-fix, `charge("4111-…-1111", 50)` forwarded the
  PAN into `/execute` and the audit log.
- P0-6 / P3-3: `_safe_repr` now redacts BEFORE truncating. The pre-fix
  order truncated first, so `details={…}` past position 50 leaked
  verbatim. `_safe_repr` is now the single source of truth for the
  redact-then-truncate flow.

Cost-audit / reliability

- P0-3: Bounded chunked reads on the sync + async httpx transports
  (`MAX_RESPONSE_BYTES`, default 16 MiB, `NULLRUN_MAX_RESPONSE_BYTES`
  env override). Above the cap, tracking is skipped and
  `_coverage_streaming_skipped` is incremented. Replaces the
  `response.read()` / `await response.aread()` unbounded buffer that
  held entire LLM streaming bodies in memory.
- P0-4: `_do_flush_locked` re-queue on CB OPEN now drops the NEWEST
  non-critical events instead of the oldest. The oldest events
  (incident start, billing-period start) are exactly what a billing
  investigator needs; losing them silently broke monthly rollups.
  Control-plane events (`state_change`, `kill_received`,
  `policy_invalidated`, `key_rotated`) are preserved unconditionally
  so the dashboard KILL switch lands even under sustained backend
  outage.

Identity

- S-8 / P2-4: `agent()` now emits `str(uuid.uuid4())` (with dashes).
  Pre-fix the format was `f"agent-{uuid.uuid4().hex}"` — 32 hex chars,
  no dashes — and backend UUID-typed columns dropped these to NULL
  on insert. User-supplied names are still preserved verbatim.
- §7.2 #16: `workflow()` context manager now resets `span_id` (not
  only `workflow_id` / `trace_id`) so nested `with span()` blocks
  don't leave the inner span_id visible inside the workflow scope.

Resource leaks

- S-9: `_active_runs` on `NullRunCallback` is now an `OrderedDict`
  capped at 4096 with FIFO eviction. Pre-fix the dict grew
  unbounded when `on_chain_end` did not fire (some LangChain
  versions short-circuit the end hook on chain-body errors).
- S-10: WebSocket reconnect loop is now capped at 10 consecutive
  failures, then falls back to HTTP-poll. Pre-fix the loop ran
  forever when the backend was permanently down, leaking the
  WS thread.

Transport

- §7.2 #6: Separate `hmac_verify_expired_total` counter so SRE can
  distinguish clock-skew (NTP drift) from forged packets. Mirrored
  in both the HTTP and WebSocket verify paths.
- §7.2 #35: `CircuitBreaker.call` now dispatches the OPEN→HALF_OPEN
  jitter through `_maybe_apply_open_jitter_sync` /
  `_maybe_apply_open_jitter_async`. Pre-fix the jitter used
  `time.sleep` before dispatching to async, which blocked the
  caller's event loop on every transition.
- P2-1: `_coverage_seen` now bumps in the httpx path (sync + async).
  Pre-fix the counter was only bumped by the `requests` transport,
  so the dashboard's coverage view was empty for the dominant
  OpenAI / Anthropic / Gemini / Mistral / Cohere traffic.
- P2-3: `is_sensitive_tool` match is case-insensitive. Pre-fix
  `"stripe.charge"` did not match `"Stripe.Charge"`, bypassing the
  sensitive gate.

Concurrency

- §7.2 #39: New `_tools_lock` guards every mutation of
  `_strict_mode_tools` / `_sensitive_tools`. Same lock guards the
  coverage-counter bump+prune sequence (§7.2 #33) so two threads
  can't both observe the dict at length 4095 and both grow it to
  4097 before either prune lands.
- §7.2 #47: New `_langchain_lock` / `_langgraph_lock` guard the
  patch sequences end-to-end. Pre-fix two threads racing through
  `auto_instrument` could both pass the early `_x_patched` check
  and double-wrap `BaseCallbackManager` / `Pregel`.
- §7.2 #33: `_COVERAGE_CAP` (4096) bounds the per-host coverage
  dicts.

Webhook delivery

- P3-2: Exponential backoff (0.5s, 1s, 2s, 4s, 8s, 16s, 30s cap)
  replaces the previous linear schedule. Linear didn't back off
  fast enough under sustained outage — each KILL/PAUSE spawned
  its own delivery thread, producing 1000+ spinning threads
  hammering the dead endpoint.

WAL crash-recovery

- P1-5b: Atomic WAL writes (tmp + `fsync` + `os.replace`), 64 MiB
  rotation with `os.replace(wal, wal.1)`, replay drains both
  `wal.1` and `wal`. New `NULLRUN_WAL_PATH` / `NULLRUN_WAL_MAX_BYTES`
  env overrides for containers with `readOnlyRootFilesystem: true`.

Tests

8 new regression test files (57 tests total):
  test_agent_id_uuid.py, test_args_pii_masked.py,
  test_streaming_oom_cap.py, test_lru_active_runs.py,
  test_reconnect_cap.py, test_coverage_seen_httpx.py,
  test_webhook_backoff.py, test_redact.py

`test_buffer_invariants.py` extended with drop-newest +
critical-event preservation cases. `test_release_polish.py`
updated to pin the 5s cap on both the sync and async jitter
helpers (post §7.2 #35 split).

Full incident write-ups in CHANGELOG.md under the same P0/S/P tags.

* fix: address ruff lint findings from CI

Three CI lint failures on `ruff check src/` — fixes only, no
behavioural changes:

- **B905** (`src/nullrun/decorators.py:162`): `zip(bound_params,
  args)` now passes `strict=False` explicitly. Pre-fix the two
  iterables can be different lengths — `bound_params` is sliced to
  `[: len(args)]` but the function may have fewer positional
  parameters than args provided (e.g. *args-style callables), in
  which case the trailing loop below handles the excess. `strict=`
  was implicit and triggered B905. Now explicit so the intent is
  documented in code.

- **I001** (`src/nullrun/instrumentation/auto.py:1146`): the late
  `import os as _os` was moved to the top-of-file import block as
  `import os` (alphabetical order: hashlib, json, logging, os,
  threading). The `_os` alias was only there to avoid shadowing —
  there is no top-level `os` in scope, so the plain name is fine.
  Call site updated to use `os.environ.get(...)`.

- **S108** (`src/nullrun/transport.py:632`): replaced the
  hardcoded `/tmp/nullrun.wal` with
  `os.path.join(tempfile.gettempdir(), "nullrun.wal")`. The
  hardcoded `/tmp` flagged S108 (insecure / non-portable temp
  path) and would have broken the SDK on Windows out of the box.
  `gettempdir()` returns the OS-appropriate temp dir
  (`/tmp` on Linux, `/var/folders/...` on macOS, `%TEMP%` on
  Windows). `NULLRUN_WAL_PATH` env override still wins, so
  containers with `readOnlyRootFilesystem: true` are unaffected.
  Added `import tempfile` to the top-of-file imports.

Verified:
  - `ruff check src/` → All checks passed!
  - `mypy src/` → Success: no issues found in 23 source files
  - `pytest` → 493 passed, 13 skipped (CI default, no `-W error`)

* chore(release): bump to 0.5.2

- Promote [Unreleased] to [0.5.2] — 2026-06-19; merge the two
  [Unreleased] sections that had drifted during Sprint 2.5 +
  Phase 0 development so release tooling scanning for the
  [Unreleased] anchor picks up the complete change set exactly
  once.
- Add PEP 561 marker (py.typed) — the package ships inline type
  annotations; the marker tells mypy / pyright / pylance to honour
  them.
- runtime.py (S-4): case-insensitive state compare in
  check_control_plane. Defensive against any backend casing drift
  beyond the current PascalCase (handlers.rs:9258). Pinned by
  tests/test_state_compare_case_insensitive.py (10 cases covering
  PascalCase / UPPERCASE / lowercase / mixed-case).
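
A small sketch of what a case-insensitive state compare could look like. The helper name `state_matches` is an assumption for illustration; the actual logic in `check_control_plane` may be structured differently.

```python
def state_matches(reported: object, expected: str) -> bool:
    """Compare a control-plane state string while ignoring case.

    `state_matches` is a hypothetical helper, not the SDK's function.
    Non-string values never match, so a malformed payload cannot be
    mistaken for a real state.
    """
    if not isinstance(reported, str):
        return False
    return reported.casefold() == expected.casefold()
```

Under this sketch, `"Killed"`, `"KILLED"`, `"killed"` and `"kIlLeD"` all match the expected state `"killed"`, while `"Normal"` and `None` do not.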

Working-notes file docs/integration-baseline-2026-06-19.md is
deliberately left untracked, matching the analyze.md pattern from
d74712e.

* test: bump coverage 70.92% → 84.52% with branch coverage

Lifts the SDK's Codecov score from 70.92 % to 84.52 % (+13.6 pp) by
adding 314 new tests across 10 files that exercise previously untested
branches in the auto-instrumentation patches, runtime gates, transport
fallback modes, circuit breaker Redis path, and the @protect decorator
fail-CLOSED contract.

pyproject.toml
  - Enable branch coverage so error / fallback paths count.
  - Raise fail_under from 70 to 82 (enforced in CI via `coverage run -m
    pytest && coverage report`).
  - Add precision=2 and skip_empty=true to keep the report readable.

New tests (all 817 pass locally, all 4 CI jobs green):

  tests/test_autogen_patch.py          — 13 tests
  tests/test_crewai_patch.py           — 15 tests
  tests/test_llama_index_patch.py      — 13 tests
  tests/test_langgraph_callback.py     — 38 tests
  tests/test_auto_requests.py          — 24 tests
  tests/test_runtime_branches.py       — 43 tests
  tests/test_transport_branches.py     — 44 tests
  tests/test_circuit_breaker_branches.py — 31 tests
  tests/test_protect_branches.py       — 43 tests
  tests/test_actions_context_init.py   — 50 tests

Per-file coverage deltas:

  instrumentation/autogen.py         21.33 →  93.41 %
  instrumentation/crewai.py          22.97 →  90.82 %
  instrumentation/llama_index.py     28.30 → 100.00 %
  instrumentation/langgraph.py       23.75 →  93.69 %
  instrumentation/auto_requests.py   33.72 →  99.09 %
  breaker/circuit_breaker.py         59.76 →  90.21 %
  transport.py                       82.57 →  84.79 %
  transport_websocket.py             68.70 →  64.10 %
      (msg-type branches still need live ws round-trip tests)
  decorators.py                      83.33 →  95.49 %
  runtime.py                         80.14 →  83.24 %
  context.py                         82.76 → 100.00 %
  actions.py                         92.12 →  96.89 %
  breaker/exceptions.py              98.51 →  97.26 %

All 4 CI jobs pass locally (pytest, ruff check, mypy, coverage).

Working-notes file docs/integration-baseline-2026-06-19.md is
deliberately left untracked, matching the analyze.md pattern from
d74712e.

* feat(security): make @sensitive registration fail-CLOSED (ADR-008)

Sensitive-tool registration is part of the security boundary. The
old behaviour caught any exception from _get_or_create_runtime(),
logged it at DEBUG, and returned the original function unchanged —
which meant the wrapped body would later execute without ever being
added to the runtime's sensitive-tool set, completely bypassing the
pre-execution gate under partial initialization (e.g. transient
NullRunAuthenticationError on import).

Replace the silent logger.debug(...) with raise RuntimeError(...)
chained from the original exception. The decorator is the registration
point, not the call site, so raising at decoration time is the correct
signal: the import / module-load fails loudly, the body never gets a
chance to run untracked, and the caller can still inspect the root
cause via __cause__.

The two pre-existing tests pinned the old (silent / wrong-type) contract;
update them to assert the new RuntimeError wrapping:
  - test_sensitive_raises_on_missing_api_key now expects RuntimeError
    whose __cause__ is the original NullRunAuthenticationError.
  - test_sensitive_runtime_init_failure_is_silent is renamed to
    ..._raises and asserts the same __cause__ chaining when a
    _get_or_create_runtime mock raises.
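
The fail-CLOSED registration described above can be sketched as follows. This is an illustration under stated assumptions, not the SDK's decorator: the runtime is modelled as a plain set, the runtime getter is passed in as an argument, and `NullRunAuthenticationError` is redefined locally as a stand-in.

```python
from typing import Any, Callable, Set, TypeVar

F = TypeVar("F", bound=Callable[..., Any])


class NullRunAuthenticationError(Exception):
    """Local stand-in for the SDK exception named in the commit message."""


def sensitive(get_runtime: Callable[[], Set[str]]) -> Callable[[F], F]:
    """Register a tool as sensitive, failing loudly if registration fails.

    `get_runtime` is a hypothetical injection point standing in for
    _get_or_create_runtime(); the registry is modelled as a set of names.
    """
    def decorator(func: F) -> F:
        try:
            registry = get_runtime()
        except Exception as exc:
            # Fail closed: raise at decoration time so the body can never
            # run unregistered. The root cause stays on __cause__.
            raise RuntimeError(
                f"cannot register sensitive tool {func.__name__!r}"
            ) from exc
        registry.add(func.__name__)
        return func
    return decorator
```

Because the error surfaces while the module is being imported, a partially initialized runtime stops the import instead of leaving an ungated function behind.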

* fix(transport): retry /track/batch on 5xx and align auth-verify path (P0 #2, P0 #5)

P0 #2 — _send_batch_with_retry_info used to do a single
self._client.post(...) + raise_for_status(). A transient backend 5xx
raised out of the flush path; the in-memory buffer was cleared at the
call site and every event in the batch was permanently lost. Wrap the
post() in _retry_with_backoff (max 3 attempts, exponential backoff +
jitter, capped at 10s) so a single 500 no longer drops the whole batch.
429 is retried (helper honors Retry-After when present); other 4xx
errors are returned as-is — those are real client bugs and must not
be retried (e.g. a 401 just wastes the user's budget).
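
A rough sketch of the retry policy described for P0 #2, under stated assumptions: `Response`, `TransportError` and `retry_with_backoff` are hypothetical names, the base delay of 0.5s is invented for illustration, and the real `_retry_with_backoff` helper may differ in signature and in how it applies jitter.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


class TransportError(Exception):
    """Hypothetical stand-in for the SDK's transport failure exception."""


@dataclass
class Response:
    """Hypothetical minimal response: status code plus parsed Retry-After."""
    status_code: int
    retry_after: Optional[float] = None  # seconds, if the header was present


def retry_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 3,
    base_delay: float = 0.5,   # assumed value, not from the source
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry 5xx and 429 responses; return anything else immediately."""
    for attempt in range(1, max_attempts + 1):
        response = send()
        retryable = response.status_code >= 500 or response.status_code == 429
        if not retryable:
            # Success, or a non-429 4xx that retrying would not fix.
            return response
        if attempt == max_attempts:
            break
        if response.retry_after is not None:
            delay = response.retry_after  # honor the server's hint
        else:
            # Exponential backoff with jitter, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = random.uniform(0.0, ceiling)
        sleep(delay)
    raise TransportError(f"gave up after {max_attempts} attempts")
```

Injecting `sleep` keeps the sketch testable without real waiting; a caller would keep the batch in its buffer until this function returns a successful response.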

P0 #5 — contract drift: this file's auth-verify call site used
/auth/verify, while the corresponding call in runtime.py:599 already
used /api/v1/auth/verify. Align the rotation call site to /api/v1/auth/verify
so the contract-drift-guard CI catches any future divergence.

Update tests/test_transport.py::test_retry_on_500 to assert the new
contract (third attempt succeeds → call_count == 3, event id in
accepted_event_ids) instead of expecting an immediate exception.
Add tests/test_track_batch_retry.py with full regression coverage:
single 5xx → success, three consecutive 5xx → BreakerTransportError,
429 with Retry-After → honored before next attempt.

* feat(runtime): emit background coverage_report every 60s

The SDK has tracked per-host seen / tracked / streaming_skipped counters
since 0.4.x (bump_coverage_counter, get_coverage_stats), but there was
no path to ship them to the backend — the counters only ever existed
in process memory. This commit adds a daemon thread that emits a
coverage_report track event every 60 seconds so the backend can build
the per-host coverage dashboard.

* NullRunRuntime.track_coverage() — returns a track-result dict when
  there is something to report, or None on cold start (no counters
  bumped yet) so the backend doesn't get an empty row per minute.
* start_coverage_reporter() / stop_coverage_reporter() — idempotent
  lifecycle, daemon thread, sleeps in 0.5s slices for responsive
  shutdown, emits once on entry so short-lived processes (CI, batch
  jobs) still leave a row.
* nullrun.init() wires start_coverage_reporter() in; start is idempotent
  and the reporter emits nothing while the process is still cold, so
  re-init is safe.

New tests/test_coverage_report.py pins the contract: cold start → None,
post-traffic → track-result dict with type=coverage_report and the three
counter dicts, start is idempotent, stop joins cleanly.
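
The reporter lifecycle above might look roughly like this. The class name `CoverageReporter` and the `emit` callback are assumptions for illustration; the SDK exposes this as methods on the runtime, and this sketch waits on a `threading.Event` in 0.5s slices instead of calling `time.sleep`.

```python
import threading
from typing import Callable, Optional


class CoverageReporter:
    """Emit once on entry, then every `interval` seconds, until stopped.

    Hypothetical sketch; `emit` stands in for sending a coverage_report.
    """

    def __init__(
        self,
        emit: Callable[[], None],
        interval: float = 60.0,
        slice_seconds: float = 0.5,
    ) -> None:
        self._emit = emit
        self._interval = interval
        self._slice = slice_seconds
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return  # idempotent: a second start is a no-op
            self._stop.clear()
            self._thread = threading.Thread(target=self._loop, daemon=True)
            self._thread.start()

    def stop(self) -> None:
        with self._lock:
            self._stop.set()
            if self._thread is not None:
                self._thread.join()
                self._thread = None

    def _loop(self) -> None:
        self._emit()  # emit on entry so short-lived processes leave a row
        while True:
            waited = 0.0
            while waited < self._interval:
                # Short waits keep shutdown responsive.
                if self._stop.wait(self._slice):
                    return
                waited += self._slice
            self._emit()
```

Waiting on the event in short slices means `stop()` returns within roughly one slice instead of blocking for up to a full interval.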

* chore(breaker): add __main__ shim so 'python -m nullrun.breaker' exits cleanly

Historically the SDK shipped a 'python -m nullrun.breaker' entry point
for in-container health probes and ad-hoc debugging. The nullrun.breaker
subpackage is the circuit-breaker + policy-exceptions surface — it is
not a runnable command. Without this shim, containerized deployments
that scripted 'python -m nullrun.breaker' as a no-op smoke check would
fail with 'No module named nullrun.breaker.__main__'.

This module makes that invocation exit cleanly (return 0) and print a
short pointer to nullrun-doctor (nullrun.toolbox.diagnostics) for
real runtime checks.

* chore: gitignore audit.md (project-local working notes, sibling of analyze.md)

* test: re-align @sensitive test with fail-CLOSED contract after master merge

The auto-merge of master into this branch (commit 7875210) resolved
tests/test_protect_branches.py by taking master's side of the conflict,
leaving the old test_sensitive_runtime_init_failure_is_silent in place.
That test asserts @sensitive does NOT raise — but the production
change in commit 58263a1 (this branch) makes @sensitive raise
RuntimeError (fail-CLOSED, ADR-008). Result: CI ran the old assertion
against the new production code and failed.

Restore the renamed and re-asserted version of the test from commit
58263a1 — test_sensitive_runtime_init_failure_raises — so the test
asserts the new contract: RuntimeError is raised and __cause__ chains
the original exception.

runtime.py was resolved correctly by the auto-merge (both sides kept:
the new track_coverage / start_coverage_reporter / stop_coverage_reporter
/ _coverage_reporter_loop methods AND the existing bump_coverage_counter
are all present), so no changes there.