nullrunio · maltsev-dev · Jun 19, 2026 · Jun 19, 2026 · Jun 19, 2026 · Jun 19, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -132,7 +132,14 @@ surface is unchanged. Aligns the SDK with the contracts in
   description of an older design that did not match the
   shipped SDK.
 
-## [Unreleased]
+## [0.5.2] — 2026-06-19
+
+This release bundles the Sprint 2.5 production-readiness hardening
+alongside the Phase 0 contract / lifecycle fixes. The two streams were
+shipped as separate `[Unreleased]` sections during development; they
+are merged here into a single canonical entry so release tooling that
+scans for the `[Unreleased]` anchor picks up the complete change set
+exactly once.
 
 ### Added (production-readiness hardening)
 
@@ -209,6 +216,26 @@ surface is unchanged. Aligns the SDK with the contracts in
   fingerprint used by `track_event` (the existing
   `_fingerprint_for` is for HTTP responses keyed on host+body+status).
 
+- **Async Policy Cache**: `AsyncTransport` now uses `PolicyCache` for CACHED fallback mode. Previously the async transport always fell back to PERMISSIVE when gateway was unreachable. Now it caches successful execute decisions and uses them when gateway is unavailable.
+
+- **Custom Sensitive Tools API**: Added `add_sensitive_tool()`, `remove_sensitive_tool()`, `register_sensitive_tools()`, and `get_sensitive_tools()` methods to `NullRunRuntime`. Users can now register custom tools as sensitive requiring strict mode enforcement.
+
+- **`NullRunBlockedException.tool_name` attribute** (FIX-5): The `tool_name`
+  kwarg is now a first-class attribute on `NullRunBlockedException`
+  (and its subclasses `LoopDetectedException`, etc.) instead of being
+  absorbed into `**details`. Cookbook examples that read `exc.tool_name`
+  no longer raise `AttributeError`. Backwards-compatible: `tool_name`
+  defaults to `None` and does not appear in `exc.details` when unset.
+  The stringified exception now includes `tool={name}` when set.
+
+- **`check_control_plane` is case-insensitive on the state value.**
+  SDK now normalises the state with `.lower()` before comparing to
+  `"paused"` / `"killed"`. Pre-fix a backend regression to UPPERCASE
+  (e.g. `"KILLED"` in `state_change`) would have silently failed the
+  match and let a killed workflow keep running. Backend already emits
+  PascalCase per the `as_pascal_case()` normaliser in
+  `handlers.rs:9258`; this is defensive per `analyze.md` §11.6.
+
 ### Removed (Phase 5)
 
 - **Empty placeholder modules deleted.** `src/nullrun/flow/`,
@@ -238,32 +265,88 @@ surface is unchanged. Aligns the SDK with the contracts in
   behalf) to `X-API-Key`, and from the non-existent `/usage`
   endpoint to the canonical `/quota` per `contracts/openapi.yaml`.
 
-### Notes
-
-- Public surface unchanged. `init`, `protect`, `track_llm`,
-  `track_tool`, `track_event` retain the same call signatures
-  documented in the existing examples. The platform's
-  `docs/sdk/README.md` describes an alternative 7-symbol surface
-  (with `wrap` alias and a different `init(organization_id, ...)`
-  signature) — that doc is out of sync with the SDK; an update
-  to the platform docs is tracked separately. Per the production
-  plan's user decisions, the SDK's surface is the source of truth.
-
-## [Unreleased]
-
-### Added
-
-- **Async Policy Cache**: `AsyncTransport` now uses `PolicyCache` for CACHED fallback mode. Previously the async transport always fell back to PERMISSIVE when gateway was unreachable. Now it caches successful execute decisions and uses them when gateway is unavailable.
-- **Custom Sensitive Tools API**: Added `add_sensitive_tool()`, `remove_sensitive_tool()`, `register_sensitive_tools()`, and `get_sensitive_tools()` methods to `NullRunRuntime`. Users can now register custom tools as sensitive requiring strict mode enforcement.
-- **`NullRunBlockedException.tool_name` attribute** (FIX-5): The `tool_name`
-  kwarg is now a first-class attribute on `NullRunBlockedException`
-  (and its subclasses `LoopDetectedException`, etc.) instead of being
-  absorbed into `**details`. Cookbook examples that read `exc.tool_name`
-  no longer raise `AttributeError`. Backwards-compatible: `tool_name`
-  defaults to `None` and does not appear in `exc.details` when unset.
-  The stringified exception now includes `tool={name}` when set.
-
-### Fixed
+- **P0-1 (PCI-DSS / GDPR): positional PII masking.** Sensitive tools
+  called positionally (e.g. ``charge("4111-1111-1111-1111", 50)``) now
+  mask positional args the same way kwargs already do, by introspecting
+  the function signature with ``inspect.signature(fn)`` and applying
+  ``SENSITIVE_ARG_KEYS`` to the matching parameter name. Pre-fix the
+  PAN at position 0 was forwarded as-is into ``/execute`` and landed
+  in the audit log.
+
+- **P0-3 (OOM): streaming response memory cap.** Sync and async
+  httpx transports now use bounded chunked reads capped at
+  ``MAX_RESPONSE_BYTES`` (16 MiB by default; ``NULLRUN_MAX_RESPONSE_BYTES``
+  env var to override). When the cap is exceeded, tracking is skipped
+  and ``_coverage_streaming_skipped`` is incremented so the dashboard
+  sees which hosts are producing oversized responses. Pre-fix
+  ``response.read()`` / ``await response.aread()`` buffered the entire
+  response body in memory — a 16+ MB allocation per streaming LLM
+  call under load.
+
+- **P0-4 (cost-audit): drop-newest on buffer overflow.** The CB-OPEN
+  re-queue path in ``Transport._do_flush_locked`` now drops the
+  NEWEST non-critical events instead of the oldest. The oldest
+  events (start-of-incident, start-of-billing-period) are exactly
+  what a billing investigator needs to reconstruct — losing them
+  silently broke monthly rollups. Control-plane events
+  (``state_change`` / ``kill_received`` / ``policy_invalidated`` /
+  ``key_rotated``) are preserved regardless of position so the
+  dashboard's KILL switch continues to land even under sustained
+  backend outage.
+
+- **P0-6 + P3-3 (security): redact-before-truncate.** ``_safe_repr``
+  now runs ``_strip_details_balanced`` on the FULL repr before
+  truncating to ``max_len=50``. Pre-fix the truncate ran first, and
+  if ``details={...}`` lived past position 50 in the original repr
+  (common for httpx.HTTPError with a long URL), the redact pass
+  saw nothing on the truncated slice and the raw payload leaked
+  into ``span_end`` audit events.
+
+- **S-8 / P2-4: ``agent_id`` is now a real UUID with dashes.**
+  ``agent()`` context manager emits ``str(uuid.uuid4())`` (e.g.
+  ``95ca7c0b-8334-478a-af23-2788803ef3b8``) for auto-generated ids.
+  Pre-fix the format was ``f"agent-{uuid.uuid4().hex}"`` — 32 hex
+  chars with no dashes; backend UUID-typed columns silently
+  dropped these to NULL on insert. User-supplied names are still
+  preserved verbatim.
+
+- **S-9: LRU cap on ``NullRunCallback._active_runs``** (4096 entries,
+  FIFO eviction with WARN log). Pre-fix this dict grew unbounded
+  when ``on_chain_end`` did not fire (errors in the chain body
+  short-circuited the end hook for some LangChain versions),
+  leaking memory in long-running services.
+
+- **S-10: WebSocket reconnect max-attempts cap** (10 consecutive
+  failures). Pre-fix the loop was unbounded (``while not
+  self._closed:``) and leaked the WS thread forever when the backend
+  was permanently down. After the cap the SDK falls back to
+  HTTP-poll for control-plane state delivery.
+
+- **P2-1: ``_coverage_seen`` now bumps in the httpx path.**
+  Pre-fix the counter was only incremented in the ``requests``
+  path (``auto_requests.py:185``), so the dashboard's coverage
+  view was empty for the dominant httpx traffic (every OpenAI /
+  Anthropic / Gemini / Mistral / Cohere call). Now both sync and
+  async httpx ``_emit`` bump the counter.
+
+- **P3-2: webhook delivery uses exponential backoff** (cap 30s).
+  Pre-fix the schedule was linear (``0.5 * (attempt + 1)``); under
+  sustained outage this produced a tight retry storm on the dead
+  endpoint — each KILL/PAUSE spawned its own delivery thread.
+  Post-fix the schedule is ``0.5 * 2**attempt`` capped at 30s:
+  0.5s, 1.0s, 2.0s, 4.0s, 8.0s, 16.0s, 30.0s.
+
+### Tests
+
+Added regression tests for every item above (57 new tests across 9
+new test files: ``test_agent_id_uuid.py``, ``test_args_pii_masked.py``,
+``test_streaming_oom_cap.py``, ``test_lru_active_runs.py``,
+``test_reconnect_cap.py``, ``test_coverage_seen_httpx.py``,
+``test_webhook_backoff.py``, ``test_redact.py``; existing
+``test_buffer_invariants.py`` extended with drop-newest + critical-event
+preservation cases).
+
+### Legacy
 
 - **SDK silent runtime fallback removed** (FIX-4): `_get_or_create_runtime`
   in `nullrun.decorators` no longer wraps `NullRunRuntime.get_instance()`
@@ -277,6 +360,17 @@ surface is unchanged. Aligns the SDK with the contracts in
   the SDK has no local mode: a missing API key is a hard error, not a
   silent allow-all.
 
+### Notes
+
+- Public surface unchanged. `init`, `protect`, `track_llm`,
+  `track_tool`, `track_event` retain the same call signatures
+  documented in the existing examples. The platform's
+  `docs/sdk/README.md` describes an alternative 7-symbol surface
+  (with `wrap` alias and a different `init(organization_id, ...)`
+  signature) — that doc is out of sync with the SDK; an update
+  to the platform docs is tracked separately. Per the production
+  plan's user decisions, the SDK's surface is the source of truth.
+
 ---
 
 ## [0.4.0] — 2026-06-17
@@ -496,6 +590,6 @@ _No breaking changes yet. Watch this file._
 
 ---
 
-[Unreleased]: https://github.com/maltsev-dev/nullrun-sdk/compare/v0.1.1...HEAD
+[0.5.2]: https://github.com/maltsev-dev/nullrun-sdk/compare/v0.4.0...v0.5.2
 [0.1.1]: https://github.com/maltsev-dev/nullrun-sdk/releases/tag/v0.1.1
 [0.1.0]: https://github.com/maltsev-dev/nullrun-sdk/releases/tag/v0.1.0
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "nullrun"
-version = "0.4.0"
+version = "0.5.2"
 description = "NullRun Python SDK — Enforcement gateway for AI agents."
 readme = "README.md"
 license = { text = "Apache-2.0" }

diff --git a/src/nullrun/__version__.py b/src/nullrun/__version__.py
@@ -1,4 +1,4 @@
 """NullRun Platform SDK."""
 
-__version__ = "0.4.0"
+__version__ = "0.5.2"
 __platform_version__ = "1.0.0"
diff --git a/src/nullrun/actions.py b/src/nullrun/actions.py
@@ -372,6 +372,20 @@ def _deliver_webhook(self, webhook: WebhookConfig, payload: dict[str, Any]) -> N
             logger.warning("httpx not installed, cannot send webhook")
             return
 
+        # P3-2 (plan §10): exponential backoff between attempts with a
+        # 30s cap. Pre-fix the schedule was linear (``0.5 * (attempt+1)``
+        # → 0.5s, 1.0s, 1.5s, ...). Linear doesn't back off fast enough
+        # when the destination is down — a transient outage produced
+        # 100+ retries in seconds, and each KILL/PAUSE from the server
+        # spawns its own delivery thread, so 1000 events/min generated
+        # 1000 spinning daemon threads hammering the dead endpoint.
+        #
+        # Schedule: 0.5s, 1.0s, 2.0s, 4.0s, 8.0s, 16.0s, 30.0s (capped).
+        # Total worst-case wait over 7 retries is ~62s — long enough to
+        # ride out a brief blip, short enough that one stuck thread
+        # doesn't block forever.
+        _BACKOFF_BASE = 0.5
+        _BACKOFF_CAP = 30.0
         for attempt in range(webhook.retries):
             try:
                 response = httpx.post(
@@ -386,7 +400,8 @@ def _deliver_webhook(self, webhook: WebhookConfig, payload: dict[str, Any]) -> N
             except Exception as e:
                 logger.warning(f"Webhook attempt {attempt + 1} failed: {e}")
                 if attempt < webhook.retries - 1:
-                    time.sleep(0.5 * (attempt + 1))
+                    delay = min(_BACKOFF_BASE * (2 ** attempt), _BACKOFF_CAP)
+                    time.sleep(delay)
 
     def stop_webhooks(self) -> None:
         """Stop webhook delivery thread."""

diff --git a/src/nullrun/breaker/circuit_breaker.py b/src/nullrun/breaker/circuit_breaker.py
@@ -251,50 +251,76 @@ def state(self) -> CBState:
             return self._state
 
     def call(self, func: Callable[..., Any], *args, **kwargs) -> Any:
-        """Execute func through circuit breaker. Supports both sync and async functions."""
-
+        """Execute func through circuit breaker. Supports both sync and async functions.
+
+        §7.2 #35: the pre-fix code did the OPEN→HALF_OPEN jitter
+        via ``time.sleep`` here, BEFORE dispatching to
+        ``_call_sync`` / ``_call_async``. That meant an async
+        caller invoking ``breaker.call(async_func, ...)`` from
+        inside an event loop would block that loop on a sync
+        sleep — turning every HALF_OPEN probe into a 0–5 second
+        stall of the entire coroutine scheduler. The fix decides
+        here whether jitter is needed and lets the dispatch path
+        use ``time.sleep`` for sync callers and ``asyncio.sleep``
+        for async ones.
+        """
         # Check global Redis state first - reject if another instance has it open
         if not self._global_state_allows_call():
             raise BreakerTransportError(
                 f"Circuit breaker OPEN (global) -- service unavailable. "
                 f"Retry in {self._recovery_timeout:.0f}s"
             )
 
-        # Add jitter before transitioning from OPEN to HALF_OPEN to prevent thundering herd
+        # Decide whether jitter is needed; the actual sleep happens
+        # in the dispatch path so it can be ``time.sleep`` for sync
+        # callers and ``asyncio.sleep`` for async ones.
+        needs_open_jitter = (
+            self._state == CBState.OPEN
+            and self._opened_at is not None
+            and (time.monotonic() - self._opened_at) >= self._recovery_timeout
+        )
+
+        # Check if func is a coroutine function (async) before
+        # grabbing any locks — async callers need an awaitable.
+        import inspect
+        if inspect.iscoroutinefunction(func):
+            return self._call_async(func, needs_open_jitter, *args, **kwargs)
+        return self._call_sync(func, needs_open_jitter, *args, **kwargs)
+
+    def _maybe_apply_open_jitter_sync(self) -> None:
+        """Sync version of the OPEN→HALF_OPEN jitter. See §7.2 #35."""
         if self._state == CBState.OPEN and self._opened_at is not None:
             time_in_open = time.monotonic() - self._opened_at
             if time_in_open >= self._recovery_timeout:
-                # Add random jitter (0-30 seconds) to prevent thundering herd
-                # Phase 8: cap at 5s (was 30s). The previous value
-                # blocked the caller's thread for up to 30s on
-                # every OPEN->HALF_OPEN transition. 5s is plenty
-                # to spread reconnects across workers.
+                # Phase 8: cap at 5s (was 30s). 5s is plenty to
+                # spread reconnects across workers.
                 jitter = random.uniform(0, 5.0)
                 time.sleep(jitter)
 
-        state = self.state
+    async def _maybe_apply_open_jitter_async(self) -> None:
+        """Async version of the OPEN→HALF_OPEN jitter. Awaits
+        instead of blocking the event loop. See §7.2 #35."""
+        if self._state == CBState.OPEN and self._opened_at is not None:
+            time_in_open = time.monotonic() - self._opened_at
+            if time_in_open >= self._recovery_timeout:
+                jitter = random.uniform(0, 5.0)
+                await asyncio.sleep(jitter)
 
+    def _call_sync(self, func: Callable[..., Any], needs_open_jitter: bool, *args, **kwargs) -> Any:
+        """Execute sync func through circuit breaker."""
+        if needs_open_jitter:
+            self._maybe_apply_open_jitter_sync()
+        state = self.state
         if state == CBState.OPEN:
             raise BreakerTransportError(
                 f"Circuit breaker OPEN -- service unavailable. "
                 f"Retry in {self._recovery_timeout:.0f}s"
             )
-
         if state == CBState.HALF_OPEN:
             with self._lock:
                 if self._half_open_calls >= self._half_open_max_calls:
                     raise BreakerTransportError("Circuit breaker HALF_OPEN -- waiting")
                 self._half_open_calls += 1
-
-        # Check if func is a coroutine function (async)
-        import inspect
-        if inspect.iscoroutinefunction(func):
-            return self._call_async(func, *args, **kwargs)
-        else:
-            return self._call_sync(func, *args, **kwargs)
-
-    def _call_sync(self, func: Callable[..., Any], *args, **kwargs) -> Any:
-        """Execute sync func through circuit breaker."""
         try:
             result = func(*args, **kwargs)
             self._on_success()
@@ -303,8 +329,21 @@ def _call_sync(self, func: Callable[..., Any], *args, **kwargs) -> Any:
             self._on_failure()
             raise
 
-    async def _call_async(self, func: Callable[..., Any], *args, **kwargs) -> Any:
+    async def _call_async(self, func: Callable[..., Any], needs_open_jitter: bool, *args, **kwargs) -> Any:
         """Execute async func through circuit breaker."""
+        if needs_open_jitter:
+            await self._maybe_apply_open_jitter_async()
+        state = self.state
+        if state == CBState.OPEN:
+            raise BreakerTransportError(
+                f"Circuit breaker OPEN -- service unavailable. "
+                f"Retry in {self._recovery_timeout:.0f}s"
+            )
+        if state == CBState.HALF_OPEN:
+            with self._lock:
+                if self._half_open_calls >= self._half_open_max_calls:
+                    raise BreakerTransportError("Circuit breaker HALF_OPEN -- waiting")
+                self._half_open_calls += 1
         try:
             result = await func(*args, **kwargs)
             await self._on_success_async()

diff --git a/src/nullrun/context.py b/src/nullrun/context.py
@@ -111,17 +111,29 @@ def workflow(name: str | None = None) -> Generator[str, None, None]:
     # was inconsistent with the rest of the SDK's id generation.
     workflow_id = name or str(uuid.uuid4())
     trace_id = generate_trace_id()
+    # §7.2 #16: a new workflow gets a fresh span_id too. The
+    # pre-fix code only reset workflow_id and trace_id, so a
+    # ``with span("inner"); with workflow("outer")`` block would
+    # leave the inner span_id visible inside the workflow scope —
+    # the span emitted by the workflow would carry the wrong
+    # parent. We set a new span_id here so the audit log can
+    # correctly nest the workflow's own span_start under the
+    # workflow_id (rather than under some earlier span that
+    # happened to be on the contextvar stack).
+    span_id = generate_span_id()
 
     # Save current values
     wf_token = _workflow_id_var.set(workflow_id)
     trace_token = _trace_id_var.set(trace_id)
+    span_token = _span_id_var.set(span_id)
 
     try:
         yield workflow_id
     finally:
         # Restore previous values
         _workflow_id_var.reset(wf_token)
         _trace_id_var.reset(trace_token)
+        _span_id_var.reset(span_token)
 
 
 @contextmanager
@@ -168,7 +180,15 @@ def agent(name: str | None = None) -> Generator[str, None, None]:
     Yields:
         The agent_id string
     """
-    agent_id = name or f"agent-{uuid.uuid4().hex}"
+    # P2-4 / S-8: emit a real UUID4 with dashes (matching
+    # ``generate_trace_id`` / ``generate_span_id``). The previous
+    # ``f"agent-{uuid.uuid4().hex}"`` format was 32 hex chars
+    # without dashes; backend UUID-typed columns (cost_events.
+    # agent_id, audit_log) silently dropped these to NULL on insert
+    # (``Uuid::parse_str(...).ok()`` returned None). User-supplied
+    # ``name`` is preserved verbatim so existing dashboards continue
+    # to work for already-allocated agent ids.
+    agent_id = name or str(uuid.uuid4())
     token = _agent_id_var.set(agent_id)
 
     try: