Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
150 changes: 122 additions & 28 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -132,7 +132,14 @@ surface is unchanged. Aligns the SDK with the contracts in
description of an older design that did not match the
shipped SDK.

## [Unreleased]
## [0.5.2] — 2026-06-19

This release bundles the Sprint 2.5 production-readiness hardening
alongside the Phase 0 contract / lifecycle fixes. The two streams were
shipped as separate `[Unreleased]` sections during development; they
are merged here into a single canonical entry so release tooling that
scans for the `[Unreleased]` anchor picks up the complete change set
exactly once.

### Added (production-readiness hardening)

Expand Down Expand Up @@ -209,6 +216,26 @@ surface is unchanged. Aligns the SDK with the contracts in
fingerprint used by `track_event` (the existing
`_fingerprint_for` is for HTTP responses keyed on host+body+status).

- **Async Policy Cache**: `AsyncTransport` now uses `PolicyCache` for CACHED fallback mode. Previously the async transport always fell back to PERMISSIVE when gateway was unreachable. Now it caches successful execute decisions and uses them when gateway is unavailable.

- **Custom Sensitive Tools API**: Added `add_sensitive_tool()`, `remove_sensitive_tool()`, `register_sensitive_tools()`, and `get_sensitive_tools()` methods to `NullRunRuntime`. Users can now register custom tools as sensitive requiring strict mode enforcement.

- **`NullRunBlockedException.tool_name` attribute** (FIX-5): The `tool_name`
kwarg is now a first-class attribute on `NullRunBlockedException`
(and its subclasses `LoopDetectedException`, etc.) instead of being
absorbed into `**details`. Cookbook examples that read `exc.tool_name`
no longer raise `AttributeError`. Backwards-compatible: `tool_name`
defaults to `None` and does not appear in `exc.details` when unset.
The stringified exception now includes `tool={name}` when set.

- **`check_control_plane` is case-insensitive on the state value.**
SDK now normalises the state with `.lower()` before comparing to
`"paused"` / `"killed"`. Pre-fix a backend regression to UPPERCASE
(e.g. `"KILLED"` in `state_change`) would have silently failed the
match and let a killed workflow keep running. Backend already emits
PascalCase per the `as_pascal_case()` normaliser in
`handlers.rs:9258`; this is defensive per `analyze.md` §11.6.

### Removed (Phase 5)

- **Empty placeholder modules deleted.** `src/nullrun/flow/`,
Expand Down Expand Up @@ -238,32 +265,88 @@ surface is unchanged. Aligns the SDK with the contracts in
behalf) to `X-API-Key`, and from the non-existent `/usage`
endpoint to the canonical `/quota` per `contracts/openapi.yaml`.

### Notes

- Public surface unchanged. `init`, `protect`, `track_llm`,
`track_tool`, `track_event` retain the same call signatures
documented in the existing examples. The platform's
`docs/sdk/README.md` describes an alternative 7-symbol surface
(with `wrap` alias and a different `init(organization_id, ...)`
signature) — that doc is out of sync with the SDK; an update
to the platform docs is tracked separately. Per the production
plan's user decisions, the SDK's surface is the source of truth.

## [Unreleased]

### Added

- **Async Policy Cache**: `AsyncTransport` now uses `PolicyCache` for CACHED fallback mode. Previously the async transport always fell back to PERMISSIVE when gateway was unreachable. Now it caches successful execute decisions and uses them when gateway is unavailable.
- **Custom Sensitive Tools API**: Added `add_sensitive_tool()`, `remove_sensitive_tool()`, `register_sensitive_tools()`, and `get_sensitive_tools()` methods to `NullRunRuntime`. Users can now register custom tools as sensitive requiring strict mode enforcement.
- **`NullRunBlockedException.tool_name` attribute** (FIX-5): The `tool_name`
kwarg is now a first-class attribute on `NullRunBlockedException`
(and its subclasses `LoopDetectedException`, etc.) instead of being
absorbed into `**details`. Cookbook examples that read `exc.tool_name`
no longer raise `AttributeError`. Backwards-compatible: `tool_name`
defaults to `None` and does not appear in `exc.details` when unset.
The stringified exception now includes `tool={name}` when set.

### Fixed
- **P0-1 (PCI-DSS / GDPR): positional PII masking.** Sensitive tools
called positionally (e.g. ``charge("4111-1111-1111-1111", 50)``) now
mask positional args the same way kwargs already do, by introspecting
the function signature with ``inspect.signature(fn)`` and applying
``SENSITIVE_ARG_KEYS`` to the matching parameter name. Pre-fix the
PAN at position 0 was forwarded as-is into ``/execute`` and landed
in the audit log.

- **P0-3 (OOM): streaming response memory cap.** Sync and async
httpx transports now use bounded chunked reads capped at
``MAX_RESPONSE_BYTES`` (16 MiB by default; ``NULLRUN_MAX_RESPONSE_BYTES``
env var to override). When the cap is exceeded, tracking is skipped
and ``_coverage_streaming_skipped`` is incremented so the dashboard
sees which hosts are producing oversized responses. Pre-fix
``response.read()`` / ``await response.aread()`` buffered the entire
response body in memory — a 16+ MB allocation per streaming LLM
call under load.

- **P0-4 (cost-audit): drop-newest on buffer overflow.** The CB-OPEN
re-queue path in ``Transport._do_flush_locked`` now drops the
NEWEST non-critical events instead of the oldest. The oldest
events (start-of-incident, start-of-billing-period) are exactly
what a billing investigator needs to reconstruct — losing them
silently broke monthly rollups. Control-plane events
(``state_change`` / ``kill_received`` / ``policy_invalidated`` /
``key_rotated``) are preserved regardless of position so the
dashboard's KILL switch continues to land even under sustained
backend outage.

- **P0-6 + P3-3 (security): redact-before-truncate.** ``_safe_repr``
now runs ``_strip_details_balanced`` on the FULL repr before
truncating to ``max_len=50``. Pre-fix the truncate ran first, and
if ``details={...}`` lived past position 50 in the original repr
(common for httpx.HTTPError with a long URL), the redact pass
saw nothing on the truncated slice and the raw payload leaked
into ``span_end`` audit events.

- **S-8 / P2-4: ``agent_id`` is now a real UUID with dashes.**
``agent()`` context manager emits ``str(uuid.uuid4())`` (e.g.
``95ca7c0b-8334-478a-af23-2788803ef3b8``) for auto-generated ids.
Pre-fix the format was ``f"agent-{uuid.uuid4().hex}"`` — 32 hex
chars with no dashes; backend UUID-typed columns silently
dropped these to NULL on insert. User-supplied names are still
preserved verbatim.

- **S-9: LRU cap on ``NullRunCallback._active_runs``** (4096 entries,
FIFO eviction with WARN log). Pre-fix this dict grew unbounded
when ``on_chain_end`` did not fire (errors in the chain body
short-circuited the end hook for some LangChain versions),
leaking memory in long-running services.

- **S-10: WebSocket reconnect max-attempts cap** (10 consecutive
failures). Pre-fix the loop was unbounded (``while not
self._closed:``) and leaked the WS thread forever when the backend
was permanently down. After the cap the SDK falls back to
HTTP-poll for control-plane state delivery.

- **P2-1: ``_coverage_seen`` now bumps in the httpx path.**
Pre-fix the counter was only incremented in the ``requests``
path (``auto_requests.py:185``), so the dashboard's coverage
view was empty for the dominant httpx traffic (every OpenAI /
Anthropic / Gemini / Mistral / Cohere call). Now both sync and
async httpx ``_emit`` bump the counter.

- **P3-2: webhook delivery uses exponential backoff** (cap 30s).
Pre-fix the schedule was linear (``0.5 * (attempt + 1)``); under
sustained outage this produced a tight retry storm on the dead
endpoint — each KILL/PAUSE spawned its own delivery thread.
Post-fix the schedule is ``0.5 * 2**attempt`` capped at 30s:
0.5s, 1.0s, 2.0s, 4.0s, 8.0s, 16.0s, 30.0s.

### Tests

Added regression tests for every item above (57 new tests across 9
new test files: ``test_agent_id_uuid.py``, ``test_args_pii_masked.py``,
``test_streaming_oom_cap.py``, ``test_lru_active_runs.py``,
``test_reconnect_cap.py``, ``test_coverage_seen_httpx.py``,
``test_webhook_backoff.py``, ``test_redact.py``; existing
``test_buffer_invariants.py`` extended with drop-newest + critical-event
preservation cases).

### Legacy

- **SDK silent runtime fallback removed** (FIX-4): `_get_or_create_runtime`
in `nullrun.decorators` no longer wraps `NullRunRuntime.get_instance()`
Expand All @@ -277,6 +360,17 @@ surface is unchanged. Aligns the SDK with the contracts in
the SDK has no local mode: a missing API key is a hard error, not a
silent allow-all.

### Notes

- Public surface unchanged. `init`, `protect`, `track_llm`,
`track_tool`, `track_event` retain the same call signatures
documented in the existing examples. The platform's
`docs/sdk/README.md` describes an alternative 7-symbol surface
(with `wrap` alias and a different `init(organization_id, ...)`
signature) — that doc is out of sync with the SDK; an update
to the platform docs is tracked separately. Per the production
plan's user decisions, the SDK's surface is the source of truth.

---

## [0.4.0] — 2026-06-17
Expand Down Expand Up @@ -496,6 +590,6 @@ _No breaking changes yet. Watch this file._

---

[Unreleased]: https://github.com/maltsev-dev/nullrun-sdk/compare/v0.1.1...HEAD
[0.5.2]: https://github.com/maltsev-dev/nullrun-sdk/compare/v0.4.0...v0.5.2
[0.1.1]: https://github.com/maltsev-dev/nullrun-sdk/releases/tag/v0.1.1
[0.1.0]: https://github.com/maltsev-dev/nullrun-sdk/releases/tag/v0.1.0
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "nullrun"
version = "0.4.0"
version = "0.5.2"
description = "NullRun Python SDK — Enforcement gateway for AI agents."
readme = "README.md"
license = { text = "Apache-2.0" }
Expand Down
2 changes: 1 addition & 1 deletion src/nullrun/__version__.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
"""NullRun Platform SDK."""

__version__ = "0.4.0"
__version__ = "0.5.2"
__platform_version__ = "1.0.0"
17 changes: 16 additions & 1 deletion src/nullrun/actions.py
Original file line number Diff line number Diff line change
Expand Up @@ -372,6 +372,20 @@ def _deliver_webhook(self, webhook: WebhookConfig, payload: dict[str, Any]) -> N
logger.warning("httpx not installed, cannot send webhook")
return

# P3-2 (plan §10): exponential backoff between attempts with a
# 30s cap. Pre-fix the schedule was linear (``0.5 * (attempt+1)``
# → 0.5s, 1.0s, 1.5s, ...). Linear doesn't back off fast enough
# when the destination is down — a transient outage produced
# 100+ retries in seconds, and each KILL/PAUSE from the server
# spawns its own delivery thread, so 1000 events/min generated
# 1000 spinning daemon threads hammering the dead endpoint.
#
# Schedule: 0.5s, 1.0s, 2.0s, 4.0s, 8.0s, 16.0s, 30.0s (capped).
# Total worst-case wait over 7 retries is ~62s — long enough to
# ride out a brief blip, short enough that one stuck thread
# doesn't block forever.
_BACKOFF_BASE = 0.5
_BACKOFF_CAP = 30.0
for attempt in range(webhook.retries):
try:
response = httpx.post(
Expand All @@ -386,7 +400,8 @@ def _deliver_webhook(self, webhook: WebhookConfig, payload: dict[str, Any]) -> N
except Exception as e:
logger.warning(f"Webhook attempt {attempt + 1} failed: {e}")
if attempt < webhook.retries - 1:
time.sleep(0.5 * (attempt + 1))
delay = min(_BACKOFF_BASE * (2 ** attempt), _BACKOFF_CAP)
time.sleep(delay)

def stop_webhooks(self) -> None:
"""Stop webhook delivery thread."""
Expand Down
81 changes: 60 additions & 21 deletions src/nullrun/breaker/circuit_breaker.py
Original file line number Diff line number Diff line change
Expand Up @@ -251,50 +251,76 @@ def state(self) -> CBState:
return self._state

def call(self, func: Callable[..., Any], *args, **kwargs) -> Any:
"""Execute func through circuit breaker. Supports both sync and async functions."""

"""Execute func through circuit breaker. Supports both sync and async functions.

§7.2 #35: the pre-fix code did the OPEN→HALF_OPEN jitter
via ``time.sleep`` here, BEFORE dispatching to
``_call_sync`` / ``_call_async``. That meant an async
caller invoking ``breaker.call(async_func, ...)`` from
inside an event loop would block that loop on a sync
sleep — turning every HALF_OPEN probe into a 0–5 second
stall of the entire coroutine scheduler. The fix decides
here whether jitter is needed and lets the dispatch path
use ``time.sleep`` for sync callers and ``asyncio.sleep``
for async ones.
"""
# Check global Redis state first - reject if another instance has it open
if not self._global_state_allows_call():
raise BreakerTransportError(
f"Circuit breaker OPEN (global) -- service unavailable. "
f"Retry in {self._recovery_timeout:.0f}s"
)

# Add jitter before transitioning from OPEN to HALF_OPEN to prevent thundering herd
# Decide whether jitter is needed; the actual sleep happens
# in the dispatch path so it can be ``time.sleep`` for sync
# callers and ``asyncio.sleep`` for async ones.
needs_open_jitter = (
self._state == CBState.OPEN
and self._opened_at is not None
and (time.monotonic() - self._opened_at) >= self._recovery_timeout
)

# Check if func is a coroutine function (async) before
# grabbing any locks — async callers need an awaitable.
import inspect
if inspect.iscoroutinefunction(func):
return self._call_async(func, needs_open_jitter, *args, **kwargs)
return self._call_sync(func, needs_open_jitter, *args, **kwargs)

def _maybe_apply_open_jitter_sync(self) -> None:
"""Sync version of the OPEN→HALF_OPEN jitter. See §7.2 #35."""
if self._state == CBState.OPEN and self._opened_at is not None:
time_in_open = time.monotonic() - self._opened_at
if time_in_open >= self._recovery_timeout:
# Add random jitter (0-30 seconds) to prevent thundering herd
# Phase 8: cap at 5s (was 30s). The previous value
# blocked the caller's thread for up to 30s on
# every OPEN->HALF_OPEN transition. 5s is plenty
# to spread reconnects across workers.
# Phase 8: cap at 5s (was 30s). 5s is plenty to
# spread reconnects across workers.
jitter = random.uniform(0, 5.0)
time.sleep(jitter)

state = self.state
async def _maybe_apply_open_jitter_async(self) -> None:
"""Async version of the OPEN→HALF_OPEN jitter. Awaits
instead of blocking the event loop. See §7.2 #35."""
if self._state == CBState.OPEN and self._opened_at is not None:
time_in_open = time.monotonic() - self._opened_at
if time_in_open >= self._recovery_timeout:
jitter = random.uniform(0, 5.0)
await asyncio.sleep(jitter)

def _call_sync(self, func: Callable[..., Any], needs_open_jitter: bool, *args, **kwargs) -> Any:
"""Execute sync func through circuit breaker."""
if needs_open_jitter:
self._maybe_apply_open_jitter_sync()
state = self.state
if state == CBState.OPEN:
raise BreakerTransportError(
f"Circuit breaker OPEN -- service unavailable. "
f"Retry in {self._recovery_timeout:.0f}s"
)

if state == CBState.HALF_OPEN:
with self._lock:
if self._half_open_calls >= self._half_open_max_calls:
raise BreakerTransportError("Circuit breaker HALF_OPEN -- waiting")
self._half_open_calls += 1

# Check if func is a coroutine function (async)
import inspect
if inspect.iscoroutinefunction(func):
return self._call_async(func, *args, **kwargs)
else:
return self._call_sync(func, *args, **kwargs)

def _call_sync(self, func: Callable[..., Any], *args, **kwargs) -> Any:
"""Execute sync func through circuit breaker."""
try:
result = func(*args, **kwargs)
self._on_success()
Expand All @@ -303,8 +329,21 @@ def _call_sync(self, func: Callable[..., Any], *args, **kwargs) -> Any:
self._on_failure()
raise

async def _call_async(self, func: Callable[..., Any], *args, **kwargs) -> Any:
async def _call_async(self, func: Callable[..., Any], needs_open_jitter: bool, *args, **kwargs) -> Any:
"""Execute async func through circuit breaker."""
if needs_open_jitter:
await self._maybe_apply_open_jitter_async()
state = self.state
if state == CBState.OPEN:
raise BreakerTransportError(
f"Circuit breaker OPEN -- service unavailable. "
f"Retry in {self._recovery_timeout:.0f}s"
)
if state == CBState.HALF_OPEN:
with self._lock:
if self._half_open_calls >= self._half_open_max_calls:
raise BreakerTransportError("Circuit breaker HALF_OPEN -- waiting")
self._half_open_calls += 1
try:
result = await func(*args, **kwargs)
await self._on_success_async()
Expand Down
22 changes: 21 additions & 1 deletion src/nullrun/context.py
Original file line number Diff line number Diff line change
Expand Up @@ -111,17 +111,29 @@ def workflow(name: str | None = None) -> Generator[str, None, None]:
# was inconsistent with the rest of the SDK's id generation.
workflow_id = name or str(uuid.uuid4())
trace_id = generate_trace_id()
# §7.2 #16: a new workflow gets a fresh span_id too. The
# pre-fix code only reset workflow_id and trace_id, so a
# ``with span("inner"); with workflow("outer")`` block would
# leave the inner span_id visible inside the workflow scope —
# the span emitted by the workflow would carry the wrong
# parent. We set a new span_id here so the audit log can
# correctly nest the workflow's own span_start under the
# workflow_id (rather than under some earlier span that
# happened to be on the contextvar stack).
span_id = generate_span_id()

# Save current values
wf_token = _workflow_id_var.set(workflow_id)
trace_token = _trace_id_var.set(trace_id)
span_token = _span_id_var.set(span_id)

try:
yield workflow_id
finally:
# Restore previous values
_workflow_id_var.reset(wf_token)
_trace_id_var.reset(trace_token)
_span_id_var.reset(span_token)


@contextmanager
Expand Down Expand Up @@ -168,7 +180,15 @@ def agent(name: str | None = None) -> Generator[str, None, None]:
Yields:
The agent_id string
"""
agent_id = name or f"agent-{uuid.uuid4().hex}"
# P2-4 / S-8: emit a real UUID4 with dashes (matching
# ``generate_trace_id`` / ``generate_span_id``). The previous
# ``f"agent-{uuid.uuid4().hex}"`` format was 32 hex chars
# without dashes; backend UUID-typed columns (cost_events.
# agent_id, audit_log) silently dropped these to NULL on insert
# (``Uuid::parse_str(...).ok()`` returned None). User-supplied
# ``name`` is preserved verbatim so existing dashboards continue
# to work for already-allocated agent ids.
agent_id = name or str(uuid.uuid4())
token = _agent_id_var.set(agent_id)

try:
Expand Down
Loading
Loading