PatterAI · nicolotognoni · May 12, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,14 @@
 
 ## 0.6.1 (2026-05-12)
 
+### Fixed — OpenAI Realtime firstMessage silently cancelled on prewarm-adopted sessions
+
+With `agent.prewarm=true` (default) the OpenAI Realtime WebSocket is opened, primed (`session.created` → `session.update` → `session.updated`), and parked during the carrier ringing window; `StreamHandler` then adopts it at call pickup with `source=adopted ms=0`. The audio bridge between the Twilio/Telnyx stream and the upstream OpenAI session is therefore live the instant the callee answers. OpenAI's server-VAD treats any caller audio that arrives before the assistant's first audio frame as a barge-in and cancels the in-flight `response.create`. In practice the caller's "Hi" / "Hello?" reliably reaches OpenAI in the ~250-450 ms before the firstMessage audio starts streaming back — so the configured `first_message` was *silently cancelled* and the caller heard the agent respond to their hello instead of delivering the scripted opening. The cold-path `connect()` masked this because its WS handshake naturally buffered ~300 ms of caller silence.
+
+Fix: `send_first_message` (Py) / `sendFirstMessage` (TS) now arm a one-shot server-VAD lockout immediately before issuing `response.create` for the firstMessage turn. A `session.update` with `turn_detection: null` (an OpenAI-documented value that disables server-VAD entirely, so no audio-driven response cancellation) is sent first, then the `response.create`. The receive loop / message listener watches for the firstMessage `response.done` and re-issues a `session.update` restoring the original `turn_detection` block (snapshotted from `vad_type` / `silence_duration_ms` / `threshold` / `prefix_padding_ms`) so barge-in works normally for every subsequent turn. The lockout is strictly one-shot: subsequent `response.done` events (e.g. from later turns) do not re-send the restore. Best-effort: neither send can break the call. A failed lockout send falls back to the pre-fix behaviour; a failed restore send leaves server-VAD disabled, which degrades barge-in but does not end the call. Complements the client-side `_first_audio_sent_at` / `firstAudioSentAt` guard added in PR #92: that one prevents the local audio bridge from clearing the playout buffer on caller speech, this one prevents the *server* from cancelling the response.
+
+Files: `libraries/python/getpatter/providers/openai_realtime.py`, `libraries/typescript/src/providers/openai-realtime.ts`. Coverage: `libraries/python/tests/unit/test_providers_io_unit.py` (3 new tests covering lockout sequence, custom `silence_duration_ms` restore, and one-shot semantics) + `libraries/typescript/tests/unit/openai-realtime.test.ts` (4 new tests covering the same behaviours plus the no-ws no-op).
+
 ### Changed — `StreamHandler` adopt-capability check now uses duck typing
 
 The TS realtime adopt branch in `stream-handler.ts` previously relied on `this.adapter instanceof OpenAIRealtimeAdapter` to gate the prewarm-handoff path. Switched to a duck-type check (`typeof adapter.adoptWebSocket === 'function'`) so the generic stream-handler module stays provider-agnostic on this hot path and matches the Python handler's `getattr(self._adapter, "adopt_websocket", None)` shape. Files: `libraries/typescript/src/stream-handler.ts`.
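
Reviewer sketch of the lockout sequence described in the "Fixed" entry above. This is a minimal, self-contained illustration, not the code in `openai_realtime.py`: `FakeWebSocket`, `FirstMessageLockout`, and `on_response_done` are hypothetical names, the `response.create` payload is reduced to its `type`, and the VAD values (including `silence_duration_ms: 500`) are illustrative assumptions.

```python
from __future__ import annotations

import asyncio
import json


class FakeWebSocket:
    """Stand-in for the real WebSocket; records every frame sent."""

    def __init__(self) -> None:
        self.sent: list[dict] = []

    async def send(self, payload: str) -> None:
        self.sent.append(json.loads(payload))


class FirstMessageLockout:
    """Hypothetical helper isolating the one-shot server-VAD lockout."""

    def __init__(self, ws: FakeWebSocket, turn_detection: dict) -> None:
        self._ws = ws
        self._configured = turn_detection
        self._saved: dict | None = None
        self._pending = False

    async def send_first_message(self) -> None:
        # Snapshot the VAD config, then disable server-VAD for this turn.
        self._saved = dict(self._configured)
        self._pending = True
        try:
            await self._ws.send(json.dumps(
                {"type": "session.update", "session": {"turn_detection": None}}
            ))
        except Exception:
            # The lockout never took effect, so there is nothing to restore.
            self._pending = False
            self._saved = None
        await self._ws.send(json.dumps({"type": "response.create"}))

    async def on_response_done(self) -> None:
        # Restore VAD exactly once; later response.done events are no-ops.
        if not (self._pending and self._saved is not None):
            return
        saved, self._pending, self._saved = self._saved, False, None
        try:
            await self._ws.send(json.dumps(
                {"type": "session.update", "session": {"turn_detection": saved}}
            ))
        except Exception:
            pass  # best-effort: a failed restore must not break the call


async def demo() -> list[dict]:
    ws = FakeWebSocket()
    vad = {"type": "server_vad", "threshold": 0.5,
           "prefix_padding_ms": 300, "silence_duration_ms": 500}
    lockout = FirstMessageLockout(ws, vad)
    await lockout.send_first_message()
    await lockout.on_response_done()  # firstMessage turn: restores VAD
    await lockout.on_response_done()  # later turn: sends nothing
    return ws.sent


frames = asyncio.run(demo())
```

After the run, `frames` holds three entries in order: the lockout `session.update`, the `response.create`, and the restoring `session.update`. The second `on_response_done()` call adds nothing, which is the one-shot property the new tests are described as covering.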

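Reviewer sketch of the duck-typed capability check described in the "Changed" entry above, written in the Python `getattr(..., "adopt_websocket", None)` shape the entry mentions. `AdoptingAdapter`, `PlainAdapter`, and `try_adopt` are hypothetical names, and `adopt_websocket` is assumed to be synchronous here purely to keep the example short.

```python
class AdoptingAdapter:
    """Hypothetical adapter that can take over a prewarmed socket."""

    def __init__(self) -> None:
        self.adopted = None

    def adopt_websocket(self, ws: object) -> None:
        self.adopted = ws


class PlainAdapter:
    """Hypothetical adapter with no adopt capability."""


def try_adopt(adapter: object, parked_ws: object) -> bool:
    # Check for the capability by attribute, not by concrete class,
    # so the handler needs no provider-specific import.
    adopt = getattr(adapter, "adopt_websocket", None)
    if not callable(adopt):
        return False
    adopt(parked_ws)
    return True
```

Any adapter that exposes a callable `adopt_websocket` takes the prewarm-handoff path; every other adapter falls through to the cold connect path without the handler knowing which provider it is talking to.
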
diff --git a/libraries/python/getpatter/providers/openai_realtime.py b/libraries/python/getpatter/providers/openai_realtime.py
@@ -164,6 +164,16 @@ def __init__(
         import time as _time
 
         self._session_start_monotonic: float = _time.monotonic()
+        # ``send_first_message`` arms a one-shot server-VAD lockout so the
+        # firstMessage turn cannot be interrupted by the caller's first audio
+        # frames (which happens reliably on prewarm-adopted sessions where the
+        # adopt+response.create races the caller's "hello?" by a few hundred
+        # ms). The flag is consumed inside ``receive_events`` when the
+        # firstMessage ``response.done`` arrives, at which point we re-issue
+        # ``session.update`` to restore the ``turn_detection`` block that
+        # ``send_first_message`` snapshots. See its docstring for the rationale.
+        self._first_message_protection_pending: bool = False
+        self._saved_turn_detection: dict | None = None
 
     def record_session_end(self) -> None:
         """Emit ``patter.cost.realtime_minutes`` for the elapsed session duration."""
@@ -603,6 +613,38 @@ async def _iter_raw():
                     self._current_response_item_id = None
                     self._current_response_audio_ms = 0
                     self._current_response_first_audio_at = None
+                    # If ``send_first_message`` armed the server-VAD lockout
+                    # for the firstMessage turn, this ``response.done``
+                    # signals the firstMessage finished streaming and it is
+                    # safe to restore the original ``turn_detection`` so
+                    # barge-in works for the rest of the call. Best-effort:
+                    # a failed send leaves the session without VAD, which
+                    # degrades barge-in but does not break the call — the
+                    # next ``session.update`` (e.g. on a tool turn) would
+                    # also rearm. See ``send_first_message`` for the
+                    # full rationale.
+                    if (
+                        self._first_message_protection_pending
+                        and self._saved_turn_detection is not None
+                    ):
+                        try:
+                            await self._ws.send(
+                                json.dumps(
+                                    {
+                                        "type": "session.update",
+                                        "session": {
+                                            "turn_detection": self._saved_turn_detection,
+                                        },
+                                    }
+                                )
+                            )
+                        except Exception as exc:  # noqa: BLE001
+                            logger.debug(
+                                "first_message: turn_detection restore failed: %s",
+                                exc,
+                            )
+                        self._first_message_protection_pending = False
+                        self._saved_turn_detection = None
                     yield ("response_done", data.get("response", {}))
 
                 elif event_type == "error":
@@ -704,9 +746,63 @@ async def send_first_message(self, text: str) -> None:
         producing role-confused openings (e.g. a receptionist agent
         responding "I'd like to schedule a haircut" because it took its own
         first_message as a customer cue).
+
+        Server-VAD lockout during firstMessage
+        --------------------------------------
+
+        OpenAI Realtime server-VAD treats any caller audio that arrives
+        before the assistant's first audio frame as a barge-in and cancels
+        the in-flight ``response.create``. On the prewarm-adopted path
+        (``source=adopted ms=0``) the WS→audio bridge opens immediately at
+        call pickup; the caller's "Hi" / "Hello?" reliably reaches OpenAI in
+        the ~250-450 ms before the firstMessage audio starts streaming back,
+        so the configured ``first_message`` is *silently cancelled* and the
+        caller hears the agent respond to their hello instead of delivering
+        the scripted opening.
+
+        Fix: send a ``session.update`` that sets ``turn_detection`` to
+        ``None`` (OpenAI-documented: disables server-VAD entirely, no
+        audio-driven response cancellation), then ``response.create`` the
+        firstMessage. ``receive_events`` re-arms ``turn_detection`` from the
+        saved snapshot the moment ``response.done`` arrives for the
+        firstMessage turn, restoring normal barge-in for every subsequent
+        turn. The complementary client-side guard (``_first_audio_sent_at``
+        in ``stream_handler``) already stops caller speech from clearing the
+        local playout buffer; this lockout closes the gap on the *server* side.
+
+        Best-effort: if ``session.update`` raises we still proceed with
+        ``response.create``. The fallback then matches the pre-fix behaviour
+        (the first message may still be preempted) and is never worse, so the
+        call still completes.
         """
         if self._ws is None:
             return
+        # Snapshot the original turn_detection block so ``receive_events``
+        # can restore it after the firstMessage ``response.done``. Only
+        # ``vad_type`` and ``silence_duration_ms`` come from config; the two
+        # literals below must stay in sync with ``_build_session_config``.
+        self._saved_turn_detection = {
+            "type": self.vad_type,
+            "threshold": 0.5,
+            "prefix_padding_ms": 300,
+            "silence_duration_ms": self.silence_duration_ms,
+        }
+        self._first_message_protection_pending = True
+        try:
+            await self._ws.send(
+                json.dumps(
+                    {
+                        "type": "session.update",
+                        "session": {"turn_detection": None},
+                    }
+                )
+            )
+        except Exception as exc:  # noqa: BLE001 - best-effort lockout
+            logger.debug("send_first_message: turn_detection lockout failed: %s", exc)
+            # Clear protection state so receive_events doesn't restore a
+            # turn_detection we never actually disabled.
+            self._first_message_protection_pending = False
+            self._saved_turn_detection = None
         await self._ws.send(
             json.dumps(
                 {