fix(rtmg/web): bulk sends freeze all reads — chunked sends + write_audio recv hardening #245

Merged
leszko merged 2 commits into ryanontheinside/feat/models/rt-input from gio/fix/rt-input-ws-read-starvation
Jun 11, 2026

@gioelecerati

Collaborator

Stacked on the rt-input branch (#235). Server half of the deadlock that made live sequencer splices dead end-to-end — client half is rtmg-vst#33. Both were running as hot-patches on the :dev pod (40431735) during today's debugging with @gioelecerati and validated live (paint → splice → write_audio_applied → ack → audible, ~1.5 s + emergence); this PR makes them durable before the next bake wipes them.

1. Chunked (fragmented) bulk sends

websockets-sync holds protocol_mutex across socket.sendall, and recv_events — the thread that reads every inbound frame — needs that same mutex. One 11 MB stem send froze all reads until the peer drained it; against a VST mid write_audio upload (whose own reads were gated behind its sends — ixwebsocket bug, fixed in rtmg-vst#33) the two sides deadlocked permanently: params dead, splices never received, keepalive killed the session (1011 via the CF tunnel).

Thread dump of the live wedge:

  • conn_handler: send_stem_payload → socket.sendall, holding protocol_mutex
  • recv_events: blocked acquiring protocol_mutex
  • keepalive: blocked acquiring protocol_mutex
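
The wedge in the dump above can be reproduced in miniature without any network code. The sketch below is only a toy model: `protocol_mutex` here is a plain `threading.Lock` standing in for the library's internal mutex, and `reader_gets_mutex`, `monolithic_send`, and `fragmented_send` are hypothetical names, not functions from this codebase. It shows that a reader probing for the mutex never gets it while one holder keeps it across the whole payload, and always gets it when the holder releases between pieces.

```python
import threading

# Stand-in for the library's internal mutex (assumption: a plain Lock is
# enough to model the contention; the real object lives inside websockets).
protocol_mutex = threading.Lock()


def reader_gets_mutex() -> bool:
    """Try once, from another thread, to take the mutex as recv_events would."""
    outcome = []

    def probe():
        got = protocol_mutex.acquire(blocking=False)
        if got:
            protocol_mutex.release()
        outcome.append(got)

    t = threading.Thread(target=probe)
    t.start()
    t.join()
    return outcome[0]


def monolithic_send(pieces: int) -> list:
    """Hold the mutex for the whole payload; probe the reader after each piece."""
    seen = []
    with protocol_mutex:
        for _ in range(pieces):
            seen.append(reader_gets_mutex())
    return seen


def fragmented_send(pieces: int) -> list:
    """Take the mutex once per piece; probe the reader in the gap after it."""
    seen = []
    for _ in range(pieces):
        with protocol_mutex:
            pass  # one fragment would be written to the socket here
        seen.append(reader_gets_mutex())
    return seen


print(monolithic_send(3))  # [False, False, False]
print(fragmented_send(3))  # [True, True, True]
```

The model ignores the peer entirely; in the live wedge the single long hold only became permanent because the other side was also unable to read.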

Fix: stems, the post-swap source mirror, and slice frames go out as fragmented messages in ~256 KiB pieces (chunked_ws_send in audio_codec.py) — the mutex releases between fragments so reads always interleave. Fragmentation is invisible at the message layer; payload bytes are identical (verified with the web SDK and the VST's ixwebsocket).
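
For readers without the diff open, here is a minimal sketch of what a helper like `chunked_ws_send` could look like. The real one lives in audio_codec.py and is not reproduced here, so the body, the `iter_fragments` helper, and the `CHUNK_SIZE` constant are assumptions. It relies on the documented behaviour of the websockets library that `send()` given an iterable of bytes emits one fragmented message, one fragment per item.

```python
from typing import Iterator

CHUNK_SIZE = 256 * 1024  # ~256 KiB, the piece size quoted in this PR


def iter_fragments(payload: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive slices of payload, each at most chunk_size bytes."""
    view = memoryview(payload)
    for start in range(0, len(view), chunk_size):
        yield bytes(view[start:start + chunk_size])


def chunked_ws_send(ws, payload: bytes, chunk_size: int = CHUNK_SIZE) -> None:
    """Send payload as one message, fragmented when it exceeds chunk_size.

    `ws` is assumed to be a websockets sync connection (anything whose
    send() accepts bytes or an iterable of bytes will do).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(payload) <= chunk_size:
        # Small (or empty) payloads go out as a single unfragmented frame;
        # an empty iterable would otherwise send nothing at all.
        ws.send(payload)
        return
    ws.send(iter_fragments(payload, chunk_size))
```

The receiver still sees a single binary message, so nothing changes at the call sites that consume stems or slice frames.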

Note: this half also applies to main — stem delivery freezes reads there too (e.g. against a swap upload in flight) — and is worth cherry-picking independently of rt-input.

2. write_audio payload read hardening

The binary payload was read with a bare blocking recv() and no type check:

  • an orphan write_audio header consumed the next JSON command as its payload (probe-reproduced: audio_write_failed: a bytes-like object is required, not 'str'),
  • a payload that never arrived blocked the recv loop forever, wedging the whole session.

The read now has a 10 s timeout and a bytes type check; both failure modes answer audio_write_failed and keep the session alive.
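
A rough sketch of the hardened read, assuming a websockets sync connection whose `recv(timeout=...)` raises `TimeoutError` when nothing arrives in time. The function name, return shape, and error strings below are illustrative only and are not taken from the diff.

```python
from typing import Optional, Tuple

PAYLOAD_TIMEOUT_S = 10.0  # the 10 s bound described above


def recv_binary_payload(
    ws, timeout: float = PAYLOAD_TIMEOUT_S
) -> Tuple[Optional[bytes], Optional[str]]:
    """Read the frame that should follow a write_audio header.

    Returns (payload, None) on success or (None, reason) on failure, so the
    caller can answer audio_write_failed and keep serving the session.
    """
    try:
        frame = ws.recv(timeout=timeout)
    except TimeoutError:
        # The payload never arrived: give up instead of blocking forever.
        return None, f"payload not received within {timeout:g} s"
    if not isinstance(frame, (bytes, bytearray)):
        # An orphan header swallowed the next text frame (likely a JSON
        # command); include a short preview so the lost frame is traceable.
        preview = repr(frame)[:80]
        return None, f"expected binary payload, got {type(frame).__name__}: {preview}"
    return bytes(frame), None
```

Note that in the not-binary case the consumed frame is already gone from the socket; the sketch only reports it, it does not re-dispatch it.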

Remaining latency levers (not in this PR)

Working end-to-end latency is ~1.5 s transport + 2–5 s emergence. The bar ships as f32 (1.49 MB) — f16/s16/zstd would cut upload 2–10×; #240's near-playhead repatch attacks the emergence delay.
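
To make the low end of that range concrete: s16 uses 2 bytes per sample against 4 for f32, so the 1.49 MB bar would shrink to roughly 0.75 MB before any compression. The snippet below is a standalone illustration of that halving using only the standard library; `f32_to_s16_bytes` is a hypothetical helper, and the clip-and-scale quantisation is one common choice, not necessarily what this project would ship.

```python
from array import array


def f32_to_s16_bytes(samples) -> bytes:
    """Clip samples to [-1, 1] and pack them as signed 16-bit integers."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return array("h", (int(round(s * 32767)) for s in clipped)).tobytes()


samples = [0.0, 0.25, -0.25, 1.0, -1.0, 2.0]  # 2.0 is out of range and gets clipped
f32_bytes = array("f", samples).tobytes()
s16_bytes = f32_to_s16_bytes(samples)

# On common platforms 'f' is 4 bytes and 'h' is 2 bytes, so the ratio is 2x.
print(len(f32_bytes), len(s16_bytes))
```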

🤖 Generated with Claude Code

gioelecerati and others added 2 commits June 11, 2026 11:33
…io recv hardening

Two server-side halves of the rt-input deadlock (the client halves are
rtmg-vst#33), both reproduced and validated live against the :dev pod:

1. websockets-sync holds protocol_mutex across socket.sendall, and
   recv_events — the thread that reads EVERY inbound frame — needs the
   same mutex. A single 11 MB stem send therefore froze all reads until
   the peer drained it; against a VST mid write_audio upload (its own
   reads gated behind its sends, see rtmg-vst#33) the two sides wedged
   permanently: params dead, splices never received, keepalive killed
   the session (1011 via the tunnel). Thread dump of the live wedge:
   conn_handler in sendall holding protocol_mutex; recv_events and
   keepalive blocked acquiring it. Big payloads (stems, the post-swap
   source mirror, slice frames) now go out as fragmented messages in
   ~256 KiB pieces — the mutex releases between fragments so reads
   interleave and the cycle cannot form. Fragmentation is invisible at
   the message layer; payload bytes are identical.

2. write_audio's binary payload was read with a bare blocking recv and
   no type check: an orphan header consumed the NEXT JSON command as
   its payload (audio_write_failed: "a bytes-like object is required,
   not 'str'"), and a payload that never arrived blocked the recv loop
   forever — wedging the whole session. The read now has a 10 s timeout
   and a bytes type check; both failure modes answer audio_write_failed
   and keep the session alive.

The chunked-send half also applies to main (stem delivery freezes reads
there too, e.g. against a swap upload in flight) and is worth
cherry-picking independently of rt-input.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…d reads

Review fixes for #245:

- The post-swap source mirror in _serialize_swap_ready was still a plain
  ws.send of a full-length f16 buffer (tens of MB) — the largest single
  payload on the wire and exactly the read-freezing sendall this PR
  exists to eliminate. It now goes through chunked_ws_send.
- The 10 s timeout + binary type check added for write_audio now covers
  set_timbre_source, set_structure_source, and the client-upload arm of
  swap_source via a shared _recv_binary_payload helper — same orphan-
  header wedge class, same graceful *_failed answer. The not-binary log
  includes a preview of the consumed frame so a dropped JSON command is
  traceable.
- The control-bus recv thunk accepts (and ignores) the timeout kwarg, so
  the TypeError fallback around recv_audio(timeout=10) is gone — it
  fired on every MCP-injected write_audio in production and could mask a
  genuine TypeError from inside ws.recv.
- chunked_ws_send: rename the _chunk param to chunk_size and annotate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@leszko leszko merged commit 0c13e3c into ryanontheinside/feat/models/rt-input Jun 11, 2026
@leszko leszko deleted the gio/fix/rt-input-ws-read-starvation branch June 11, 2026 10:07