Stream rework: harden R2 recording + migrate relay node-media-server → Icecast-KH/Liquidsoap (audio-only) #164

@anneoneone

Goal

Make the live stream stable, testable, always-recorded, low-latency, and observable. This issue is the output of a full evaluation of the current node-media-server (NMS) setup against six priorities (ordered): 1) stability 2) server-side test instance 3) always-recorded to R2 4) low latency 5) live metrics 6) other. It is written to be executed cold — a future session should be able to pick it up without re-doing the research.

TL;DR verdict: the ingest + recording design is sound, but NMS is the weakest link for a stability-first audio platform, and priorities #1, #2, #3, #5 are effectively unmet today. Recommendation: keep the Rust→R2 recorder, harden it, and migrate the public relay to Icecast-KH + Liquidsoap (purpose-built for radio), parallel-run, never big-bang.


Decisions already locked (do NOT re-litigate)

  • Audio-only. No video now or planned. (Current ffmpeg pipeline is AAC audio, no video track.)
  • Keep the ingest leg (browser → WebSocket → Rust → ffmpeg) and keep the Rust→R2 recorder as the system of record in every option.
  • Target relay = Icecast-KH + Liquidsoap. AzuraCast was considered and rejected as default (heavy; its per-DJ-session recorder would regress requirement #3). SRS/MediaMTX only matter if video ever enters scope.
  • Migration is parallel-run behind a flag, soak-tested, then flip the ffmpeg push target. A risky big-bang migration is itself a reliability regression.

Current architecture (verified in source — backend/)

  • Ingest: src/stream_bridge.rs start_stream() spawns ffmpeg -f webm -i pipe:0 -c:a aac -b:a 192k -ar 48000 -ac 2 -f flv <rtmp>. RTMP target + key are hardcoded in src/config.rs (default_rtmp_url = rtmp://stream.moafunk.de/live, default_rtmp_stream_key = stream-io).
  • Recording: src/stream_bridge.rs write_chunk() (~line 167) tees the raw browser Opus/WebM to one unbounded local file (./data/recordings-temp/recording_<show>_<ts>.webm). On stop it uploads to R2 recordings/<show>/<ver>/raw.webm; an optional ffmpeg finalize remixes HQ tracks → final.mp3. Write failures are swallowed (warn! only, returns Ok) ~line 181. Recording is opt-in + client-coupled: started from the frontend flow (frontend/src/admin/pages/StreamPage.vue:20, flow/FlowOnAir.vue:193, flow/FlowWaiting.vue:194) via POST /api/recording/start — if the browser tab/WS drops, recording stops with it.
  • R2/S3 client: src/storage.rs (aws-sdk-s3, force_path_style, region auto). Recording upload is single-shot put_object; multipart exists only for admin manual uploads >100 MB (src/handlers/upload_recording_chunked.rs).
  • "Test": src/handlers/stream_test_ws.rs is a browser loopback (buffers ~10 s, echoes audio back to the broadcaster). It never touches the stream server / RTMP / real codec path. There is no server-side test path.
  • Live data: none. No listener count, bitrate, or stream health is fetched from NMS. Online/offline is inferred two ways: backend in-memory StreamState (lost on restart) and the public frontend polling the HLS manifest with a HEAD every 8 s (frontend/src/main.ts).
  • NMS is not in the repo (no compose/config/webhooks). Because the frontend pulls …/stream-io/index.m3u8, the deployed NMS must be on the legacy v2.x line: v4 dropped HLS entirely.
  • Recording schema: recording_versions table (status raw→finalizing→finalized/failed, raw_key/markers_key/final_key) and shows.recording_key/recording_filename. See src/db.rs.

Scorecard (current setup)

| # | Requirement | Rating | Why |
|---|-------------|--------|-----|
| 1 | Stability / reliable | Poor | Single-maintainer Node process, unfixed OOM history, no supervision/failover, single-VPS SPOF. |
| 2 | Test before go-live | Absent | Browser loopback only; no server-side test mount/instance; one hardcoded prod key. |
| 3 | Always recorded → R2 | Fragile | Decoupled (good) but opt-in/client-coupled, single unbounded WebM (interrupted = unrecoverable), single-shot upload, silent write failures, no integrity/duration check, no alert if a show didn't record. |
| 4 | Low latency | Good | FLV path ~1–3 s; latency is not the problem. |
| 5 | Live metrics | Poor | Zero telemetry captured; no listener count at all. |
| 6 | Other | Poor | Hardcoded stream-io key = hijack risk, no rotation/per-broadcast auth; no alerting; TLS/mixed-content unaddressed. |

Open questions to resolve FIRST (Phase 0)

  • Which NMS major version is deployed? On the Hetzner box: npm ls node-media-server / check the app's package.json. Quick proof it serves HLS (⇒ v2.x): curl -I https://stream.moafunk.de/live/stream-io/index.m3u8 → 200. Pin it; never npm install blindly (a fresh install pulls v4 = FLV-only = breaks index.m3u8).
  • Realistic peak concurrent-listener count? Tens vs thousands decides whether single-VPS-direct is fine or a CDN/second-relay tier is mandatory (the single VPS is the dominant reliability risk; no relay choice fixes it alone).
  • Confirm iOS reach requirement holds (Icecast MP3/AAC plays in a native <audio> tag, so this gets simpler, not harder).

Phase 1 — Harden recording (HIGHEST VALUE, relay-independent — do this first)

This is independent of any relay decision and is the highest-leverage reliability work. Touch src/stream_bridge.rs, src/handlers/recording.rs, src/storage.rs.

  • Make recording automatic on go-live (not an opt-in frontend call). Start the recording tee whenever a stream starts; stop+upload whenever it ends. Decouple from the browser lifecycle so a dropped tab doesn't end the archive.
  • Replace the single WebM tee with the ffmpeg segment muxer → MPEG-TS: -f segment -segment_time <e.g.10> -segment_format mpegts -reset_timestamps 1 out_%05d.ts. A crash then loses ≤1 segment instead of the whole file. (Audio-only = every frame effectively a keyframe, no alignment caveat.) Concat-demux at finalize.
  • Switch the R2 upload from single-shot put_object to S3 multipart (resumable, per-part retry, abort-on-failure). Derive the part size from the expected maximum object size: divide by a part count safely below the 10,000-part limit and apply a floor of ~16 MiB. Add a lifecycle rule to auto-abort stale multipart uploads (R2 default 7 days). Respect R2's 1-write/sec/key limit (deterministic keys + abort-then-restart on retry).
  • R2 CHECKSUM LANDMINE (must-do): aws-sdk-s3 default checksums (CRC32) have broken R2 uploads in the field (early-2025). Pin a known-good SDK version, set an R2-supported algorithm EXPLICITLY (SHA-256 or CRC32C) at CreateMultipartUpload, and integration-test against a real R2 bucket. Do not assume CRC64-NVME works through the Rust SDK against R2 — validate.
  • Verify before delete: after upload, HEAD for size+checksum and ffprobe duration vs scheduled show length. (A recorder that died after 2 min still yields a valid small object that passes naive checks.) Keep raw .ts until the concatenated artifact is verified in R2.
  • Stop swallowing recording write failures (stream_bridge.rs ~line 181): surface disk-full / write errors (status flag + log + alert) instead of warn!+Ok.
  • Dead-man's-switch: per-show check "object present in R2 with plausible duration + bitrate" → Telegram alert (src/telegram_notify.rs) if a scheduled show did not produce a recording. Test the switch on a schedule.
  • Cleanup job for orphaned ./data/recordings-temp/ files older than ~1 day.
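
The part-size rule from the multipart bullet above can be sketched as plain arithmetic. This is a minimal sketch, not the backend's code: part_size, part_count, MIN_PART_BYTES and MAX_PARTS are hypothetical names, and the 9,000-part headroom is an assumed value chosen only to stay under the 10,000-part limit.

```rust
/// Floor for the part size (~16 MiB, as suggested in the bullet above).
const MIN_PART_BYTES: u64 = 16 * 1024 * 1024;
/// Assumption: leave headroom below the 10,000-part multipart limit.
const MAX_PARTS: u64 = 9_000;

/// Ceiling division for positive divisors.
fn div_ceil(a: u64, b: u64) -> u64 {
    (a + b - 1) / b
}

/// Pick a part size for an upload whose final size is only known as an
/// upper bound (e.g. longest plausible show at the recording bitrate).
fn part_size(expected_max_bytes: u64) -> u64 {
    // Smallest size that lets MAX_PARTS parts cover the whole object.
    let needed = div_ceil(expected_max_bytes, MAX_PARTS);
    // Never go below the floor: small parts mean many requests and retries.
    needed.max(MIN_PART_BYTES)
}

/// Number of parts an object of `total_bytes` needs at `part_bytes` each.
fn part_count(total_bytes: u64, part_bytes: u64) -> u64 {
    div_ceil(total_bytes, part_bytes)
}
```

For a four-hour show at 192 kbit/s (about 345.6 MB) the floor wins and the upload needs only a few dozen parts; the division only matters for very large objects. The part size has to be fixed before the first part is sent, which is why it is derived from an upper bound rather than from the final size.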

Acceptance: kill ffmpeg mid-show (kill -9) → all but the last segment survive, finalize produces a valid full-length file in R2, and a deliberately-skipped recording fires a Telegram alert. Integration test runs against a real R2 bucket and passes the checksum + duration checks.
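
The verify-before-delete and dead-man's-switch checks above boil down to one decision per scheduled show. The sketch below shows one possible shape for it, under stated assumptions: Probed, Verdict and check_recording are hypothetical names, the probed values are assumed to come from an R2 HEAD plus ffprobe, and the 32–512 kbit/s window and the min_fraction threshold are illustrative, not decided values.

```rust
use std::time::Duration;

/// What a HEAD + ffprobe pass reported for the uploaded object (hypothetical shape).
struct Probed {
    size_bytes: u64,
    duration: Duration,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Ok,
    /// No object in R2 at all: the dead-man's switch should alert.
    Missing,
    /// Object exists but covers too little of the scheduled show.
    TooShort { got_secs: u64 },
    /// Average bitrate is outside the assumed plausible window.
    BitrateImplausible { kbps: u64 },
}

/// Decide whether a scheduled show produced a plausible recording.
/// `min_fraction` (e.g. 0.9) is an assumed tolerance, not a project decision.
fn check_recording(scheduled: Duration, probed: Option<&Probed>, min_fraction: f64) -> Verdict {
    let p = match probed {
        Some(p) => p,
        None => return Verdict::Missing,
    };
    // A recorder that died after 2 minutes still yields a valid small file,
    // so compare the probed duration with the scheduled length.
    if p.duration.as_secs_f64() < scheduled.as_secs_f64() * min_fraction {
        return Verdict::TooShort { got_secs: p.duration.as_secs() };
    }
    // Average bitrate in kbit/s; guards against silent or truncated payloads.
    let secs = p.duration.as_secs().max(1);
    let kbps = p.size_bytes * 8 / 1000 / secs;
    if !(32..=512).contains(&kbps) {
        return Verdict::BitrateImplausible { kbps };
    }
    Verdict::Ok
}
```

Anything other than Verdict::Ok would keep the local .ts segments and trigger the Telegram alert. Returning a distinct Missing variant keeps "show never recorded" separate from "recording is suspicious" in the alert text.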


Phase 2 — Stand up Icecast-KH + Liquidsoap (also satisfies #2 "test instance")

Run on the Hetzner box in parallel with NMS, behind a flag. The staging mount is the server-side test instance.

  • Deploy Icecast-KH (not stock Xiph Icecast: in benchmarks stock froze at ~3k listeners, versus ~30k for KH). Containerized, on its own port.
  • Deploy Liquidsoap 2.4.4 (skip 2.4.3 — shared-encoder crash). Playout: fallback(track_sensitive=false, [live, playlist, single]) so dead air auto-fills. Apply blank.strip to the live branch ONLY; do NOT pair blank.strip with mksafe on the same source (documented CPU-spike/"catchup" breakups, liquidsoap #3439/#3474) — make the whole fallback safe by ending the list with a single.
  • Two mounts: /live (production) and /test (broadcaster-only preview), each with separate source credentials. This replaces the hardcoded stream-io key and gives per-namespace auth.
  • Change the producer: in src/stream_bridge.rs, make the ffmpeg output target configurable (env/config, not hardcoded) and add an Icecast output: … -vn -content_type audio/mpeg -f mp3 icecast://source:<pw>@host:port/<mount> (or -f opus/-f adts). Keep the recording tee branch independent of the push-target branch: a "go live" (test→live) switch must NOT restart/disturb the tee, or it punches a gap in the recording (threatens requirement #3).
  • Codec: Opus primary mount (best quality/bitrate, browser-native) + an MP3 fallback mount for legacy players.
  • TLS / mixed content: the public site is HTTPS (GitHub Pages), so a plain http:// mount is blocked as mixed content. Terminate HTTPS at nginx/Caddy in front of Icecast. (This is mandatory, not optional.)
  • Broadcaster preview player pointed at /test (plain <audio src>), visible only to the broadcaster.
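
The configurable push target from the producer bullet above might be built roughly like this. It is a sketch under assumptions: IcecastTarget, icecast_url and ffmpeg_output_args are hypothetical names, the libmp3lame encoder and 192k bitrate are placeholders mirroring the current AAC settings, and it assumes ffmpeg percent-decodes the password in the URL (passing it through ffmpeg's -password option instead would avoid relying on that).

```rust
/// Hypothetical push-target config; in the backend this would be loaded
/// from env/config instead of being hardcoded.
struct IcecastTarget {
    host: String,
    port: u16,
    mount: String, // "/test" or "/live"
    source_password: String,
}

/// Percent-encode everything outside the RFC 3986 unreserved set so that
/// characters like '@' or '/' in a password cannot break the URL.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for b in input.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Build icecast://source:<pw>@host:port/<mount>.
fn icecast_url(t: &IcecastTarget) -> String {
    let mount = t.mount.trim_start_matches('/');
    format!(
        "icecast://source:{}@{}:{}/{}",
        percent_encode(&t.source_password),
        t.host,
        t.port,
        mount
    )
}

/// Output-side ffmpeg arguments for the push branch only; the recording
/// tee is deliberately not part of this so a target switch cannot touch it.
fn ffmpeg_output_args(t: &IcecastTarget) -> Vec<String> {
    let mut args: Vec<String> = [
        "-vn",
        "-c:a", "libmp3lame", // assumption: MP3 fallback mount
        "-b:a", "192k",       // assumption: same bitrate as today's AAC
        "-content_type", "audio/mpeg",
        "-f", "mp3",
    ]
    .iter()
    .map(|s| s.to_string())
    .collect();
    args.push(icecast_url(t));
    args
}
```

Switching from /test to /live then means building a new IcecastTarget and restarting only the push process. Avoid logging the resulting argument list, since it contains the source password.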

Acceptance: broadcaster can push to /test, hear themselves via the preview player, and curl https://<host>/status-json.xsl shows the listeners count move — all without touching /live.


Phase 3 — Cutover (parallel-run → flip)

  • Soak-test a full-length show on /test on a real phone over cellular while NMS still serves production.
  • Flip the ffmpeg push target test→live and switch the frontend player from HLS/FLV to the Icecast mount (frontend/src/config.ts, frontend/src/streamDetector.ts, frontend/src/main.ts — the HLS HEAD-poll status check becomes a /status-json.xsl / mount probe).
  • Keep NMS warm for fast rollback for a few shows.

Acceptance: a real show runs end-to-end on Icecast (desktop + iOS), recording lands in R2 verified, NMS can be re-enabled in <5 min if needed.


Phase 4 — Observability, alerting & SPOF

  • Listener/quality telemetry → Rust backend (priority #5): poll Icecast GET /status-json.xsl → per-mount listeners (a genuine concurrent count, which HLS structurally can't give), audio_bitrate, samplerate, channels. Handle icestats.source being an object (1 mount) vs array (N mounts). ffprobe -print_format json -show_streams <mount> for codec/bitrate. Poll every 10–15 s and cache in the backend; surface to the admin SPA + a new /api/stream/metrics.
  • Prometheus + Grafana + Alertmanager + Blackbox. Drop in markuslindenberg/icecast_exporter (icecast_listeners on :9146) for history with near-zero code. Bind all admin/metrics ports to 127.0.0.1 — never expose publicly. Don't scrape per-client endpoints (cardinality blow-up).
  • Alerts: Blackbox probe_success == 0 for 5m on the public stream URL (relay-down, independent of server metrics) + absent(<listeners metric>) == 1. CRITICAL: do NOT use "or on() vector(0)" inside an alert rule: vector(0) is always present and < 1, so the rule fires permanently; that trick is for Grafana panels only. Keep "stream-down" and "zero-listeners" as two separate alerts, or gate zero-listeners on a "show-scheduled" signal. Add recording-failure alerts: r2_upload_failures_total > 0, recording byte-rate stalls while stream up, ffmpeg/tee restarts.
  • Dead-air detection: byte/listener metrics stay nominal during silence — add a loudness/silence probe (Liquidsoap blank.strip already covers the audio path; consider an EBU-R128 tap for alerting).
  • Reliability hardening: run the relay under systemd Restart=always + hard memory cap (cgroup MemoryMax).
  • Kill the single-VPS SPOF: put Cloudflare/a CDN or a second relay in front of the public mount (higher-leverage than the relay choice itself). Scope this by the peak-listener answer from Phase 0.
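
The object-vs-array quirk of icestats.source mentioned in the telemetry bullet above is easiest to contain by normalizing to a list right after parsing. The sketch below models only that normalization step with std types; MountStats, SourceField, listeners_for and total_listeners are hypothetical names, JSON parsing itself is left out, and matching a mount by the suffix of listenurl is an assumption about how mounts are identified in the payload.

```rust
/// Subset of one source entry from /status-json.xsl (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct MountStats {
    listenurl: String,
    listeners: u32,
}

/// The three shapes `icestats.source` can take after parsing.
/// Assumption: the key is missing entirely when no source is connected.
enum SourceField {
    Absent,
    One(MountStats),
    Many(Vec<MountStats>),
}

impl SourceField {
    /// Normalize to a list so callers never branch on the wire shape.
    fn into_vec(self) -> Vec<MountStats> {
        match self {
            SourceField::Absent => Vec::new(),
            SourceField::One(m) => vec![m],
            SourceField::Many(v) => v,
        }
    }
}

/// Listener count for one mount (e.g. "/live"), or None if it is not up.
fn listeners_for(sources: &[MountStats], mount: &str) -> Option<u32> {
    sources
        .iter()
        .find(|s| s.listenurl.ends_with(mount))
        .map(|s| s.listeners)
}

/// Sum across all mounts, e.g. Opus primary plus MP3 fallback.
fn total_listeners(sources: &[MountStats]) -> u32 {
    sources.iter().map(|s| s.listeners).sum()
}
```

With a JSON library, the same enum would typically be filled by an untagged one-or-many deserializer. Returning None for a missing mount lets the backend tell "mount down" apart from "mount up with zero listeners", which matches the two separate alerts described above.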

Acceptance: admin dashboard shows live listener count + bitrate; stopping the stream fires exactly one "stream-down" alert (not a permanently-firing one); a missing recording fires an alert.


Gotchas / landmines (read before coding)

  • NMS v2→v4 HLS schism: a fresh npm install node-media-server pulls v4 (FLV-only) and breaks index.m3u8. Pin v2.x while NMS is still in the loop. (issue #669)
  • R2 + aws-sdk-s3 default checksums have broken uploads in the field — pin SDK + explicit algorithm + integration-test (see Phase 1).
  • blank.strip + mksafe on the same Liquidsoap source → CPU-spike breakups. Live branch only.
  • Go-live switch must not gap the recording — keep tee independent of push target.
  • HLS listener counts are inherently approximate (stateless segment GETs); Icecast's persistent connection gives a real count — one more reason to migrate.
  • Mixed content: http:// stream won't play from the HTTPS frontend — HTTPS-front the mount.
  • Latency tuning for audio: GOP/-keyint_min/-tune zerolatency are video-only no-ops for AAC/Opus; for audio the knobs are -hls_time/player live-edge (only relevant if you stay HLS-based — Icecast needs none).

Suggested decomposition (optional)

If this is too big for one PR, split into: (A) recording hardening [project::recording], (B) Icecast+Liquidsoap stack + test mount [project::Stream project::Infrastructure], (C) cutover + frontend player, (D) observability + SPOF [project::Infrastructure]. Phase 1 (A) is independently shippable and the highest priority.


Subtickets (milestone: Stream rework: recording hardening + Icecast migration)

▶ START HERE: ship #169 first. Phase 1 (recording hardening) is relay-independent and the highest-value, independently-shippable work; do all of #169 through #172 before touching the relay. The #168 spike can run in parallel and gates Phase 2 (#173) and Phase 4 SPOF scoping (#178). Execution order = issue-number order. Pick up work from the sub-issues, not this parent.

Phase 0 — discovery

Phase 1 — recording hardening (relay-independent, highest value, ship first)

Phase 2 — Icecast-KH + Liquidsoap stack + test mount

Phase 3 — cutover

Phase 4 — observability, alerting & SPOF

    Labels

    project::Infrastructure (Area: hosting, networking, server comms)
    project::Stream (Area: live audio streaming & stream control)
    project::recording (Area: show/voicemail recording & post-processing)
    type::backend (Layer: Rust/Axum backend (handlers, DB, integrations))
    type::ci (Layer: CI/CD, Docker, deploy & Python automation)
