Make the live stream stable, testable, always-recorded, low-latency, and observable. This issue is the output of a full evaluation of the current node-media-server (NMS) setup against six priorities (ordered): 1) stability 2) server-side test instance 3) always-recorded to R2 4) low latency 5) live metrics 6) other. It is written to be executed cold — a future session should be able to pick it up without re-doing the research.
TL;DR verdict: the ingest + recording design is sound, but NMS is the weakest link for a stability-first audio platform, and priorities #1, #2, #3, #5 are effectively unmet today. Recommendation: keep the Rust→R2 recorder, harden it, and migrate the public relay to Icecast-KH + Liquidsoap (purpose-built for radio), parallel-run, never big-bang.
Decisions already locked (do NOT re-litigate)
Audio-only. No video now or planned. (Current ffmpeg pipeline is AAC audio, no video track.)
Keep the ingest leg (browser → WebSocket → Rust → ffmpeg) and keep the Rust→R2 recorder as the system of record in every option.
Target relay = Icecast-KH + Liquidsoap. AzuraCast was considered and rejected as default (heavy; its per-DJ-session recorder would regress requirement #3). SRS/MediaMTX only matter if video ever enters scope.
Migration is parallel-run behind a flag, soak-tested, then flip the ffmpeg push target. A risky big-bang migration is itself a reliability regression.
Current architecture (verified in source — backend/)
Ingest: src/stream_bridge.rs start_stream() spawns ffmpeg -f webm -i pipe:0 -c:a aac -b:a 192k -ar 48000 -ac 2 -f flv <rtmp>. RTMP target + key are hardcoded in src/config.rs (default_rtmp_url = rtmp://stream.moafunk.de/live, default_rtmp_stream_key = stream-io).
Recording: src/stream_bridge.rs write_chunk() (~line 167) tees the raw browser Opus/WebM to one unbounded local file (./data/recordings-temp/recording_<show>_<ts>.webm). On stop it uploads to R2 recordings/<show>/<ver>/raw.webm; an optional ffmpeg finalize remixes HQ tracks → final.mp3. Write failures are swallowed (warn! only, returns Ok) ~line 181. Recording is opt-in + client-coupled: it is started from the frontend flow (frontend/src/admin/pages/StreamPage.vue:20, flow/FlowOnAir.vue:193, flow/FlowWaiting.vue:194) via POST /api/recording/start, so if the browser tab/WS drops, recording stops with it.
R2/S3 client: src/storage.rs (aws-sdk-s3, force_path_style, region auto). Recording upload is single-shot put_object; multipart exists only for admin manual uploads >100 MB (src/handlers/upload_recording_chunked.rs).
"Test": src/handlers/stream_test_ws.rs is a browser loopback (buffers ~10 s, echoes audio back to the broadcaster). It never touches the stream server / RTMP / real codec path. There is no server-side test path.
Live data: none. No listener count, bitrate, or stream health is fetched from NMS. Online/offline is inferred two ways: backend in-memory StreamState (lost on restart) and the public frontend polling the HLS manifest with a HEAD every 8 s (frontend/src/main.ts).
NMS is not in the repo (no compose/config/webhooks). Because the frontend pulls …/stream-io/index.m3u8, the deployed NMS must be on the legacy v2.x line, since v4 dropped HLS entirely.
Recording schema: recording_versions table (status raw→finalizing→finalized/failed, raw_key/markers_key/final_key) and shows.recording_key/recording_filename. See src/db.rs.
Scorecard (current setup)
| # | Requirement | Rating | Why |
|---|---|---|---|
| 1 | Stability / reliable | Poor | Single-maintainer Node process, unfixed OOM history, no supervision/failover, single-VPS SPOF. |
| 2 | Test before go-live | Absent | Browser loopback only; no server-side test mount/instance; one hardcoded prod key. |
| 3 | Always recorded → R2 | Fragile | Decoupled (good) but opt-in/client-coupled, single unbounded WebM (interrupted = unrecoverable), single-shot upload, silent write failures, no integrity/duration check, no alert if a show didn't record. |
| 4 | Low latency | Good | FLV path ~1–3 s; latency is not the problem. |
| 5 | Live metrics | Poor | Zero telemetry captured; no listener count at all. |
| 6 | Other | Poor | Hardcoded stream-io key = hijack risk, no rotation/per-broadcast auth; no alerting; TLS/mixed-content unaddressed. |
Open questions to resolve FIRST (Phase 0)
Which NMS major version is deployed? On the Hetzner box: npm ls node-media-server / check the app's package.json. Quick proof it serves HLS (⇒ v2.x): curl -I https://stream.moafunk.de/live/stream-io/index.m3u8 → 200. Pin it; never npm install blindly (a fresh install pulls v4 = FLV-only = breaks index.m3u8).
Realistic peak concurrent-listener count? Tens vs thousands decides whether single-VPS-direct is fine or a CDN/second-relay tier is mandatory (the single VPS is the dominant reliability risk; no relay choice fixes it alone).
Confirm iOS reach requirement holds (Icecast MP3/AAC plays in a native <audio> tag, so this gets simpler, not harder).
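To make the "tens vs thousands" question concrete, here is a rough back-of-the-envelope sketch of relay egress. It assumes every listener receives the full stream bitrate over one persistent connection and ignores HTTP/TCP overhead. The function names and the 192 kbit/s and 1 Gbit/s figures used in the note below are illustrative assumptions, not measured values.

```rust
/// Rough egress estimate for a direct (no CDN) relay: each listener holds one
/// connection that receives the full stream bitrate.
/// `bitrate_kbps` is the per-listener bitrate in kilobits per second.
/// Protocol overhead is ignored, so treat the result as a lower bound.
fn egress_mbit_per_s(listeners: u64, bitrate_kbps: u64) -> f64 {
    (listeners * bitrate_kbps) as f64 / 1000.0
}

/// Fraction of an uplink consumed at a given listener count.
/// `uplink_mbit_per_s` is whatever the VPS actually provides (an assumption
/// to be replaced with the real figure from the hosting plan).
fn uplink_utilisation(listeners: u64, bitrate_kbps: u64, uplink_mbit_per_s: f64) -> f64 {
    egress_mbit_per_s(listeners, bitrate_kbps) / uplink_mbit_per_s
}
```

With these assumptions, 50 listeners at 192 kbit/s need about 9.6 Mbit/s, while 5,000 listeners need about 960 Mbit/s, which would nearly saturate a hypothetical 1 Gbit/s uplink. That gap is why the peak-listener answer decides whether a CDN or second relay tier is needed.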
Phase 1 — Harden recording (HIGHEST VALUE, relay-independent — do this first)
This is independent of any relay decision and is the highest-leverage reliability work. Touch src/stream_bridge.rs, src/handlers/recording.rs, src/storage.rs.
Make recording automatic on go-live (not an opt-in frontend call). Start the recording tee whenever a stream starts; stop+upload whenever it ends. Decouple from the browser lifecycle so a dropped tab doesn't end the archive.
Replace the single WebM tee with the ffmpeg segment muxer → MPEG-TS: -f segment -segment_time <e.g. 10> -segment_format mpegts -reset_timestamps 1 out_%05d.ts. A crash then loses ≤1 segment instead of the whole file. (Audio-only = every frame is effectively a keyframe, so there is no alignment caveat.) Concat-demux at finalize.
Switch the R2 upload from single-shot put_object to S3 multipart (resumable, per-part retry, abort-on-failure). Derive the part size from the expected maximum object size: divide by a part count below the 10,000-part limit and apply a floor of ~16 MiB. Add a lifecycle rule to auto-abort stale multipart uploads (R2 default 7 days). Respect R2's 1-write/sec/key limit (deterministic keys + abort-then-restart on retry).
R2 CHECKSUM LANDMINE (must-do): aws-sdk-s3 default checksums (CRC32) have broken R2 uploads in the field (early 2025). Pin a known-good SDK version, set an R2-supported algorithm EXPLICITLY (SHA-256 or CRC32C) at CreateMultipartUpload, and integration-test against a real R2 bucket. Do not assume CRC64-NVME works through the Rust SDK against R2; validate it.
Verify before delete: after upload, HEAD for size + checksum and ffprobe duration vs scheduled show length. (A recorder that died after 2 min still yields a valid small object that passes naive checks.) Keep raw .ts until the concatenated artifact is verified in R2.
Stop swallowing write failures (stream_bridge.rs ~line 181): surface disk-full / write errors (status flag + log + alert) instead of warn! + Ok.
Dead-man's-switch: per-show check "object present in R2 with plausible duration + bitrate" → Telegram alert (src/telegram_notify.rs) if a scheduled show did not produce a recording. Test the switch on a schedule.
Cleanup job for orphaned ./data/recordings-temp/ files older than ~1 day.
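The part-size rule for the multipart upload can be sketched as a small pure function. This is a minimal sketch, not the code in src/storage.rs: the function names and the `TARGET_PARTS` headroom value are assumptions, while the 10,000-part limit is the standard S3 multipart limit and the 16 MiB floor comes from the plan above.

```rust
const MIB: u64 = 1024 * 1024;
/// S3-compatible multipart uploads allow at most 10,000 parts.
const MAX_PARTS: u64 = 10_000;
/// Floor taken from the plan (~16 MiB), comfortably above the 5 MiB S3 minimum.
const PART_FLOOR: u64 = 16 * MIB;
/// Hypothetical headroom: aim below the hard limit so a recording that runs
/// longer than expected does not hit part 10,001.
const TARGET_PARTS: u64 = 9_000;

/// Integer division rounding up.
fn ceil_div(a: u64, b: u64) -> u64 {
    (a + b - 1) / b
}

/// Uniform part size for an upload expected to be at most `expected_max_bytes`.
/// The result is rounded up to a whole MiB and never drops below the floor.
fn part_size(expected_max_bytes: u64) -> u64 {
    let needed = ceil_div(expected_max_bytes, TARGET_PARTS);
    let size = (ceil_div(needed, MIB) * MIB).max(PART_FLOOR);
    debug_assert!(ceil_div(expected_max_bytes, size) <= MAX_PARTS);
    size
}

/// Number of parts an object of `total_bytes` needs at the given part size.
fn part_count(total_bytes: u64, part_size: u64) -> u64 {
    ceil_div(total_bytes, part_size).max(1)
}
```

For scale: a 3-hour show at 192 kbit/s is about 259,200,000 bytes, which the floor turns into 16 parts of 16 MiB. The division only starts to matter for objects above roughly 140 GiB, so for audio the floor is what decides the part size in practice.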
Acceptance: kill ffmpeg mid-show (kill -9) → all but the last segment survive, finalize produces a valid full-length file in R2, and a deliberately-skipped recording fires a Telegram alert. Integration test runs against a real R2 bucket and passes the checksum + duration checks.
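The finalize and verify steps behind this acceptance test could look roughly like the sketch below. It builds the input list for ffmpeg's concat demuxer (one `file '<path>'` line per segment) and compares a measured duration against the scheduled show length. The names `concat_list`, `check_duration`, and `Verdict`, as well as the tolerance value, are hypothetical; the measured duration is assumed to come from ffprobe.

```rust
use std::time::Duration;

/// Build the list file for ffmpeg's concat demuxer from segment file names.
/// Zero-padded names such as out_00000.ts sort correctly as plain strings.
/// Paths are assumed not to contain single quotes (true for out_%05d.ts).
fn concat_list(segments: &[&str]) -> String {
    let mut sorted: Vec<&str> = segments.to_vec();
    sorted.sort();
    sorted.iter().map(|s| format!("file '{}'\n", s)).collect()
}

#[derive(Debug, PartialEq)]
enum Verdict {
    /// Duration is within tolerance of the scheduled length.
    Plausible,
    /// A valid but truncated recording, e.g. a recorder that died early.
    TooShort { missing: Duration },
    /// Nothing usable was recorded.
    Empty,
}

/// Compare the recorded duration with the scheduled show length.
/// `tolerance` absorbs late starts and early endings.
fn check_duration(recorded: Duration, scheduled: Duration, tolerance: Duration) -> Verdict {
    if recorded.is_zero() {
        return Verdict::Empty;
    }
    let minimum = scheduled.saturating_sub(tolerance);
    if recorded < minimum {
        // recorded < minimum <= scheduled, so this subtraction cannot underflow.
        Verdict::TooShort { missing: scheduled - recorded }
    } else {
        Verdict::Plausible
    }
}
```

The list would then be fed to something like `ffmpeg -f concat -safe 0 -i list.txt -c copy <output>`. Only a `Plausible` verdict on the uploaded artifact should allow the raw segments to be deleted; `TooShort` and `Empty` are the cases the dead-man's-switch is meant to alert on.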
Phase 2 — Stand up Icecast-KH + Liquidsoap (also satisfies #2 "test instance")
Run on the Hetzner box in parallel with NMS, behind a flag. The staging mount is the server-side test instance.
Deploy Icecast-KH (not stock Xiph Icecast — stock froze ~3k listeners in benchmarks; KH ~30k). Containerized, on its own port.
Deploy Liquidsoap 2.4.4 (skip 2.4.3 — shared-encoder crash). Playout: fallback(track_sensitive=false, [live, playlist, single]) so dead air auto-fills. Apply blank.strip to the live branch ONLY; do NOT pair blank.strip with mksafe on the same source (documented CPU-spike/"catchup" breakups, liquidsoap #3439/#3474) — make the whole fallback safe by ending the list with a single.
Two mounts: /live (production) and /test (broadcaster-only preview), each with separate source credentials. This replaces the hardcoded stream-io key and gives per-namespace auth.
Change the producer: in src/stream_bridge.rs, make the ffmpeg output target configurable (env/config, not hardcoded) and add an Icecast output: … -vn -content_type audio/mpeg -f mp3 icecast://source:<pw>@host:port/<mount> (or -f opus / -f adts, with -content_type changed to match). Keep the recording tee branch independent of the push-target branch: a "go live" (test→live) switch must NOT restart or disturb the tee, or it punches a gap in the recording (threatens requirement #3).
Codec: Opus primary mount (best quality/bitrate, browser-native) + an MP3 fallback mount for legacy players.
TLS / mixed content: the public site is HTTPS (GitHub Pages), so a plain http:// mount is blocked as mixed content. Terminate HTTPS at nginx/Caddy in front of Icecast. (This is mandatory, not optional.)
Broadcaster preview player pointed at /test (plain <audio src>), visible only to the broadcaster.
Acceptance: broadcaster can push to /test, hear themselves via the preview player, and curl https://<host>/status-json.xsl shows the listeners count move — all without touching /live.
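One way the configurable push target might be modelled is sketched below. The `PushTarget` type and the environment variable names (`STREAM_PUSH_TARGET`, `ICECAST_HOST`, and so on) are hypothetical; the ffmpeg flags and the RTMP defaults are the ones quoted in this issue. Joining the RTMP URL and key with a slash is an assumption based on the /live/stream-io path the frontend pulls.

```rust
#[derive(Debug, Clone, PartialEq)]
enum PushTarget {
    Rtmp { url: String, stream_key: String },
    Icecast { host: String, port: u16, mount: String, password: String },
}

fn strs(xs: &[&str]) -> Vec<String> {
    xs.iter().map(|s| s.to_string()).collect()
}

impl PushTarget {
    /// Build a target from a key lookup. In production this would be called
    /// with `|k| std::env::var(k).ok()`; tests can pass a closure instead.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        match get("STREAM_PUSH_TARGET").as_deref() {
            None | Some("rtmp") => Ok(PushTarget::Rtmp {
                url: get("RTMP_URL").unwrap_or_else(|| "rtmp://stream.moafunk.de/live".to_string()),
                stream_key: get("RTMP_STREAM_KEY").unwrap_or_else(|| "stream-io".to_string()),
            }),
            Some("icecast") => {
                let port = get("ICECAST_PORT")
                    .ok_or("ICECAST_PORT is required")?
                    .parse::<u16>()
                    .map_err(|e| format!("ICECAST_PORT: {e}"))?;
                Ok(PushTarget::Icecast {
                    host: get("ICECAST_HOST").ok_or("ICECAST_HOST is required")?,
                    port,
                    mount: get("ICECAST_MOUNT").ok_or("ICECAST_MOUNT is required")?,
                    password: get("ICECAST_SOURCE_PASSWORD")
                        .ok_or("ICECAST_SOURCE_PASSWORD is required")?,
                })
            }
            Some(other) => Err(format!("unknown STREAM_PUSH_TARGET: {other}")),
        }
    }

    /// ffmpeg output arguments for the push branch only. The recording tee is
    /// deliberately not represented here, so switching targets cannot touch it.
    fn output_args(&self) -> Vec<String> {
        match self {
            PushTarget::Rtmp { url, stream_key } => {
                let mut args = strs(&["-c:a", "aac", "-b:a", "192k", "-ar", "48000", "-ac", "2", "-f", "flv"]);
                args.push(format!("{url}/{stream_key}"));
                args
            }
            PushTarget::Icecast { host, port, mount, password } => {
                // The password is assumed to be URL-safe; encode it otherwise.
                let mount = mount.trim_start_matches('/');
                let mut args = strs(&["-vn", "-content_type", "audio/mpeg", "-f", "mp3"]);
                args.push(format!("icecast://source:{password}@{host}:{port}/{mount}"));
                args
            }
        }
    }
}
```

Because the mount is just a field, the same process can be pointed at /test for the soak run and at /live for production by changing configuration only. Whether a test→live flip can happen without restarting the push branch depends on how the ffmpeg process is structured, which this sketch does not address.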
Phase 3 — Cutover (parallel-run → flip)
Soak-test a full-length show on /test on a real phone over cellular while NMS still serves production.
Flip the ffmpeg push target test→live and switch the frontend player from HLS/FLV to the Icecast mount (frontend/src/config.ts, frontend/src/streamDetector.ts, frontend/src/main.ts — the HLS HEAD-poll status check becomes a /status-json.xsl / mount probe).
Keep NMS warm for fast rollback for a few shows.
Acceptance: a real show runs end-to-end on Icecast (desktop + iOS), recording lands in R2 verified, NMS can be re-enabled in <5 min if needed.
Phase 4 — Observability, alerting & SPOF
Listener/quality telemetry → Rust backend (priority #5): poll Icecast GET /status-json.xsl → per-mount listeners (a genuine concurrent count, which HLS structurally can't give), audio_bitrate, samplerate, channels. Handle icestats.source being an object (1 mount) vs array (N mounts). Use ffprobe -print_format json -show_streams <mount> for codec/bitrate. Poll every 10–15 s and cache in the backend; surface to the admin SPA + a new /api/stream/metrics.
Prometheus + Grafana + Alertmanager + Blackbox. Drop in markuslindenberg/icecast_exporter (icecast_listeners on :9146) for history with near-zero code. Bind all admin/metrics ports to 127.0.0.1 — never expose publicly. Don't scrape per-client endpoints (cardinality blow-up).
Alerts: Blackbox probe_success == 0 for 5m on the public stream URL (relay-down, independent of server metrics) + absent(<listeners metric>) == 1. CRITICAL: do NOT use or on() vector(0) inside an alert rule: vector(0) is always present and < 1, so it fires permanently; that trick is for Grafana panels only. Keep "stream-down" and "zero-listeners" as two separate alerts, or gate zero-listeners on a "show-scheduled" signal. Add recording-failure alerts: r2_upload_failures_total > 0, recording byte-rate stalls while stream up, ffmpeg/tee restarts.
Dead-air detection: byte/listener metrics stay nominal during silence — add a loudness/silence probe (Liquidsoap blank.strip already covers the audio path; consider an EBU-R128 tap for alerting).
Reliability hardening: run the relay under systemd Restart=always + hard memory cap (cgroup MemoryMax).
Kill the single-VPS SPOF: put Cloudflare/a CDN or a second relay in front of the public mount (higher-leverage than the relay choice itself). Scope this by the peak-listener answer from Phase 0.
Acceptance: admin dashboard shows live listener count + bitrate; stopping the stream fires exactly one "stream-down" alert (not a permanently-firing one); a missing recording fires an alert.
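The object-vs-array shape of icestats.source is easy to get wrong, so here is a minimal sketch of normalizing it before use. The types are hypothetical and hand-built here to stay dependency-free. In the real backend they would presumably be deserialized, for example with serde's `#[serde(untagged)]` on the enum. The `listenurl` field name and the "source key missing when no mount is live" case are assumptions to verify against the deployed Icecast-KH.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Mount {
    /// Full listen URL as reported by Icecast; it ends with the mount path.
    listenurl: String,
    /// Current concurrent listeners on this mount.
    listeners: u64,
}

/// The three shapes `icestats.source` can take.
#[derive(Debug, Clone, PartialEq)]
enum Sources {
    /// No `source` key at all (assumed to mean no mount is live).
    Absent,
    /// Exactly one mount: a bare object.
    One(Mount),
    /// Several mounts: an array.
    Many(Vec<Mount>),
}

impl Sources {
    /// Collapse all three shapes into a plain list.
    fn into_vec(self) -> Vec<Mount> {
        match self {
            Sources::Absent => Vec::new(),
            Sources::One(m) => vec![m],
            Sources::Many(ms) => ms,
        }
    }
}

/// Listener count for one mount, e.g. "/live".
/// `None` means the mount is not up, which is a different condition from
/// `Some(0)` (mount up, nobody listening).
fn listeners_for(mounts: &[Mount], mount_path: &str) -> Option<u64> {
    mounts
        .iter()
        .find(|m| m.listenurl.ends_with(mount_path))
        .map(|m| m.listeners)
}
```

Returning `Option<u64>` keeps "stream down" and "zero listeners" apart at the type level, which mirrors the two-separate-alerts advice above: `None` feeds the stream-down signal, `Some(0)` feeds the zero-listeners one.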
Gotchas / landmines (read before coding)
NMS v2→v4 HLS schism: a fresh npm install node-media-server pulls v4 (FLV-only) and breaks index.m3u8. Pin v2.x while NMS is still in the loop. (issue #669)
R2 + aws-sdk-s3 default checksums have broken uploads in the field — pin SDK + explicit algorithm + integration-test (see Phase 1).
blank.strip + mksafe on the same Liquidsoap source → CPU-spike breakups. Live branch only.
Go-live switch must not gap the recording — keep tee independent of push target.
HLS listener counts are inherently approximate (stateless segment GETs); Icecast's persistent connection gives a real count — one more reason to migrate.
Mixed content: an http:// stream won't play from the HTTPS frontend, so HTTPS-front the mount.
Latency tuning for audio: GOP/-keyint_min/-tune zerolatency are video-only no-ops for AAC/Opus; for audio the knobs are -hls_time/player live-edge (only relevant if you stay HLS-based — Icecast needs none).
Suggested decomposition (optional)
If this is too big for one PR, split into: (A) recording hardening [project::recording], (B) Icecast+Liquidsoap stack + test mount [project::Stream, project::Infrastructure], (C) cutover + frontend player, (D) observability + SPOF [project::Infrastructure]. Phase 1 (A) is independently shippable and the highest priority.
▶ START HERE: ship #169 first — Phase 1 (recording hardening) is relay-independent and the highest-value, independently-shippable work; do all of #169→#172 before touching the relay. The #168 spike can run in parallel and gates Phase 2 (#173) and Phase 4 SPOF scoping (#178). Execution order = issue-number order. Pick up work from the sub-issues, not this parent.
Subtickets (milestone: Stream rework: recording hardening + Icecast migration)
Phase 0 — discovery
Phase 1 — recording hardening (relay-independent, highest value, ship first)
Phase 2 — Icecast-KH + Liquidsoap stack + test mount
Phase 3 — cutover
Phase 4 — observability, alerting & SPOF