Stream rework: harden R2 recording + migrate relay node-media-server → Icecast-KH/Liquidsoap (audio-only) #164

## Goal

Make the live stream **stable, testable, always-recorded, low-latency, and observable**. This issue is the output of a full evaluation of the current `node-media-server` (NMS) setup against six priorities (ordered): **1) stability  2) server-side test instance  3) always-recorded to R2  4) low latency  5) live metrics  6) other**. It is written to be **executed cold** — a future session should be able to pick it up without re-doing the research.

**TL;DR verdict:** the ingest + recording *design* is sound, but **NMS is the weakest link for a stability-first audio platform**, and priorities **#1, #2, #3, #5** are effectively unmet today. Recommendation: **keep the Rust→R2 recorder, harden it, and migrate the public relay to Icecast-KH + Liquidsoap** (purpose-built for radio), parallel-run, never big-bang.

---

## Decisions already locked (do NOT re-litigate)

- **Audio-only.** No video now or planned. (Current ffmpeg pipeline is AAC audio, no video track.)
- **Keep the ingest leg** (browser → WebSocket → Rust → ffmpeg) and **keep the Rust→R2 recorder as the system of record** in every option.
- **Target relay = Icecast-KH + Liquidsoap.** AzuraCast was considered and rejected as default (heavy; its per-DJ-session recorder would *regress* requirement #3). SRS/MediaMTX only matter if video ever enters scope.
- **Migration is parallel-run behind a flag, soak-tested, then flip the ffmpeg push target.** A risky big-bang migration is itself a reliability regression.

---

## Current architecture (verified in source — `backend/`)

- **Ingest:** `src/stream_bridge.rs` `start_stream()` spawns `ffmpeg -f webm -i pipe:0 -c:a aac -b:a 192k -ar 48000 -ac 2 -f flv <rtmp>`. RTMP target + key are **hardcoded** in `src/config.rs` (`default_rtmp_url` = `rtmp://stream.moafunk.de/live`, `default_rtmp_stream_key` = `stream-io`).
- **Recording:** `src/stream_bridge.rs` `write_chunk()` (~line 167) tees the **raw browser Opus/WebM** to **one unbounded local file** (`./data/recordings-temp/recording_<show>_<ts>.webm`).
  - On stop it uploads to R2 `recordings/<show>/<ver>/raw.webm`; an optional ffmpeg `finalize` remixes HQ tracks → `final.mp3`.
  - **Write failures are swallowed** at ~line 181 (`warn!` only, returns `Ok`).
  - Recording is **opt-in + client-coupled**: it is started from the frontend flow (`frontend/src/admin/pages/StreamPage.vue:20`, `flow/FlowOnAir.vue:193`, `flow/FlowWaiting.vue:194`) via `POST /api/recording/start`, so if the browser tab/WS drops, recording stops with it.
- **R2/S3 client:** `src/storage.rs` (aws-sdk-s3, `force_path_style`, region `auto`). Recording upload is **single-shot `put_object`**; multipart exists only for admin manual uploads >100 MB (`src/handlers/upload_recording_chunked.rs`).
- **"Test":** `src/handlers/stream_test_ws.rs` is a **browser loopback** (buffers ~10 s, echoes audio back to the broadcaster). It **never touches the stream server / RTMP / real codec path**. There is **no server-side test path**.
- **Live data:** **none.** No listener count, bitrate, or stream health is fetched from NMS. Online/offline is inferred two ways: backend in-memory `StreamState` (lost on restart) and the **public frontend polling the HLS manifest with a HEAD every 8 s** (`frontend/src/main.ts`).
- **NMS is not in the repo** (no compose/config/webhooks). Because the frontend pulls `…/stream-io/index.m3u8`, the deployed NMS must be on the **legacy v2.x line** — **v4 dropped HLS entirely**.
- **Recording schema:** `recording_versions` table (status raw→finalizing→finalized/failed, `raw_key`/`markers_key`/`final_key`) and `shows.recording_key`/`recording_filename`. See `src/db.rs`.

---

## Scorecard (current setup)

| # | Requirement | Rating | Why |
|---|---|---|---|
| 1 | Stability / reliable | **Poor** | Single-maintainer Node process, unfixed OOM history, no supervision/failover, single-VPS SPOF. |
| 2 | Test before go-live | **Absent** | Browser loopback only; no server-side test mount/instance; one hardcoded prod key. |
| 3 | Always recorded → R2 | **Fragile** | Decoupled (good) but opt-in/client-coupled, single unbounded WebM (interrupted = unrecoverable), single-shot upload, silent write failures, no integrity/duration check, no alert if a show didn't record. |
| 4 | Low latency | **Good** | FLV path ~1–3 s; latency is not the problem. |
| 5 | Live metrics | **Poor** | Zero telemetry captured; no listener count at all. |
| 6 | Other | **Poor** | Hardcoded `stream-io` key = hijack risk, no rotation/per-broadcast auth; no alerting; TLS/mixed-content unaddressed. |

---

## Open questions to resolve FIRST (Phase 0)

- [ ] **Which NMS major version is deployed?** On the Hetzner box: `npm ls node-media-server` / check the app's `package.json`. Quick proof it serves HLS (⇒ v2.x): `curl -I https://stream.moafunk.de/live/stream-io/index.m3u8` → 200. **Pin it; never `npm install` blindly** (a fresh install pulls v4 = FLV-only = breaks `index.m3u8`).
- [ ] **Realistic peak concurrent-listener count?** Tens vs thousands decides whether single-VPS-direct is fine or a **CDN/second-relay tier is mandatory** (the single VPS is the dominant reliability risk; no relay choice fixes it alone).
- [ ] Confirm **iOS reach** requirement holds (Icecast MP3/AAC plays in a native `<audio>` tag, so this gets *simpler*, not harder).

---

## Phase 1 — Harden recording (HIGHEST VALUE, relay-independent — do this first)

This is independent of any relay decision and is the highest-leverage reliability work. Touch `src/stream_bridge.rs`, `src/handlers/recording.rs`, `src/storage.rs`.

- [ ] **Make recording automatic on go-live** (not an opt-in frontend call). Start the recording tee whenever a stream starts; stop+upload whenever it ends. Decouple from the browser lifecycle so a dropped tab doesn't end the archive.
- [ ] **Replace the single WebM tee with the ffmpeg segment muxer → MPEG-TS:** `-f segment -segment_time <e.g.10> -segment_format mpegts -reset_timestamps 1 out_%05d.ts`. A crash then loses **≤1 segment** instead of the whole file. (Audio-only = every frame effectively a keyframe, no alignment caveat.) Concat-demux at finalize.
- [ ] **Switch the R2 upload from single-shot `put_object` to S3 multipart** (resumable, per-part retry, abort-on-failure). Derive the part size from the expected maximum object size: divide by a part count below the 10,000-part limit and apply a floor of ~16 MiB. Add a **lifecycle rule to auto-abort stale multipart uploads** (R2 default 7 days). Respect R2's **1-write/sec/key** limit (deterministic keys + abort-then-restart on retry).
- [ ] **R2 CHECKSUM LANDMINE (must-do):** aws-sdk-s3 **default checksums (CRC32) have broken R2 uploads in the field** (early-2025). **Pin a known-good SDK version, set an R2-supported algorithm EXPLICITLY (SHA-256 or CRC32C) at `CreateMultipartUpload`, and integration-test against a real R2 bucket.** Do not assume CRC64-NVME works through the Rust SDK against R2 — validate.
- [ ] **Verify before delete:** after upload, `HEAD` for size+checksum **and** `ffprobe` duration vs scheduled show length. (A recorder that died after 2 min still yields a valid small object that passes naive checks.) Keep raw `.ts` until the concatenated artifact is verified in R2.
- [ ] **Stop swallowing recording write failures** (`stream_bridge.rs` ~line 181): surface disk-full / write errors (status flag + log + alert) instead of `warn!`+`Ok`.
- [ ] **Dead-man's-switch:** per-show check "object present in R2 with plausible duration + bitrate" → **Telegram alert** (`src/telegram_notify.rs`) if a scheduled show did **not** produce a recording. Test the switch on a schedule.
- [ ] **Cleanup job** for orphaned `./data/recordings-temp/` files older than ~1 day.
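
The part-size rule from the multipart item above can be sketched as plain arithmetic. This is a minimal sketch, not the recorder's actual code: `TARGET_PARTS`, `MIN_PART_SIZE`, `part_size` and `part_count` are hypothetical names, and the 9,000 target is an arbitrary headroom choice below the 10,000-part ceiling of S3-compatible multipart uploads.

```rust
/// Assumed headroom target, deliberately below the 10,000-part ceiling.
const TARGET_PARTS: u64 = 9_000;
/// Floor from the checklist above (~16 MiB).
const MIN_PART_SIZE: u64 = 16 * 1024 * 1024;

/// Ceiling division without overflow for large inputs.
fn div_ceil(a: u64, b: u64) -> u64 {
    a / b + u64::from(a % b != 0)
}

/// Part size in bytes that keeps an object of `expected_max_bytes`
/// within TARGET_PARTS parts, never going below MIN_PART_SIZE.
fn part_size(expected_max_bytes: u64) -> u64 {
    div_ceil(expected_max_bytes, TARGET_PARTS).max(MIN_PART_SIZE)
}

/// Number of parts an upload of `total_bytes` needs (the last part may be short).
/// An empty object still counts as one part in this sketch.
fn part_count(total_bytes: u64, part_size: u64) -> u64 {
    div_ceil(total_bytes, part_size).max(1)
}
```

Under these assumptions a three-hour show at 192 kbit/s is about 260 MB before container overhead, so it stays on the 16 MiB floor and uploads in roughly 16 parts. The divisor only starts to matter for objects above roughly 140 GiB.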

**Acceptance:** kill `ffmpeg` mid-show (`kill -9`) → all but the last segment survive, finalize produces a valid full-length file in R2, and a deliberately-skipped recording fires a Telegram alert. Integration test runs against a real R2 bucket and passes the checksum + duration checks.
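
The duration half of the verify-before-delete check could look roughly like the sketch below. It assumes the backend captures the stdout of an `ffprobe` call that prints only the container duration in seconds. `Verdict`, `parse_ffprobe_duration`, `check_duration` and the tolerance value are hypothetical, not existing code.

```rust
use std::time::Duration;

/// Outcome of comparing a recording's length with the scheduled show length.
#[derive(Debug, PartialEq)]
enum Verdict {
    Plausible,
    TooShort { missing: Duration },
}

/// Parses the single number printed by, for example:
///   ffprobe -v error -show_entries format=duration \
///           -of default=noprint_wrappers=1:nokey=1 <file>
/// Returns None for empty, non-numeric, negative or non-finite output.
fn parse_ffprobe_duration(stdout: &str) -> Option<Duration> {
    let secs: f64 = stdout.trim().parse().ok()?;
    Duration::try_from_secs_f64(secs).ok()
}

/// A recording is plausible if it is at most `tolerance` shorter than scheduled.
/// The tolerance is an assumption; choose it from how punctually shows start.
fn check_duration(recorded: Duration, scheduled: Duration, tolerance: Duration) -> Verdict {
    let minimum = scheduled.saturating_sub(tolerance);
    if recorded >= minimum {
        Verdict::Plausible
    } else {
        // recorded < minimum <= scheduled, so this subtraction cannot underflow.
        Verdict::TooShort { missing: scheduled - recorded }
    }
}
```

A recorder that died after two minutes of a one-hour show yields `TooShort { missing: 58 min }` here, even though the uploaded object itself is valid and would pass a size or checksum comparison.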

---

## Phase 2 — Stand up Icecast-KH + Liquidsoap (also satisfies #2 "test instance")

Run on the Hetzner box **in parallel** with NMS, behind a flag. The staging mount **is** the server-side test instance.

- [ ] Deploy **Icecast-KH**, not stock Xiph Icecast: in benchmarks stock froze at ~3k listeners versus ~30k for KH. Containerized, on its own port.
- [ ] Deploy **Liquidsoap 2.4.4** (skip 2.4.3 — shared-encoder crash). Playout: `fallback(track_sensitive=false, [live, playlist, single])` so dead air auto-fills. Apply `blank.strip` to the **live branch ONLY**; **do NOT pair `blank.strip` with `mksafe` on the same source** (documented CPU-spike/"catchup" breakups, liquidsoap #3439/#3474) — make the whole fallback safe by ending the list with a `single`.
- [ ] **Two mounts:** `/live` (production) and `/test` (broadcaster-only preview), each with **separate source credentials**. This replaces the hardcoded `stream-io` key and gives per-namespace auth.
- [ ] **Change the producer:** in `src/stream_bridge.rs`, make the ffmpeg output target configurable (env/config, not hardcoded) and add an Icecast output: `… -vn -content_type audio/mpeg -f mp3 icecast://source:<pw>@host:port/<mount>` (or `-f opus`/`-f adts`, with a `-content_type` that matches the codec). **Keep the recording tee branch independent of the push-target branch:** a "go live" (test→live) switch must NOT restart or disturb the tee, or it punches a gap in the recording (threatens #3).
- [ ] **Codec:** Opus primary mount (best quality/bitrate, browser-native) + an **MP3 fallback mount** for legacy players.
- [ ] **TLS / mixed content:** the public site is HTTPS (GitHub Pages), so a plain `http://` mount is **blocked as mixed content**. Terminate HTTPS at **nginx/Caddy** in front of Icecast. (This is mandatory, not optional.)
- [ ] **Broadcaster preview player** pointed at `/test` (plain `<audio src>`), visible only to the broadcaster.

**Acceptance:** broadcaster can push to `/test`, hear themselves via the preview player, and `curl https://<host>/status-json.xsl` shows the `listeners` count move — all without touching `/live`.
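
One way to make the push target configurable is to keep it in a small value type that only produces the ffmpeg output arguments and knows nothing about the recording tee. The sketch below is an assumption-laden illustration: `IcecastTarget`, its fields and the `ICECAST_*` variable names are hypothetical. The flags mirror the command in the checklist above, and the password is assumed to be URL-safe (no percent-encoding is done).

```rust
/// Hypothetical description of one Icecast mount to push to.
#[derive(Debug, Clone, PartialEq)]
struct IcecastTarget {
    host: String,
    port: u16,
    mount: String,
    source_password: String,
}

impl IcecastTarget {
    /// Builds the target from any key lookup, e.g. `|k| std::env::var(k).ok()`.
    /// Returns None if a value is missing or the port is not a number.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Option<Self> {
        Some(Self {
            host: get("ICECAST_HOST")?,
            port: get("ICECAST_PORT")?.parse().ok()?,
            mount: get("ICECAST_MOUNT")?,
            source_password: get("ICECAST_SOURCE_PASSWORD")?,
        })
    }

    /// icecast://source:<pw>@host:port/<mount>
    fn url(&self) -> String {
        format!(
            "icecast://source:{}@{}:{}/{}",
            self.source_password,
            self.host,
            self.port,
            self.mount.trim_start_matches('/')
        )
    }

    /// Output-side ffmpeg arguments for an MP3 mount; input and codec
    /// arguments are left to the caller.
    fn ffmpeg_output_args(&self) -> Vec<String> {
        let mut args: Vec<String> = ["-vn", "-content_type", "audio/mpeg", "-f", "mp3"]
            .iter()
            .map(|s| s.to_string())
            .collect();
        args.push(self.url());
        args
    }
}
```

Switching from `/test` to `/live` then means building a second `IcecastTarget` and restarting only the push process with the new arguments, which leaves whatever writes the recording untouched.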

---

## Phase 3 — Cutover (parallel-run → flip)

- [ ] **Soak-test** a full-length show on `/test` on a **real phone over cellular** while NMS still serves production.
- [ ] Flip the **ffmpeg push target** test→live and switch the **frontend player** from HLS/FLV to the Icecast mount (`frontend/src/config.ts`, `frontend/src/streamDetector.ts`, `frontend/src/main.ts` — the HLS HEAD-poll status check becomes a `/status-json.xsl` / mount probe).
- [ ] Keep **NMS warm** for fast rollback for a few shows.

**Acceptance:** a real show runs end-to-end on Icecast (desktop + iOS), recording lands in R2 verified, NMS can be re-enabled in <5 min if needed.

---

## Phase 4 — Observability, alerting & SPOF

- [ ] **Listener/quality telemetry → Rust backend** (priority #5): poll Icecast `GET /status-json.xsl` → per-mount `listeners` (a **genuine** concurrent count, which HLS structurally can't give), `audio_bitrate`, `samplerate`, `channels`. **Handle `icestats.source` being an object (1 mount) vs array (N mounts).** `ffprobe -print_format json -show_streams <mount>` for codec/bitrate. Poll every 10–15 s and **cache in the backend**; surface to the admin SPA + a new `/api/stream/metrics`.
- [ ] **Prometheus + Grafana + Alertmanager + Blackbox.** Drop in `markuslindenberg/icecast_exporter` (`icecast_listeners` on :9146) for history with near-zero code. **Bind all admin/metrics ports to 127.0.0.1** — never expose publicly. Don't scrape per-client endpoints (cardinality blow-up).
- [ ] **Alerts:**
  - Blackbox `probe_success == 0 for 5m` on the public stream URL (relay-down, independent of server metrics) **plus** `absent(<listeners metric>) == 1`.
  - **CRITICAL: do NOT use `or on() vector(0)` inside an alert rule.** `vector(0)` is always present and `< 1`, so the rule fires permanently; that trick is for Grafana *panels* only.
  - Keep "stream-down" and "zero-listeners" as **two separate alerts**, or gate zero-listeners on a "show-scheduled" signal.
  - Add recording-failure alerts: `r2_upload_failures_total > 0`, recording byte-rate stalls while stream up, ffmpeg/tee restarts.
- [ ] **Dead-air detection:** byte/listener metrics stay nominal during silence — add a loudness/silence probe (Liquidsoap `blank.strip` already covers the audio path; consider an EBU-R128 tap for alerting).
- [ ] **Reliability hardening:** run the relay under **systemd `Restart=always` + hard memory cap** (cgroup `MemoryMax`).
- [ ] **Kill the single-VPS SPOF:** put **Cloudflare/a CDN or a second relay in front of the public mount** (higher-leverage than the relay choice itself). Scope this by the peak-listener answer from Phase 0.
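
The object-versus-array shape of `icestats.source` from the telemetry item above can be normalized once, right after decoding, so the rest of the backend only ever sees a list. The sketch below models the decoded value by hand to stay dependency-free. `Mount`, `Sources` and `listeners_for` are hypothetical names, and the `listenurl` field is assumed to be present in the status JSON. With serde, an `#[serde(untagged)]` enum is one possible way to decode into the same shape.

```rust
/// The subset of a mount entry this sketch cares about.
#[derive(Debug, Clone, PartialEq)]
struct Mount {
    listenurl: String,
    listeners: u64,
}

/// Shapes `icestats.source` is assumed to take: missing (no source
/// connected), a single object (1 mount), or an array (N mounts).
#[derive(Debug, Clone, PartialEq)]
enum Sources {
    Offline,
    One(Mount),
    Many(Vec<Mount>),
}

impl Sources {
    /// Collapses all three shapes into a plain list.
    fn into_vec(self) -> Vec<Mount> {
        match self {
            Sources::Offline => Vec::new(),
            Sources::One(mount) => vec![mount],
            Sources::Many(mounts) => mounts,
        }
    }
}

/// Listener count for the mount whose URL ends with `mount_suffix`.
/// None means the mount is not connected, which is different from Some(0).
fn listeners_for(sources: Sources, mount_suffix: &str) -> Option<u64> {
    sources
        .into_vec()
        .into_iter()
        .find(|m| m.listenurl.ends_with(mount_suffix))
        .map(|m| m.listeners)
}
```

Returning `Option<u64>` keeps "mount absent" and "zero listeners" distinguishable, which matches the two separate alerts called for above.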

**Acceptance:** admin dashboard shows live listener count + bitrate; stopping the stream fires exactly one "stream-down" alert (not a permanently-firing one); a missing recording fires an alert.

---

## Gotchas / landmines (read before coding)

- **NMS v2→v4 HLS schism:** a fresh `npm install node-media-server` pulls v4 (FLV-only) and **breaks `index.m3u8`**. Pin v2.x while NMS is still in the loop. ([issue #669](https://github.com/illuspas/Node-Media-Server/issues/669))
- **R2 + aws-sdk-s3 default checksums** have broken uploads in the field — pin SDK + explicit algorithm + integration-test (see Phase 1).
- **`blank.strip` + `mksafe`** on the same Liquidsoap source → CPU-spike breakups. Live branch only.
- **Go-live switch must not gap the recording** — keep tee independent of push target.
- **HLS listener counts are inherently approximate** (stateless segment GETs); Icecast's persistent connection gives a *real* count — one more reason to migrate.
- **Mixed content:** `http://` stream won't play from the HTTPS frontend — HTTPS-front the mount.
- **Latency tuning for audio:** GOP/`-keyint_min`/`-tune zerolatency` are **video-only no-ops** for AAC/Opus; for audio the knobs are `-hls_time`/player live-edge (only relevant if you stay HLS-based — Icecast needs none).

---

## References

- Icecast stats API: https://www.icecast.org/docs/icecast-trunk/server_stats/
- Liquidsoap: https://www.liquidsoap.info/
- icecast_exporter: https://github.com/markuslindenberg/icecast_exporter
- NMS HLS-removed-in-v4: https://github.com/illuspas/Node-Media-Server/issues/669
- Prometheus missing-series / alerting trap: https://promlabs.com/blog/2023/09/13/dealing-with-missing-time-series-in-prometheus/

## Suggested decomposition (optional)

If this is too big for one PR, split into: **(A) recording hardening** [`project::recording`], **(B) Icecast+Liquidsoap stack + test mount** [`project::Stream` `project::Infrastructure`], **(C) cutover + frontend player**, **(D) observability + SPOF** [`project::Infrastructure`]. Phase 1 (A) is independently shippable and the highest priority.


---

## Subtickets (milestone: *Stream rework: recording hardening + Icecast migration*)

> **▶ START HERE:** ship **#169** first — Phase 1 (recording hardening) is relay-independent and the highest-value, independently-shippable work; do all of #169→#172 before touching the relay. The **#168** spike can run in parallel and *gates* Phase 2 (#173) and Phase 4 SPOF scoping (#178). Execution order = issue-number order. Pick up work from the sub-issues, not this parent.

**Phase 0 — discovery**
- [ ] #168 — [spike] pin NMS version, peak concurrent listeners, iOS reach

**Phase 1 — recording hardening** (relay-independent, highest value, ship first)
- [ ] #169 — auto-record on go-live, decouple from browser, stop swallowing write failures
- [ ] #170 — crash-safe ffmpeg segment muxer (MPEG-TS) + concat finalize
- [ ] #171 — resumable R2 multipart upload + explicit checksum + verify-before-delete
- [ ] #172 — dead-man's-switch Telegram alert + temp-file cleanup job

**Phase 2 — Icecast-KH + Liquidsoap stack + test mount**
- [ ] #173 — Icecast-KH + Liquidsoap: /live + /test mounts, separate creds, TLS front
- [ ] #174 — configurable ffmpeg push target + Icecast output, tee kept independent
- [ ] #175 — broadcaster preview player pointed at /test

**Phase 3 — cutover**
- [ ] #176 — soak-test, flip push target test→live, switch public player HLS/FLV→Icecast

**Phase 4 — observability, alerting & SPOF**
- [ ] #177 — listener/quality telemetry → backend + /api/stream/metrics + admin display
- [ ] #178 — Prometheus/Grafana/Alertmanager/Blackbox + alerts + systemd/SPOF hardening


