Backend API Reliability — atomic locks, self-heal, hang-on-missing-file by wowa1990 · Pull Request #146 · splitti/MuPiBox

wowa1990 · 2026-05-16T16:58:36Z

⚠️ Depends on #151 (Daily Playtime Limit) — this PR's server.ts imports PlaytimeStatus from ./models/playtime.model, first added in #151. Merge #151 before this one.

Was es macht

Lock-Race in /api/add /api/delete /api/edit /api/addresume /api/deleteresume per zentralem acquireLock-Helper mit atomarem 'wx' + Stale-Lock-Recovery (AR5-6 / M8)
Self-Heal bei korruptem resume.json (Parse-Fail → Backup + Neustart mit leerem Array statt 500er)
/api/data, /api/mupihat, /api/wlan hängen sich nicht mehr auf, wenn die Quell-Datei fehlt
/api/rssfeed macht jetzt HEAD-Probe vor GET (verhindert Hang auf große Feeds)
/api/activeresume + /api/resume tolerant für fehlende Files
Resume-Page-Latenz: parallelisierte Auflösung statt sequentiell
Race-Window zwischen deleteresume und addresume durch eine in-Memory-Map abgedichtet (AR5-18)

Architektur (kurz)

Zentraler acquireLock(path, mode)-Helper in server.ts, der überall die Lockfile-Handhabung übernimmt. Stale-Lock-Recovery löst die Klasse von „Server hängt, wenn er beim Schreiben gekillt wurde". Resume-Map ist process-local und verschwindet bei Neustart — bewusst keine Persistenz.

Test in Codespaces / lokal

Branch auschecken, npm install
cp config/templates/www.json src/backend-api/config/config.json + monitor.json kopieren
npm run serve:backend-api starten
Konkrete Tests:
- Lösche monitor.json während Backend läuft → curl http://127.0.0.1:8200/api/mupihat → erwartet 200 mit Empty/Default-Payload, kein Hang
- Schreibe Müll in resume.json (echo "{not json" > resume.json) → Backend-Restart → erwartet Auto-Recovery + resume.json.bak.<ts>
- Parallel zwei curl -X POST /api/addresume mit gleicher Lib-ID feuern → erwartet ein-und-nur-eines committed, kein Lockfile-Leftover
- curl http://127.0.0.1:8200/api/rssfeed?url=http://httpstat.us/200?sleep=30000 → erwartet zügiger Timeout statt 30s Hang

Offene Punkte

Stale-Lock-Timeout aktuell 30s — Trade-off zwischen „echte Concurrent-Writer nicht stören" und „abgestürzten Server nicht ewig warten lassen". Falls dir 10s lieber wären: Konstante in acquireLock ist klar markiert.

…file; 500 instead of throw on /api/addwlan Three GET endpoints had `if (fs.existsSync(file)) { ...readFile... }` with no else branch, so when the file was missing the request just sat there with no response — the frontend's loading spinner spun forever and HTTP keep-alive eventually timed it out client-side. Mirror the /api/resume pattern (already fixed earlier): early-return [] (or {} for /api/mupihat which is an object) when the file is absent. /api/data hits this on a fresh box before active_data.json is generated; /api/mupihat hits it on every box without a MuPiHAT board; /api/wlan hits it before the first scan. /api/addwlan ran `jsonfile.writeFile(...)` and inside the node-style callback did `if (writeError) throw error` (note: throwing the OUTER `error` variable from the read step, not even the relevant one). Throws inside async callbacks bypass Express's error middleware and crash the backend-api process — the player goes offline until pm2 restarts it. Send a 500 with a log line instead, the same shape every other write endpoint already uses.

The unlink in /api/addresume and /api/editresume ran outside the async readFile callback, so the lock was released before the write completed — two concurrent calls could clobber each other. Move unlink into the writeFile callback (success and error paths) and short-circuit on lock acquisition failure. Dedup matched on `req.body.id` only, but Media.id is optional — Library and RSS items often have no id, so id-less entries either collapsed onto the first match or piled up duplicates. Replace with a stable composite key (type + playlistid|showid|audiobookid|id || artist::title) and use it in both endpoints. /api/editresume keeps a defensive index fallback for the edge case where the caller really has a usable index.

active_resume.json is a symlink set up by check_network.sh once the network state is known. In the window between boot and that script's first iteration the symlink doesn't exist yet — the previous code took neither branch and the response hung, leaving the resume page stuck on Loading. Return an empty array in that case, mirroring the existing read-error fallback.

Callers of /api/resume (resume.page lookup, addresume read pre-step) always expect an array. On a fresh box resume.json doesn't exist until check_network.sh creates it or the first save lands, and a 404 in that window crashed any caller doing .length / .findIndex on the response. Return an empty array — same convention as /api/data and (after the previous commit) /api/activeresume. Read errors now also degrade to [] instead of 500 to keep the contract uniform.

Same bug as the previous resume-lock fix: fs.unlink on dataLock ran outside the async readFile callback, so the lock was released before the write completed. Two concurrent admin saves could clobber each other. Generalised releaseResumeLock into releaseLock(lockPath, context) and rewrote the three data.json endpoints to release the lock from inside the writeFile callback (success and read-error paths), with explicit short-circuiting on lock-acquisition failure. Resume endpoints now use the same shared helper.

addresume/editresume bailed out with "error" on any read failure — both the ENOENT case (fresh box, never saved) and the JSON-parse case (one bad write or filesystem hiccup leaves the file unparseable). Either way, no further save would land until somebody manually fixed the file. Add readResumeOrRecover(): missing file is treated as []; corrupt file is renamed to resume.json.bak.<epoch> (so it can be inspected later) and the live save proceeds against an empty array. Losing one session of accumulated entries on rare corruption beats wedging the feature for the rest of the box's lifetime.

A "weiterhören" suggestion at 99% of the last track is just amusing — the kid finished the audiobook. Hook the existing mplayer playlist-finish event in the backend-player to POST /api/deleteresume to the backend-api with a minimal {type, artist, title} body; the new endpoint matches by composite key (same logic as addresume) and idempotently removes the entry. Body is best-effort — if the call fails, the resume entry just stays around, no playback impact. Spotify and RSS are intentionally skipped: Spotify has no clean end-of- album signal through the mplayer-wrapper, and currentMeta doesn't carry enough state for an RSS feed's composite key. Both can be revisited if they actually become a problem in practice.

server.ts had five copies of the same broken lock-acquire pattern: if (fs.existsSync(lockPath)) { send('locked'); return } try { fs.openSync(lockPath, 'w') } catch (err) { send('error'); return } Two problems: M8 — race between existsSync and openSync: another worker can win the race in between, both threads then "hold" the lock simultaneously. Plus 'w' opens with O_CREAT but TRUNCATES if the file already exists, so openSync never actually fails — the "lock" is just a marker file relying on existsSync + releaseLock cooperating. AR5-6 — pm2 mid-write crash leaves the lock file on disk forever. /api/add|edit|delete|addresume|deleteresume all return 'locked' until the box reboots. No automatic recovery path exists. Fix: single acquireLock(lockPath, context) helper that does atomic test-and-set with O_EXCL | O_CREAT (Node 'wx' flag). On EEXIST it checks the lock's mtime: older than LOCK_STALE_MS (30s) -> steal it once and retry; younger -> caller sees 'locked' as before. Returns discriminated 'acquired' | 'locked' | 'error' so callers route their responses cleanly. Plus a one-time start-up pass at the top of server.ts: any lock file older than LOCK_STALE_MS is removed before request handlers come online, with a warn log so the cleanup is visible in pm2 boot output. This catches the pm2-crash-restart case (which is the common AR5-6 trigger) without waiting for the next user request to notice the stale lock. Crashes that happen after start-up are handled by the in-band stale-lock recovery in acquireLock. All five call sites switched to the helper. No behaviour change for the happy path — same 'locked' / 'error' / 'ok' responses, same release pattern. tsc --noEmit clean.

… (AR5-18) mplayer playlist-finish fires backend-player -> POST /api/deleteresume, which removes the just-completed album from resume.json so the kid isn't offered "weiterhören" at second 59:58 of an album they just heard end-to-end. Simultaneously the frontend's player page observes the playback state going to paused/stopped and POSTs /api/addresume to save the last position. The file lock serialises the two writes, but their order is non-deterministic. When addresume wins after deleteresume, the entry is resurrected -- defeating the entire point of the playlist-finish cleanup. Fix: track recently-deleted composite keys in an in-memory Map keyed by resumeKeyOf(media). /api/deleteresume always notes the deletion (even on no-match -- the frontend's pending addresume comes either way). /api/addresume consults the map before insert and silently skips with 200 ok if the same key was deleted within 2500 ms. Map self-cleans on lookup (stale entries removed when checked) plus a periodic 60s sweep so a long-running backend doesn't accumulate dead keys forever. setInterval marked .unref() so the timer doesn't hold the event loop open during graceful shutdown. tsc --noEmit clean.

The Spotify SDK's underlying fetch and our embed-page fetch both lack a built-in timeout. A TCP-level stall (silent upstream) leaves the promise pending forever, the queueRequest pendingRequests entry never settles, and isProcessingQueue stays true — so every subsequent same-key request also hangs and the queue stops processing entirely. - spotify-api.service: wrap rateLimitedRequest in a 20s Promise.race timeout so SDK calls always settle. - spotify-media-info.service: pass AbortSignal.timeout(20000) to the embed-URL fetch to match. 20s is generous; the slowest legitimate response we observe is ~3-4s for an audiobook with hundreds of chapters.

Two-layer fix for the 24s resume-page freeze. Frontend (revert MED-10): media.service.ts gates getRssFeed back to non-resume entries only. The MED-10 (Phase 4) fix had dropped the gate to "enrich RSS resume entries with fresh feed data", but the `id` field of an RSS resume entry is the episode's *MP3 URL*, not the channel's RSS feed URL. The enrichment fetch streamed multi- megabyte MP3 audio through the rss-parser path until MED-2's 5MB size cap aborted with HTTP 413 after ~3.9s per entry. Six RSS resume entries on the resume page = ~24s of frozen UI. Every field needed to render the resume tile is already on disk; nothing to enrich for resume — just rebuild the gate. Backend (defence in depth): /api/rssfeed now probes with HEAD before the full GET. Native fetch (not ky — ky was silently failing on the 301-redirect chain in this codepath, so the probe never fired). Rejects with 415 on non-XML/RSS content-type or 413 if advertised content-length exceeds 5MB, both in <500ms. Falls through to GET on HEAD failure so origins that don't support HEAD still work. Catches future similar pattern breaks no matter which caller fires them. Squashed from: 9fc33bf8 e232eb2d

The resume page used a blind `media.reverse()` on the active-resume fetch, matching file insertion order. /api/addresume's update-in- place pattern leaves an entry's array index untouched when it's already present — so a freshly-played album never moved to position 1 unless it was new. User-visible: "letztes gehörtes Album steht nicht oben, Reihenfolge nicht nachvollziehbar". Backend (server.ts): - /api/addresume sets lastPlayedAt: Date.now() on every save. - backfillLastPlayedAt() helper synthesises stamps for legacy entries that pre-date the field. idx 0 → largest synthetic stamp, idx N → smallest (because update-in-place typically left the most-recently-replayed entry at idx 0; this places it at the top after DESC sort). Real Date.now() saves always beat synthetic stamps. - /api/{addresume,resume,activeresume} all run the helper — /addresume persists, the GET endpoints lazy-backfill in- memory so the visible order is correct from the first read after deploy, before any save fires. Frontend: - media.ts: add `lastPlayedAt?: number` to Media interface. - utils.ts: lastPlayedAt joins ExtraDataMedia merge keys (defence in depth — the spotify/audiobook/episode/playlist services don't actually call copyExtraMediaData today, but if a future caller does we want the field to flow through). - media.service.ts: the overwriteArtist operator (applied on every iif-branch's output in updateMedia) now also carries lastPlayedAt + isResume from the original item. THIS is the real fix — getMediaByID and siblings build a fresh Media from upstream Spotify API data without copyExtraMediaData, and would otherwise drop the field. - fetchActiveResumeData sorts DESC by lastPlayedAt instead of blind reverse(). Net: most-recently-played album lands at swiper position 1. Squashed from: c265a501 25305582 733eda39 fbfc3e97 fa10606c

wowa1990 added 12 commits May 16, 2026 15:48

wowa1990 mentioned this pull request Jun 6, 2026

Performance & Caching — pagination, cover-cache, in-mem-LRU, async writes #148

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backend API Reliability — atomic locks, self-heal, hang-on-missing-file#146

Backend API Reliability — atomic locks, self-heal, hang-on-missing-file#146
wowa1990 wants to merge 12 commits into
splitti:mainfrom
wowa1990:pr-backend-api-reliability

wowa1990 commented May 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wowa1990 commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Was es macht

Architektur (kurz)

Test in Codespaces / lokal

Offene Punkte

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wowa1990 commented May 16, 2026 •

edited

Loading