Skip to content

Backend API Reliability — atomic locks, self-heal, hang-on-missing-file#146

Open
wowa1990 wants to merge 12 commits into
splitti:mainfrom
wowa1990:pr-backend-api-reliability
Open

Backend API Reliability — atomic locks, self-heal, hang-on-missing-file#146
wowa1990 wants to merge 12 commits into
splitti:mainfrom
wowa1990:pr-backend-api-reliability

Conversation

@wowa1990

@wowa1990 wowa1990 commented May 16, 2026

Copy link
Copy Markdown

⚠️ Depends on #151 (Daily Playtime Limit) — this PR's server.ts imports PlaytimeStatus from ./models/playtime.model, first added in #151. Merge #151 before this one.

Was es macht

  • Lock-Race in /api/add /api/delete /api/edit /api/addresume /api/deleteresume per zentralem acquireLock-Helper mit atomarem 'wx' + Stale-Lock-Recovery (AR5-6 / M8)
  • Self-Heal bei korruptem resume.json (Parse-Fail → Backup + Neustart mit leerem Array statt 500er)
  • /api/data, /api/mupihat, /api/wlan hängen sich nicht mehr auf, wenn die Quell-Datei fehlt
  • /api/rssfeed macht jetzt HEAD-Probe vor GET (verhindert Hang auf große Feeds)
  • /api/activeresume + /api/resume tolerant für fehlende Files
  • Resume-Page-Latenz: parallelisierte Auflösung statt sequentiell
  • Race-Window zwischen deleteresume und addresume durch eine in-Memory-Map abgedichtet (AR5-18)

Architektur (kurz)

Zentraler acquireLock(path, mode)-Helper in server.ts, der überall die Lockfile-Handhabung übernimmt. Stale-Lock-Recovery löst die Klasse von „Server hängt, wenn er beim Schreiben gekillt wurde". Resume-Map ist process-local und verschwindet bei Neustart — bewusst keine Persistenz.

Test in Codespaces / lokal

  1. Branch auschecken, npm install
  2. cp config/templates/www.json src/backend-api/config/config.json + monitor.json kopieren
  3. npm run serve:backend-api starten
  4. Konkrete Tests:
    • Lösche monitor.json während Backend läuft → curl http://127.0.0.1:8200/api/mupihat → erwartet 200 mit Empty/Default-Payload, kein Hang
    • Schreibe Müll in resume.json (echo "{not json" > resume.json) → Backend-Restart → erwartet Auto-Recovery + resume.json.bak.<ts>
    • Parallel zwei curl -X POST /api/addresume mit gleicher Lib-ID feuern → erwartet ein-und-nur-eines committed, kein Lockfile-Leftover
    • curl http://127.0.0.1:8200/api/rssfeed?url=http://httpstat.us/200?sleep=30000 → erwartet zügiger Timeout statt 30s Hang

Offene Punkte

  • Stale-Lock-Timeout aktuell 30s — Trade-off zwischen „echte Concurrent-Writer nicht stören" und „abgestürzten Server nicht ewig warten lassen". Falls dir 10s lieber wären: Konstante in acquireLock ist klar markiert.

wowa1990 added 12 commits May 16, 2026 15:48
…file; 500 instead of throw on /api/addwlan

Three GET endpoints had `if (fs.existsSync(file)) { ...readFile... }`
with no else branch, so when the file was missing the request just
sat there with no response — the frontend's loading spinner spun
forever and HTTP keep-alive eventually timed it out client-side.
Mirror the /api/resume pattern (already fixed earlier): early-return
[] (or {} for /api/mupihat which is an object) when the file is
absent. /api/data hits this on a fresh box before active_data.json is
generated; /api/mupihat hits it on every box without a MuPiHAT board;
/api/wlan hits it before the first scan.

/api/addwlan ran `jsonfile.writeFile(...)` and inside the node-style
callback did `if (writeError) throw error` (note: throwing the OUTER
`error` variable from the read step, not even the relevant one).
Throws inside async callbacks bypass Express's error middleware and
crash the backend-api process — the player goes offline until pm2
restarts it. Send a 500 with a log line instead, the same shape every
other write endpoint already uses.
The unlink in /api/addresume and /api/editresume ran outside the async
readFile callback, so the lock was released before the write completed —
two concurrent calls could clobber each other. Move unlink into the
writeFile callback (success and error paths) and short-circuit on lock
acquisition failure.

Dedup matched on `req.body.id` only, but Media.id is optional — Library
and RSS items often have no id, so id-less entries either collapsed onto
the first match or piled up duplicates. Replace with a stable composite
key (type + playlistid|showid|audiobookid|id || artist::title) and use it
in both endpoints. /api/editresume keeps a defensive index fallback for
the edge case where the caller really has a usable index.
active_resume.json is a symlink set up by check_network.sh once the
network state is known. In the window between boot and that script's
first iteration the symlink doesn't exist yet — the previous code took
neither branch and the response hung, leaving the resume page stuck on
Loading. Return an empty array in that case, mirroring the existing
read-error fallback.
Callers of /api/resume (resume.page lookup, addresume read pre-step)
always expect an array. On a fresh box resume.json doesn't exist until
check_network.sh creates it or the first save lands, and a 404 in that
window crashed any caller doing .length / .findIndex on the response.
Return an empty array — same convention as /api/data and (after the
previous commit) /api/activeresume. Read errors now also degrade to []
instead of 500 to keep the contract uniform.
Same bug as the previous resume-lock fix: fs.unlink on dataLock ran
outside the async readFile callback, so the lock was released before the
write completed. Two concurrent admin saves could clobber each other.

Generalised releaseResumeLock into releaseLock(lockPath, context) and
rewrote the three data.json endpoints to release the lock from inside
the writeFile callback (success and read-error paths), with explicit
short-circuiting on lock-acquisition failure. Resume endpoints now use
the same shared helper.
addresume/editresume bailed out with "error" on any read failure — both
the ENOENT case (fresh box, never saved) and the JSON-parse case (one
bad write or filesystem hiccup leaves the file unparseable). Either way,
no further save would land until somebody manually fixed the file.

Add readResumeOrRecover(): missing file is treated as []; corrupt file
is renamed to resume.json.bak.<epoch> (so it can be inspected later) and
the live save proceeds against an empty array. Losing one session of
accumulated entries on rare corruption beats wedging the feature for
the rest of the box's lifetime.
A "weiterhören" suggestion at 99% of the last track is just amusing —
the kid finished the audiobook. Hook the existing mplayer playlist-finish
event in the backend-player to POST /api/deleteresume to the backend-api
with a minimal {type, artist, title} body; the new endpoint matches by
composite key (same logic as addresume) and idempotently removes the
entry. Body is best-effort — if the call fails, the resume entry just
stays around, no playback impact.

Spotify and RSS are intentionally skipped: Spotify has no clean end-of-
album signal through the mplayer-wrapper, and currentMeta doesn't carry
enough state for an RSS feed's composite key. Both can be revisited if
they actually become a problem in practice.
server.ts had five copies of the same broken lock-acquire pattern:

  if (fs.existsSync(lockPath)) { send('locked'); return }
  try { fs.openSync(lockPath, 'w') }
  catch (err) { send('error'); return }

Two problems:

M8 — race between existsSync and openSync: another worker can win the
   race in between, both threads then "hold" the lock simultaneously.
   Plus 'w' opens with O_CREAT but TRUNCATES if the file already
   exists, so openSync never actually fails — the "lock" is just a
   marker file relying on existsSync + releaseLock cooperating.

AR5-6 — pm2 mid-write crash leaves the lock file on disk forever.
   /api/add|edit|delete|addresume|deleteresume all return 'locked'
   until the box reboots. No automatic recovery path exists.

Fix: single acquireLock(lockPath, context) helper that does atomic
test-and-set with O_EXCL | O_CREAT (Node 'wx' flag). On EEXIST it
checks the lock's mtime: older than LOCK_STALE_MS (30s) -> steal it
once and retry; younger -> caller sees 'locked' as before. Returns
discriminated 'acquired' | 'locked' | 'error' so callers route their
responses cleanly.

Plus a one-time start-up pass at the top of server.ts: any lock file
older than LOCK_STALE_MS is removed before request handlers come
online, with a warn log so the cleanup is visible in pm2 boot
output. This catches the pm2-crash-restart case (which is the
common AR5-6 trigger) without waiting for the next user request to
notice the stale lock. Crashes that happen after start-up are
handled by the in-band stale-lock recovery in acquireLock.

All five call sites switched to the helper. No behaviour change for
the happy path — same 'locked' / 'error' / 'ok' responses, same
release pattern. tsc --noEmit clean.
… (AR5-18)

mplayer playlist-finish fires backend-player -> POST /api/deleteresume,
which removes the just-completed album from resume.json so the kid
isn't offered "weiterhören" at second 59:58 of an album they just
heard end-to-end. Simultaneously the frontend's player page observes
the playback state going to paused/stopped and POSTs /api/addresume
to save the last position.

The file lock serialises the two writes, but their order is
non-deterministic. When addresume wins after deleteresume, the entry
is resurrected -- defeating the entire point of the playlist-finish
cleanup.

Fix: track recently-deleted composite keys in an in-memory Map keyed
by resumeKeyOf(media). /api/deleteresume always notes the deletion
(even on no-match -- the frontend's pending addresume comes either
way). /api/addresume consults the map before insert and silently
skips with 200 ok if the same key was deleted within 2500 ms.

Map self-cleans on lookup (stale entries removed when checked) plus
a periodic 60s sweep so a long-running backend doesn't accumulate
dead keys forever. setInterval marked .unref() so the timer doesn't
hold the event loop open during graceful shutdown.

tsc --noEmit clean.
The Spotify SDK's underlying fetch and our embed-page fetch both lack
a built-in timeout. A TCP-level stall (silent upstream) leaves the
promise pending forever, the queueRequest pendingRequests entry never
settles, and isProcessingQueue stays true — so every subsequent
same-key request also hangs and the queue stops processing entirely.

- spotify-api.service: wrap rateLimitedRequest in a 20s Promise.race
  timeout so SDK calls always settle.
- spotify-media-info.service: pass AbortSignal.timeout(20000) to the
  embed-URL fetch to match.

20s is generous; the slowest legitimate response we observe is ~3-4s
for an audiobook with hundreds of chapters.
Two-layer fix for the 24s resume-page freeze.

Frontend (revert MED-10): media.service.ts gates getRssFeed back to
non-resume entries only. The MED-10 (Phase 4) fix had dropped the
gate to "enrich RSS resume entries with fresh feed data", but the
`id` field of an RSS resume entry is the episode's *MP3 URL*, not
the channel's RSS feed URL. The enrichment fetch streamed multi-
megabyte MP3 audio through the rss-parser path until MED-2's 5MB
size cap aborted with HTTP 413 after ~3.9s per entry. Six RSS
resume entries on the resume page = ~24s of frozen UI. Every
field needed to render the resume tile is already on disk; nothing
to enrich for resume — just rebuild the gate.

Backend (defence in depth): /api/rssfeed now probes with HEAD
before the full GET. Native fetch (not ky — ky was silently failing
on the 301-redirect chain in this codepath, so the probe never
fired). Rejects with 415 on non-XML/RSS content-type or 413 if
advertised content-length exceeds 5MB, both in <500ms. Falls
through to GET on HEAD failure so origins that don't support HEAD
still work. Catches future similar pattern breaks no matter which
caller fires them.

Squashed from: 9fc33bf8 e232eb2d
The resume page used a blind `media.reverse()` on the active-resume
fetch, matching file insertion order. /api/addresume's update-in-
place pattern leaves an entry's array index untouched when it's
already present — so a freshly-played album never moved to position
1 unless it was new. User-visible: "letztes gehörtes Album steht
nicht oben, Reihenfolge nicht nachvollziehbar".

Backend (server.ts):
  - /api/addresume sets lastPlayedAt: Date.now() on every save.
  - backfillLastPlayedAt() helper synthesises stamps for legacy
    entries that pre-date the field. idx 0 → largest synthetic
    stamp, idx N → smallest (because update-in-place typically
    left the most-recently-replayed entry at idx 0; this places
    it at the top after DESC sort). Real Date.now() saves always
    beat synthetic stamps.
  - /api/{addresume,resume,activeresume} all run the helper —
    /addresume persists, the GET endpoints lazy-backfill in-
    memory so the visible order is correct from the first read
    after deploy, before any save fires.

Frontend:
  - media.ts: add `lastPlayedAt?: number` to Media interface.
  - utils.ts: lastPlayedAt joins ExtraDataMedia merge keys (defence
    in depth — the spotify/audiobook/episode/playlist services
    don't actually call copyExtraMediaData today, but if a future
    caller does we want the field to flow through).
  - media.service.ts: the overwriteArtist operator (applied on
    every iif-branch's output in updateMedia) now also carries
    lastPlayedAt + isResume from the original item. THIS is the
    real fix — getMediaByID and siblings build a fresh Media from
    upstream Spotify API data without copyExtraMediaData, and
    would otherwise drop the field.
  - fetchActiveResumeData sorts DESC by lastPlayedAt instead of
    blind reverse().

Net: most-recently-played album lands at swiper position 1.

Squashed from: c265a501 25305582 733eda39 fbfc3e97 fa10606c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant