Backend API Reliability — atomic locks, self-heal, hang-on-missing-file#146
Open
wowa1990 wants to merge 12 commits into
Open
Backend API Reliability — atomic locks, self-heal, hang-on-missing-file#146wowa1990 wants to merge 12 commits into
wowa1990 wants to merge 12 commits into
Conversation
…file; 500 instead of throw on /api/addwlan
Three GET endpoints had `if (fs.existsSync(file)) { ...readFile... }`
with no else branch, so when the file was missing the request just
sat there with no response — the frontend's loading spinner spun
forever and HTTP keep-alive eventually timed it out client-side.
Mirror the /api/resume pattern (already fixed earlier): early-return
[] (or {} for /api/mupihat which is an object) when the file is
absent. /api/data hits this on a fresh box before active_data.json is
generated; /api/mupihat hits it on every box without a MuPiHAT board;
/api/wlan hits it before the first scan.
/api/addwlan ran `jsonfile.writeFile(...)` and inside the node-style
callback did `if (writeError) throw error` (note: throwing the OUTER
`error` variable from the read step, not even the relevant one).
Throws inside async callbacks bypass Express's error middleware and
crash the backend-api process — the player goes offline until pm2
restarts it. Send a 500 with a log line instead, the same shape every
other write endpoint already uses.
The unlink in /api/addresume and /api/editresume ran outside the async readFile callback, so the lock was released before the write completed — two concurrent calls could clobber each other. Move unlink into the writeFile callback (success and error paths) and short-circuit on lock acquisition failure. Dedup matched on `req.body.id` only, but Media.id is optional — Library and RSS items often have no id, so id-less entries either collapsed onto the first match or piled up duplicates. Replace with a stable composite key (type + playlistid|showid|audiobookid|id || artist::title) and use it in both endpoints. /api/editresume keeps a defensive index fallback for the edge case where the caller really has a usable index.
active_resume.json is a symlink set up by check_network.sh once the network state is known. In the window between boot and that script's first iteration the symlink doesn't exist yet — the previous code took neither branch and the response hung, leaving the resume page stuck on Loading. Return an empty array in that case, mirroring the existing read-error fallback.
Callers of /api/resume (resume.page lookup, addresume read pre-step) always expect an array. On a fresh box resume.json doesn't exist until check_network.sh creates it or the first save lands, and a 404 in that window crashed any caller doing .length / .findIndex on the response. Return an empty array — same convention as /api/data and (after the previous commit) /api/activeresume. Read errors now also degrade to [] instead of 500 to keep the contract uniform.
Same bug as the previous resume-lock fix: fs.unlink on dataLock ran outside the async readFile callback, so the lock was released before the write completed. Two concurrent admin saves could clobber each other. Generalised releaseResumeLock into releaseLock(lockPath, context) and rewrote the three data.json endpoints to release the lock from inside the writeFile callback (success and read-error paths), with explicit short-circuiting on lock-acquisition failure. Resume endpoints now use the same shared helper.
addresume/editresume bailed out with "error" on any read failure — both the ENOENT case (fresh box, never saved) and the JSON-parse case (one bad write or filesystem hiccup leaves the file unparseable). Either way, no further save would land until somebody manually fixed the file. Add readResumeOrRecover(): missing file is treated as []; corrupt file is renamed to resume.json.bak.<epoch> (so it can be inspected later) and the live save proceeds against an empty array. Losing one session of accumulated entries on rare corruption beats wedging the feature for the rest of the box's lifetime.
A "weiterhören" suggestion at 99% of the last track is just amusing —
the kid finished the audiobook. Hook the existing mplayer playlist-finish
event in the backend-player to POST /api/deleteresume to the backend-api
with a minimal {type, artist, title} body; the new endpoint matches by
composite key (same logic as addresume) and idempotently removes the
entry. Body is best-effort — if the call fails, the resume entry just
stays around, no playback impact.
Spotify and RSS are intentionally skipped: Spotify has no clean end-of-
album signal through the mplayer-wrapper, and currentMeta doesn't carry
enough state for an RSS feed's composite key. Both can be revisited if
they actually become a problem in practice.
server.ts had five copies of the same broken lock-acquire pattern:
if (fs.existsSync(lockPath)) { send('locked'); return }
try { fs.openSync(lockPath, 'w') }
catch (err) { send('error'); return }
Two problems:
M8 — race between existsSync and openSync: another worker can win the
race in between, both threads then "hold" the lock simultaneously.
Plus 'w' opens with O_CREAT but TRUNCATES if the file already
exists, so openSync never actually fails — the "lock" is just a
marker file relying on existsSync + releaseLock cooperating.
AR5-6 — pm2 mid-write crash leaves the lock file on disk forever.
/api/add|edit|delete|addresume|deleteresume all return 'locked'
until the box reboots. No automatic recovery path exists.
Fix: single acquireLock(lockPath, context) helper that does atomic
test-and-set with O_EXCL | O_CREAT (Node 'wx' flag). On EEXIST it
checks the lock's mtime: older than LOCK_STALE_MS (30s) -> steal it
once and retry; younger -> caller sees 'locked' as before. Returns
discriminated 'acquired' | 'locked' | 'error' so callers route their
responses cleanly.
Plus a one-time start-up pass at the top of server.ts: any lock file
older than LOCK_STALE_MS is removed before request handlers come
online, with a warn log so the cleanup is visible in pm2 boot
output. This catches the pm2-crash-restart case (which is the
common AR5-6 trigger) without waiting for the next user request to
notice the stale lock. Crashes that happen after start-up are
handled by the in-band stale-lock recovery in acquireLock.
All five call sites switched to the helper. No behaviour change for
the happy path — same 'locked' / 'error' / 'ok' responses, same
release pattern. tsc --noEmit clean.
… (AR5-18) mplayer playlist-finish fires backend-player -> POST /api/deleteresume, which removes the just-completed album from resume.json so the kid isn't offered "weiterhören" at second 59:58 of an album they just heard end-to-end. Simultaneously the frontend's player page observes the playback state going to paused/stopped and POSTs /api/addresume to save the last position. The file lock serialises the two writes, but their order is non-deterministic. When addresume wins after deleteresume, the entry is resurrected -- defeating the entire point of the playlist-finish cleanup. Fix: track recently-deleted composite keys in an in-memory Map keyed by resumeKeyOf(media). /api/deleteresume always notes the deletion (even on no-match -- the frontend's pending addresume comes either way). /api/addresume consults the map before insert and silently skips with 200 ok if the same key was deleted within 2500 ms. Map self-cleans on lookup (stale entries removed when checked) plus a periodic 60s sweep so a long-running backend doesn't accumulate dead keys forever. setInterval marked .unref() so the timer doesn't hold the event loop open during graceful shutdown. tsc --noEmit clean.
The Spotify SDK's underlying fetch and our embed-page fetch both lack a built-in timeout. A TCP-level stall (silent upstream) leaves the promise pending forever, the queueRequest pendingRequests entry never settles, and isProcessingQueue stays true — so every subsequent same-key request also hangs and the queue stops processing entirely. - spotify-api.service: wrap rateLimitedRequest in a 20s Promise.race timeout so SDK calls always settle. - spotify-media-info.service: pass AbortSignal.timeout(20000) to the embed-URL fetch to match. 20s is generous; the slowest legitimate response we observe is ~3-4s for an audiobook with hundreds of chapters.
Two-layer fix for the 24s resume-page freeze. Frontend (revert MED-10): media.service.ts gates getRssFeed back to non-resume entries only. The MED-10 (Phase 4) fix had dropped the gate to "enrich RSS resume entries with fresh feed data", but the `id` field of an RSS resume entry is the episode's *MP3 URL*, not the channel's RSS feed URL. The enrichment fetch streamed multi- megabyte MP3 audio through the rss-parser path until MED-2's 5MB size cap aborted with HTTP 413 after ~3.9s per entry. Six RSS resume entries on the resume page = ~24s of frozen UI. Every field needed to render the resume tile is already on disk; nothing to enrich for resume — just rebuild the gate. Backend (defence in depth): /api/rssfeed now probes with HEAD before the full GET. Native fetch (not ky — ky was silently failing on the 301-redirect chain in this codepath, so the probe never fired). Rejects with 415 on non-XML/RSS content-type or 413 if advertised content-length exceeds 5MB, both in <500ms. Falls through to GET on HEAD failure so origins that don't support HEAD still work. Catches future similar pattern breaks no matter which caller fires them. Squashed from: 9fc33bf8 e232eb2d
The resume page used a blind `media.reverse()` on the active-resume
fetch, matching file insertion order. /api/addresume's update-in-
place pattern leaves an entry's array index untouched when it's
already present — so a freshly-played album never moved to position
1 unless it was new. User-visible: "letztes gehörtes Album steht
nicht oben, Reihenfolge nicht nachvollziehbar".
Backend (server.ts):
- /api/addresume sets lastPlayedAt: Date.now() on every save.
- backfillLastPlayedAt() helper synthesises stamps for legacy
entries that pre-date the field. idx 0 → largest synthetic
stamp, idx N → smallest (because update-in-place typically
left the most-recently-replayed entry at idx 0; this places
it at the top after DESC sort). Real Date.now() saves always
beat synthetic stamps.
- /api/{addresume,resume,activeresume} all run the helper —
/addresume persists, the GET endpoints lazy-backfill in-
memory so the visible order is correct from the first read
after deploy, before any save fires.
Frontend:
- media.ts: add `lastPlayedAt?: number` to Media interface.
- utils.ts: lastPlayedAt joins ExtraDataMedia merge keys (defence
in depth — the spotify/audiobook/episode/playlist services
don't actually call copyExtraMediaData today, but if a future
caller does we want the field to flow through).
- media.service.ts: the overwriteArtist operator (applied on
every iif-branch's output in updateMedia) now also carries
lastPlayedAt + isResume from the original item. THIS is the
real fix — getMediaByID and siblings build a fresh Media from
upstream Spotify API data without copyExtraMediaData, and
would otherwise drop the field.
- fetchActiveResumeData sorts DESC by lastPlayedAt instead of
blind reverse().
Net: most-recently-played album lands at swiper position 1.
Squashed from: c265a501 25305582 733eda39 fbfc3e97 fa10606c
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Was es macht
/api/add/api/delete/api/edit/api/addresume/api/deleteresumeper zentralemacquireLock-Helper mit atomarem'wx'+ Stale-Lock-Recovery (AR5-6 / M8)resume.json(Parse-Fail → Backup + Neustart mit leerem Array statt 500er)/api/data,/api/mupihat,/api/wlanhängen sich nicht mehr auf, wenn die Quell-Datei fehlt/api/rssfeedmacht jetzt HEAD-Probe vor GET (verhindert Hang auf große Feeds)/api/activeresume+/api/resumetolerant für fehlende Filesdeleteresumeundaddresumedurch eine in-Memory-Map abgedichtet (AR5-18)Architektur (kurz)
Zentraler
acquireLock(path, mode)-Helper inserver.ts, der überall die Lockfile-Handhabung übernimmt. Stale-Lock-Recovery löst die Klasse von „Server hängt, wenn er beim Schreiben gekillt wurde". Resume-Map ist process-local und verschwindet bei Neustart — bewusst keine Persistenz.Test in Codespaces / lokal
npm installcp config/templates/www.json src/backend-api/config/config.json+monitor.jsonkopierennpm run serve:backend-apistartenmonitor.jsonwährend Backend läuft →curl http://127.0.0.1:8200/api/mupihat→ erwartet 200 mit Empty/Default-Payload, kein Hangresume.json(echo "{not json" > resume.json) → Backend-Restart → erwartet Auto-Recovery +resume.json.bak.<ts>curl -X POST /api/addresumemit gleicher Lib-ID feuern → erwartet ein-und-nur-eines committed, kein Lockfile-Leftovercurl http://127.0.0.1:8200/api/rssfeed?url=http://httpstat.us/200?sleep=30000→ erwartet zügiger Timeout statt 30s HangOffene Punkte
acquireLockist klar markiert.