Bash & systemd Reliability — truncate races, jq quoting, Restart=on-failure#145
Open
wowa1990 wants to merge 14 commits into
Open
Bash & systemd Reliability — truncate races, jq quoting, Restart=on-failure#145wowa1990 wants to merge 14 commits into
wowa1990 wants to merge 14 commits into
Conversation
The pattern `/usr/bin/cat <<< $(/usr/bin/jq … FILE) > FILE` showed up
55 times across 12 scripts in scripts/mupibox/. It writes back through
two indirections (subshell capture + here-string), the outer `> FILE`
redirect opens FILE before cat finishes, and any jq error or partial
output leaves FILE truncated to empty. Concrete failure mode: a single
jq glitch on save wipes mupiboxconfig.json, monitor.json, network.json
or data.json — the box is then unbootable until manual recovery.
Replace every occurrence with the same atomic-rename pattern:
_TMP="${TARGET}.tmp.$$"
/usr/bin/jq … > "${_TMP}" && mv "${_TMP}" "${TARGET}" || rm -f "${_TMP}"
Same-fs rename is atomic on ext4, jq's exit code now gates the swap,
and a half-written tempfile is unlinked on failure so /tmp doesn't
fill up with leftovers. For the scripts that updated multiple fields
in the same target file (setting_update.sh: 16 writes; mupi_start_led
.sh: 5 writes; get_network.sh: 9 writes), bundle the field updates
into one jq pipeline per target — fewer SD-card writes, single
truncate-race window per pass.
Files touched:
albumstop.sh, albumstop_activator.sh: monitor.json init + set
check_network.sh: network.json onlinestate (already touched in
Phase 2 for the offline filter, now atomic too)
data_clean.sh: TMP_DATA append + sed-truncate-race
get_monitor.sh: monitor.json On/Off transitions
get_network.sh: 9-field network.json bundle
m3u_generator.sh: data.json append for audiobook/music/other
mupi_start_led.sh: TMP_LEDFILE init + dim-mode toggles
repair_config.sh: SONOS_CONFIG hostname
setting_update.sh: SONOS_CONFIG (6→1) + SPOTIFYCONTROLLER_CONFIG (6→1)
set_deviceid.sh: also picks up HIGH-9 (string-test → -gt) and
HIGH-10 (sudo-redirect mismatch → /tmp staging + sudo install)
set_hostname.sh: SONOS_CONFIG hostname
While here:
HIGH-14 — `sudo echo "[]" FILE` (missing the `>` redirect, file
never created) is hit in albumstop.sh, check_network.sh,
get_network.sh, get_monitor.sh and albumstop_activator.sh.
Replace with a plain `echo -n "{}" > FILE` (no sudo: every
target lives in /tmp or under /home/dietpi/.../config/, both
directly writable by the dietpi user the script runs as) and
seed as `{}` (object) since every consumer reads these files
via `.onlinestate` / `.monitor` / `.albumStop` etc. The earlier
`sudo tee` variant produced a root-owned seed that the
subsequent jq+mv (running as dietpi) silently failed to
replace, leaving the seed wrong forever. Tighten the existence-
guard in check_network.sh / get_network.sh to also rebuild on
empty-or-wrong-shape so existing boxes recover without manual
cleanup.
HIGH-16 — data_clean.sh `[[ ! -f \${j} ]]` matched the literal-
glob fallback when the iterated cover-dir was empty (then
`rm -R "${j}"` tried to rm a path containing a literal `*`).
Restrict to actual directories with `[[ -d \${j} ]]`.
…upplicant.conf
Two distinct bugs in the wifi-add loop, both reachable from the
frontend's "save WiFi" form.
HIGH-11: `jq -r .[].ssid ${MUPIWIFI}` iterated the wlan.json array and
emitted every entry's SSID as newline-separated values. With more than
one queued WiFi (admin saves twice quickly, or a corrupt wlan.json
file with leftover entries), the SSID variable became a multi-line
string. The downstream `wpa_passphrase "${SSID}" "${PSK}"` invocation
got a multi-line first arg and wrote garbage into wpa_supplicant.conf.
Pin to `.[0].ssid // empty` so we always handle exactly one entry per
loop iteration; the existing `sudo rm ${MUPIWIFI}` at the bottom of
the iteration drains queued entries on subsequent passes.
HIGH-12: the open-network branch wrote `ssid="${SSID}"` unquoted into
wpa_supplicant.conf via tee. An SSID containing `"` or a newline would
break out of the quoted ssid value and let the caller inject arbitrary
`network={…}` blocks into a root-owned config file (a security boundary
because wpa_supplicant runs as root). Reject SSIDs with literal
newlines or NUL up front, escape `\` and `"` for the wpa_supplicant.
conf string format, and emit via printf %s rather than echo so the
shell can't reinterpret the value mid-flight. PSK-network branch was
already safe — wpa_passphrase escapes its arguments internally.
…the backup `rm "$file.bak"` errors when the .bak file doesn't exist (which is exactly the case on a fresh box's first run of optimize_wifi.sh). Today the error is only stderr noise, but if anyone later tightens this script with `set -e` the rm would abort before the cp can create the new backup — leaving no rollback path while we mutate /etc/network/interfaces and /etc/wpa_supplicant/wpa_supplicant.conf as root. `rm -f` is idempotent: silently does nothing on missing file.
… indexer
The lock-file pattern was `touch DATA_LOCK; …work…; rm DATA_LOCK` with
no signal handler. If perl, jq, or grep died mid-pass — out of disk,
malformed data.json, the box getting SIGTERM during shutdown — the
final `rm DATA_LOCK` never ran. Subsequent invocations see the stale
lock at the top of the script and exit immediately, so the index
never gets regenerated. Symptom on the box: a freshly-added audiobook
folder shows up in the cover view but doesn't get an `index` field,
which means resume / track-selection logic can't address it.
Add `trap 'rm -f "${DATA_LOCK}"' EXIT INT TERM HUP` right after touching
the lock so the cleanup runs on any script-exit path. The trailing
explicit `rm ${DATA_LOCK}` is now redundant but kept (under a comment)
to preserve the "Index is finished" log ordering people may grep for.
`jq -r .mupibox.mediaCheckTimer` on a config that doesn't have the key
yields the literal string "null". `sleep null` returns immediately
("invalid time interval"), so the while-true loop iterated as fast as
the CPU could go — a fresh-image box with no config save yet would
peg one core at 100%, ramp the fan, and waste battery. Add a guard
that substitutes 60s when CHECK_TIMER is empty or "null". 60s matches
the cadence the audit / setup-docs assume m3u_generator runs at.
…diobooks The filter that builds offline_data.json / offline_resume.json read jq '.[] | select(.type != "spotify") | select(.type != "radio" | select(.type != "rss")' Two issues: an unclosed paren after `"radio"` and a stray pipe inside the select() instead of between selects. jq exits with a parse error, the script swallows it via `$(...)` capture, and offline_data.json ends up as `[]` — the offline-mode UI then has nothing to show, even though local audiobooks are sitting on disk. Reported anecdotally by parents whose box loses Wi-Fi and "all my child's audiobooks are gone." Replace all six identical occurrences with a single combined select(): jq '.[] | select(.type != "spotify" and .type != "radio" and .type != "rss")' Same intent (drop online-only entries) but actually parses, and reads clearer than three chained selects.
The previous symlink logic only fired on a state TRANSITION between
ticks: `if [ "${ONLINESTATE}" != "${OLDSTATE}" ]`. After a process
restart while the box was already online, the loop's first iteration
saw ONLINESTATE=online with OLDSTATE=starting, entered the block, but
then hit `[[ ${OLD_ONLINESTATE} != "online" ]]` — and OLD_ONLINESTATE
was loaded once from the persisted /tmp/network.json, which already
held "online" from the previous session. Result: the inner branch
never ran, the symlink was left pointing wherever it was (often at
offline_data.json from a prior offline session), and the API served
the offline shape forever.
Visible only after Phase 3's HIGH-19 jq-filter fix made offline_data
.json actually contain (correctly-filtered, Spotify-less) entries.
Before HIGH-19 the file was empty so a stale symlink read as "no
data" — annoying but indistinguishable from the box just being slow.
Post-HIGH-19 the symlink-to-offline rendered local-only content even
when online: user reported "no Spotify albums anymore."
Replace with an idempotent ensure_symlink helper that does
"reconcile to expected target" every tick. Cheap when no-op (one
readlink syscall), self-healing when wrong, and removes the
ONLINESTATE-vs-OLDSTATE / OLD_ONLINESTATE coordination problem
entirely. The OLDSTATE comparison further down (for the network.json
.onlinestate write) is unaffected; only the symlink branch is
reworked.
…(MED-3 / LOW-1)
Two trivial bash one-liners that broke quietly enough that nobody
noticed until the audit.
MED-3: idle_shutdown.sh tested `[ "${TELEGRAM}" ]` which is bash's
"non-empty string" check. jq -r on a bool emits the literal strings
"true" or "false" — both of which are non-empty. So `telegram.active
= false` in mupiboxconfig.json was treated as "telegram is enabled"
and the box still tried to fire a Telegram notification at idle-
shutdown. Compare explicitly to "true".
LOW-1: librespot-start.sh tested `[[ ! -z "$var" ]]` where `$var`
was never set anywhere in the script. The test was therefore always
true under set -u -less defaults (treats unset as empty), so
LIBRESPOT_USERNAME got exported as `null` (jq -r's missing-key
output) even when the user had a real username, leading to authl
flailing later. Test `$username` (the actual variable) and filter
the `null` literal.
The five apt-get-touching service toggles (VNC enable/disable, Samba enable/disable, FTP enable/disable, plus an apt-get install in the FTP-enable path) all just exec'd directly. dpkg has an exclusive lock at /var/lib/dpkg/lock-frontend, so any concurrent invocation fails with "Could not get lock" — but PHP's exec discards stderr by default, the page swallows the failure, and the user gets a "Service enabled" toast for an action that didn't actually run. Wrap each apt-touching command in `flock -w 120 /tmp/.mupibox.apt.lock bash -c '<cmd>'`. flock holds an advisory lock and waits up to 120 seconds for it to be free, so a second admin click queues behind the first instead of silently failing. Lock file under /tmp so it auto-cleans on reboot. systemctl-only paths (BT-autoconnect, vnc disable's stop steps) don't go through apt and stay unwrapped.
…oter reboot lock (B2 / B4 / R3-B-5)
Three Bash + PHP cleanups around configuration plumbing that could
each silently produce broken state.
B2 — m3u_generator.sh.
- The Bash test `if [$setArtistCover == 1 ]` (no space after `[`)
was always false: bash interpreted it as a command named
`[$setArtistCover` which doesn't exist, exit 1. So the
artist-cover-only branch never fired, and audiobook artists with
a folder-level cover but no per-album cover always got the
MuPiLogo fallback. Add the space.
- Six `jq '. += [{"artist": "'"${artist}"'", …}]'` calls spliced
artist/title into JSON via shell concatenation. Any title
containing `"` (e.g. `Foo "Bar" Baz`) produced syntactically
broken JSON; jq rejected it; pre-Phase-3 (HIGH-8) the > redirect
truncated data.json to empty; post-Phase-3 the tempfile+rename
pattern leaves data.json untouched but the entry doesn't get
added. Replace all six with `--arg` so jq itself does the
JSON-encoding.
B4 — setting_update.sh.
- `ls -l ${THEME_FILE} | grep ${NEW_THEME}` was unquoted: a theme
name with a space split across grep arguments, a name with `-`
started looking like a flag. Quote both, plus `-F --` so theme
names with regex metachars match literally.
- `xargs rm <<< ${THEME_FILE}` is needlessly complex for a fixed
path; `rm -f` does the same with -f for idempotence.
- sed-injection via the hostname / timeout fields: `sed 's/.*pat.
*/.../g'` uses `/` as the delimiter, but a hostname containing
`/` would break out of the pattern. Switch to `|` as the
delimiter (extremely unlikely in a hostname or numeric value),
plus normalise hostname through `tr -dc 'A-Za-z0-9.-'` and
timeout through `tr -dc '0-9'` so injection can't land via the
config layer either.
R3-B-5 — footer.php reboot/shutdown double-click.
- `sudo su - -c "sleep 5; restart.sh &" &` had no locking. A double-
click on the reboot button (or two admin tabs both submitting
in quick succession) spawned two restart.sh processes whose
cleanup steps fought, leaving the box in a half-rebooted state.
Wrap in `( flock -n 9 || exit 0; … ) 9>/tmp/.mupibox.reboot.lock
&` so a second invocation while the first is still running
silently becomes a no-op. Same fix for the shutdown path with
a separate lockfile.
B5 (set -euo pipefail global) deferred — too many existing scripts
rely on continue-on-error patterns for best-effort cleanup; rolling
out set -e without per-script testing is a regression risk that
outweighs the linter benefit.
Two compounded bugs in scripts/mupibox/check_network.sh:289 (one of
them was masking the other, so they get fixed together):
1) Subshell pretending to be a string-test:
`if ( $(/usr/bin/python3 check_network.py) == ${TRUESTATE} )` is
a bash subshell — bash captures the python script's stdout, then
runs it as a command with `==` and `${TRUESTATE}` as arguments.
Whatever the python emits is interpreted as a command name; the
subshell returns 127, the if-condition is permanently false. The
else-branch (offline mode — symlinks data.json to offline_data.json
with spotify/radio/rss filtered out) ran every tick regardless of
network state.
2) Value mismatch hidden by the subshell:
check_network.py prints the literal "true"/"false", but the bash
variables ${TRUESTATE}/${FALSESTATE} are "online"/"offline". Even
a syntactically-correct string-test would always have been false.
The subshell bug masked this — first apparent only after the
subshell was fixed and the symlink kept pointing at offline_data.json
even with internet reachable.
Fix uses the standard string-comparison idiom against the actual
python output literal:
if [ "$(/usr/bin/python3 …)" = "true" ]; then
ONLINESTATE keeps its "online"/"offline" form so downstream consumers
of /tmp/network.json see no behaviour change.
Squashed from: ecc81a9c 2b4fa289
… (AR5-8) Without a Restart= directive, a crash leaves the service permanently dead until the next reboot. For four of these the silent-failure consequences are non-trivial: - mupi_change_checker.service: the watcher that drives Quiet-Hours and Playtime-Cap evaluation. If it crashes mid-day (bash edge case, file I/O error on /etc/mupibox/, etc.), the kid-protection guards stop firing silently — the box would play through quiet windows and ignore the daily limit until someone notices and reboots. Highest-impact service of the four. - mupi_telegram.service: the bot receiver. We saw the AR5-2/7 IndexError crash live in Phase 7 — without restart, parents' /release and /quietnow commands would just stop working without explanation. - mupi_idle_shutdown.service: the auto-shutdown timer. Silently broken means the box stays on forever, draining the battery overnight. - mupi_check_internet.service: the online/offline state driver. Silent failure leaves the box stuck in whatever state it was when the crash happened (probably "online" — masks real outages). All four get Restart=on-failure + RestartSec=10. mupi_startstop.service is intentionally excluded — it's Type=oneshot RemainAfterExit=true, so Restart= would conflict with the oneshot semantics. mupi_hat.service already has Restart=on-failure from upstream. On deploy: copy each unit to /etc/systemd/system/, run `sudo systemctl daemon-reload`, then restart-or-leave each. No need to manually restart any of them right now (they're running already).
start_mupibox_update.sh ran the destructive `rm -R Sonos-Kids- Controller-master/` before verifying the downloaded archive was usable. A flaky internet, a 404, a corrupted zip, or a kill mid- unzip would leave the box with no installation and no rollback. Pre-flight checks before any destructive op: - mupibox.zip non-empty after wget - unzip -t integrity check on mupibox.zip - MUPI_SRC dir exists after outer unzip - inner deploy.zip non-empty - /tmp/data.json backup actually landed (resume/library state is non-recoverable; cover/active_theme.css can be regenerated) Atomic swap around the destructive step: - mv old install to Sonos-Kids-Controller-master.upd-bak instead of rm -R - if the inner unzip fails, restore from .upd-bak and pm2 start the old server, then exit 1 with a clear message - only on successful extract is the .upd-bak dir removed fail_update() helper surfaces the abort to the dialog/whiptail UI so the user actually sees what went wrong instead of staring at a seemingly-stuck install.
…typo (MED-6) Two bugs in the MuPiBox auto-update script that bit users running it. (1) `rm -R /var/www/*` then `unzip` of the bundled www.zip wiped any files the admin had customised: - active_theme.css → the theme picker's selected theme - cover → symlink to /home/dietpi/MuPiBox/media/cover - theme-data/ → background images for custom themes After an update, the box reverted to the default theme and lost the cover symlink (so cover art for local audiobooks vanished from the admin UI's Media view). User then had to re-pick the theme and re-create the cover symlink manually. Mirror scripts/dev/deploy_box.sh's preserve-then-restore pattern: copy active_theme.css / cover / theme-data into a /tmp/mktemp dir before the rm, copy them back after the unzip. Belt-and-braces for the cover symlink — recreate it from scratch only if the preserve step didn't already drop a symlink in place. (2) Path typo: the script copied media files to `/home/dietpiMuPiBox/sysmedia/...` (missing slash between dietpi and MuPiBox) — a path that doesn't exist on a real box. cp emits an error to fd 3 (the script's logfile) and silently moves on, so the destination files never got refreshed during updates. Fixed all four typo'd paths to /home/dietpi/MuPiBox/sysmedia/. Also fixed `{STEP}` → `${STEP}` in the progress message that displayed the literal text "{STEP}" to the user.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Was es macht
> $tmp && mv $tmp $targetstatt> $targetmid-write)jq-Filter und Quoting in mehreren Skripten korrigiert (SSID-Sonderzeichen inadd_wifi.sh, Trap-Cleanup inadd_index.sh)change_checkerdefault-timer-Bugidle_shutdowntrue-vs-truthy-Check (Bash-Subshell-Bug aus Audit-Runde-5)apt updatemehr)Restart=on-failurefür die vier crash-relevanten Services, u.a.change_checker(schützt Quiet-Hours und Playtime, falls die Unit kippt)Architektur (kurz)
Reine Robustness-Patches — keine API/Config-Änderungen. Skripte sind weiterhin abwärtskompatibel zu vorhandenen
mupiboxconfig.json-Werten.Test in Codespaces / lokal
Die meisten Skripte sind hardware-/SD-spezifisch und auf realer Box zu testen. Was in Codespaces geht:
bash -n scripts/mupibox/*.sh(Syntax-Check, alle clean)shellcheckauf die geänderten Skripte (falls installiert) → erwartet keine neuen Warnings gegenüberupstream/mainadd_wifi.shmit Mock-SSID"O'Reilly & Friends; echo pwn"aufrufen → erwartet quoted output, kein InjectHinweis: Vollständiger Test braucht reale Box:
sudo systemctl kill -s SIGSEGV change_checker.serviceund prüfen dass die Unit von selbst startet (systemctl status→ active wieder).Offene Punkte
Restart=on-failuremitRestartSec=5und keinStartLimitBurst— falls dir das zu offen ist, gerne enger setzen.