
Harden stream process restarts: exponential backoff + ALSA readiness gate #1097

Open
stamateviorel wants to merge 2 commits into micro-nova:main from stamateviorel:fix/process-monitor-restart-hardening

@stamateviorel

What does this change intend to accomplish?

Two related robustness fixes for streams/process_monitor.py:

  1. Exponential restart backoff. The monitor restarted a crashing child in a tight zero-delay loop: a process that dies instantly (e.g. its ALSA device can't be opened) gets respawned as fast as the loop runs, flooding logs and wearing the SD card. Consecutive fast failures now back off exponentially from 2s up to a 30s cap, and the backoff resets once a run survives 10 seconds.

  2. ALSA readiness gate. When the monitored player exits because its loopback is still held (by a previous instance that has not yet released it, or by the bad dmix state described in #957, "Spotify controlled loopback getting into a bad state"), every respawn dies instantly with EINVAL and the stream stays silent; we measured a 5.5-hour outage from a single stream re-assign. If the monitored command plays to an ALSA loopback (-o lb*), the monitor now probes the device with a 1s silent aplay (the same open path the player uses) before each spawn and waits quietly until it opens. On the first failed probe it logs the current /dev/snd holders via fuser so the journal explains the wedge; after four failed probes it kills a stale leftover player that targets the same device (strict match on binary name + exact -o argument, never the monitor's own child). Commands without -o lb* (e.g. alsaloop) are completely unaffected.
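
For reviewers, the backoff in item 1 could be sketched roughly as below. This is an illustration, not the code in this PR: the class name `RestartBackoff`, the doubling factor, and the choice to restart immediately after a stable run are assumptions. Only the 2s floor, 30s cap, and 10-second reset threshold come from the description above.

```python
# Hypothetical sketch of the restart backoff; names are illustrative only.
BASE_DELAY_S = 2.0    # first backoff step (from the PR description)
MAX_DELAY_S = 30.0    # backoff ceiling (from the PR description)
STABLE_RUN_S = 10.0   # a run at least this long resets the failure count


class RestartBackoff:
    """Tracks consecutive fast failures and computes the next restart delay."""

    def __init__(self) -> None:
        self.fast_failures = 0

    def next_delay(self, run_seconds: float) -> float:
        """Return seconds to wait before respawning a child that ran for run_seconds."""
        if run_seconds >= STABLE_RUN_S:
            # The child survived long enough: forget earlier fast failures.
            # Assumption: a stable run is followed by an immediate restart.
            self.fast_failures = 0
            return 0.0
        self.fast_failures += 1
        # Assumption: the delay doubles per consecutive fast failure, capped at 30s.
        return min(BASE_DELAY_S * 2 ** (self.fast_failures - 1), MAX_DELAY_S)
```

With these assumptions, a child that keeps dying instantly is retried after 2, 4, 8, 16, 30, 30, ... seconds, and a single run of 10 seconds or more puts the next fast failure back at 2 seconds.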

Both are running in production on a real AmpliPi; the gate has already fired twice (loopback briefly held at service start) and recovered in one 2-second retry instead of crash-looping.
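
The readiness gate from item 2 might look something like the following sketch. All function names here (`loopback_target`, `probe_device`, `wait_until_ready`) are hypothetical, the exact aplay arguments are a guess at what "a 1s silent aplay" means, and it assumes `-o` and the device arrive as separate argv entries. The 2-second retry interval, the first-failure holder logging, and the four-probe threshold follow the description; the holder logging and stale-player kill are left as caller-supplied callbacks.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def loopback_target(cmd: Sequence[str]) -> Optional[str]:
    """Return the device following -o if it names a loopback (lb*), else None."""
    for flag, value in zip(cmd, cmd[1:]):
        if flag == "-o" and value.startswith("lb"):
            return value
    return None


def probe_device(device: str) -> bool:
    """Try to open the device by playing one second of silence through aplay."""
    try:
        result = subprocess.run(
            # Assumed arguments: raw 16-bit stereo silence read from /dev/zero.
            ["aplay", "-q", "-D", device, "-d", "1", "-t", "raw",
             "-f", "S16_LE", "-r", "44100", "-c", "2", "/dev/zero"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def wait_until_ready(
    device: str,
    probe: Callable[[str], bool] = probe_device,
    sleep: Callable[[float], None] = time.sleep,
    retry_s: float = 2.0,
    stale_after: int = 4,
    on_first_failure: Optional[Callable[[str], None]] = None,
    on_stale: Optional[Callable[[str], None]] = None,
) -> int:
    """Block until the device opens; return the number of failed probes."""
    failures = 0
    while not probe(device):
        failures += 1
        if failures == 1 and on_first_failure:
            on_first_failure(device)   # e.g. log the /dev/snd holders via fuser
        if failures == stale_after and on_stale:
            on_stale(device)           # e.g. kill a stale player on this device
        sleep(retry_s)
    return failures
```

A monitor built this way would call `loopback_target(cmd)` once and only run `wait_until_ready` when it returns a device, so commands without `-o lb*` skip the gate entirely. Injecting `probe` and `sleep` keeps the wait loop testable without audio hardware.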

Checklist

  • Have you tested your changes and ensured they work? (in production; gate exercised live by holding the loopback with aplay through a service restart)
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?
  • If applicable, have you updated the CHANGELOG?
  • Does your submission pass linting & tests? (python -m py_compile clean; happy to fix anything CI flags)

stamateviorel and others added 2 commits June 10, 2026 14:33
process_monitor restarted a crashing child in a tight zero-delay loop:
a process that dies instantly (e.g. its ALSA output device cannot be
opened) gets respawned as fast as the loop runs, flooding logs and
wearing the SD card. Track consecutive fast failures and back off
exponentially (2s..30s), resetting once a run survives 10 seconds.

Signed-off-by: Stamate Viorel <stamate.viorel@gmail.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

When the monitored player exits because its loopback is still held (a
not-yet-released previous instance, or the dmix state from micro-nova#957), every
respawn dies instantly with EINVAL and the stream stays silent - we saw
a 5.5h outage from one stream re-assign. If the monitored command plays
to an ALSA loopback (-o lb*), probe the device with a 1s silent aplay
(the same open path the player uses) before each spawn and wait quietly
until it opens. On the first failed probe, log the current /dev/snd
holders via fuser so the journal explains the wedge; after four failed
probes, kill a stale leftover player process that targets the same
device (strict match on the binary name and exact -o argument, never
the monitor's own child). Processes without an -o lb* argument are
completely unaffected.

Signed-off-by: Stamate Viorel <stamate.viorel@gmail.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>