fix(spotify_connect): reload provider on broken pipe errors #2997
prydie wants to merge 1 commit into music-assistant:dev
Conversation
The librespot binary enters a "Zombie" state on broken pipe errors: it stays alive but is unable to play audio. This patch updates the existing stderr processor to detect the "Broken pipe" error and trigger a full provider reload, which forces a complete teardown and fresh start of the provider instance.
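The detect-and-reload idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name watch_stderr, the callback parameter, and the sample log lines are all hypothetical. In the real provider the callback would presumably wrap self.mass.load_provider_config(self.config), and whether that call is awaitable is an assumption here.

```python
import asyncio
from typing import AsyncIterable, AsyncIterator, Awaitable, Callable

# Substring this PR says it watches for in librespot's stderr output.
BROKEN_PIPE_MARKER = "Broken pipe"


async def watch_stderr(
    lines: AsyncIterable[bytes],
    on_broken_pipe: Callable[[], Awaitable[None]],
) -> bool:
    """Read stderr lines until the marker appears, then fire the callback once.

    Returns True if the marker was seen, False if the stream ended without it.
    """
    async for raw in lines:
        line = raw.decode("utf-8", errors="replace").rstrip()
        if BROKEN_PIPE_MARKER in line:
            # Stop reading here: the provider is about to be torn down anyway.
            await on_broken_pipe()
            return True
    return False


async def _demo() -> None:
    async def fake_stderr() -> AsyncIterator[bytes]:
        # Made-up lines for illustration, not real librespot output.
        yield b"starting up\n"
        yield b"write failed: Broken pipe (os error 32)\n"
        yield b"never reached\n"

    async def fake_reload() -> None:
        # Hypothetical stand-in for the provider reload call.
        print("reloading provider")

    seen = await watch_stderr(fake_stderr(), fake_reload)
    print("marker seen:", seen)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Returning after the first match keeps the callback from firing repeatedly if the same error is logged several times before the reload completes.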
|
Raw failure mode logs: |
|
Another set of logs, from trying to start the music this morning, showing this patch being exercised: repro.log |
|
I don't dislike this approach, but have you tried to investigate the root cause of those broken pipes? Might be better to prevent broken pipes rather than 'turning it off and on again'? |
|
I do think this could become a band-aid: it removes the friction that would otherwise push us to be more rigorous about edge-case bugs. Ultimately, though, once librespot is emitting this log line, our only path to recovery is restarting the process, whether manually or automatically. I'm running with this patch deployed at the moment and it's being exercised at least a couple of times a day (I'll circle back and quantify that properly when I'm in front of a computer). Practically speaking, this patch is what's making the set-up usable. Ideas on what information would be useful to log before restarting the provider would be welcome, especially if we can get visibility into the ffmpeg side of things. In a couple of places I saw 1s sleeps with comments about avoiding races, which drew my attention, but I haven't had time to reason properly about their implications. I've managed to get into states where multiple librespot processes reference the same named pipe, which is definitely problematic, but I haven't pinpointed the proximate cause. My gut feeling is that some of the problems relate to how we handle switching players or the player changing state. My set-up is still physically being built out and is pretty heterogeneous (what old/cheap speakers can I hook up to an esp32 in this space kind of thing). The churn in that kind of environment is non-trivial. I'm hoping to make some time over the weekend to take more of a look at this, so any thoughts/pointers would be great 😁 |
|
And as if I scripted it, this morning when I tried to start playback Spotify was pausing after 1-2s of playback and this is what I see on the server: Here are the MA logs: ma.log Edit: I ran the below to kill all of the |
|
I think we need to go about this differently. Those broken pipes are the result of something else going wrong. Simply restarting the provider feels hacky and does not address the root cause. We need to figure out what is causing these hanging ffmpeg processes / multiple ffmpeg processes referencing the same named pipe. I have a hunch this could have something to do with automatic switching of players. Can you try to figure out what triggers these states? Then we could solve that and prevent broken pipe errors in the first place. |
|
Marking this PR as draft so we can keep track of which PRs need our attention. Please mark as 'Ready for review' when you want us to have another look 🙏 |
|
I'll do my best to get back to this. That said, when I started trying to do multi-room more seriously my ESP32 S3s completely failed to stay in sync with the forced PCM on the Spotify Connect plug-in. As such, I moved to the Spotify plugin and figured out DNS / Traefik so that my wife could be persuaded off the Spotify app. |
Description
This PR fixes a "Zombie Process" issue where the librespot binary remains running but non-functional after a Broken Pipe (os error 32) event.
The Fix:
I have extended the stderr processor to trap this specific fatal error. When detected, the provider now triggers a self-reload via self.mass.load_provider_config(self.config).
Motivation and Context
librespot handles output pipe errors by pausing rather than crashing ("Sticky Sink"). Music Assistant sees the process as healthy (PID exists), but the named pipe to FFmpeg is permanently broken.
librespot handles SIGTERM gracefully and exits with Code 0. Music Assistant interprets Exit 0 as a user-initiated stop and does not restart the provider.
load_provider_config forces a complete unload/load cycle, ensuring the broken pipe and zombie process are cleaned up and a fresh session is established.
Steps to Reproduce
docker top music_assistant | grep ffmpeg
kill <PID>
Verification
1. Original Failure Mode (Zombie State)
The error occurs, the player pauses, and remains unavailable indefinitely. The process does not exit.
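For readers unfamiliar with the underlying error, here is a small sketch of how a broken pipe arises on a POSIX system: a writer gets EPIPE once the reading end of a pipe has gone away. It uses an anonymous pipe from os.pipe() rather than the named pipe between librespot and ffmpeg, and the helper name write_to_closed_pipe is made up for illustration.

```python
import errno
import os


def write_to_closed_pipe() -> int:
    """Write to a pipe whose reader is gone and return the resulting errno."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # simulate the consumer (e.g. a killed reader) going away
    try:
        os.write(write_fd, b"audio bytes")
    except BrokenPipeError as err:
        # Python ignores SIGPIPE by default, so the failure surfaces
        # as an exception carrying errno EPIPE instead of killing us.
        return err.errno
    finally:
        os.close(write_fd)
    return 0


if __name__ == "__main__":
    code = write_to_closed_pipe()
    print(code == errno.EPIPE)
```

On Linux, errno.EPIPE is 32, which corresponds to the "os error 32" in the error message. The writer process itself stays alive after the failed write, which is consistent with the failure mode described here, where the process does not exit.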
2. With Fix (Provider reload)
The error is detected immediately, the provider reloads, and the player becomes available again.
Note: Playback stops when the pipe breaks. Once the provider reloads (approx. 15s), the user must manually re-select the device/press play to resume, but a container restart is no longer required.
Type of change