Fix TX stall by correcting FIFO interrupt race and missed refill conditions #44
Open
lexbrugman wants to merge 1 commit into
Description
This PR addresses a TX stall issue I’ve observed during longer runs on the Indalo-Tech RAMSES_ESP.
Problem
TX can occasionally stall mid-frame, leaving the system stuck in TX mode (blue LED remains on). In my setup (WiFi + MQTT enabled), this would typically occur within a few hours. When this happens, both TX and RX stop making progress.
Cause
The current TX FIFO refill logic depends on a GPIO interrupt (GDO0) to trigger further servicing. There are two related issues:
Startup ordering race
The interrupt is armed after TX FIFO priming. If priming crosses the FIFO threshold before the interrupt is enabled, the edge is missed and no refill is triggered.
Edge-only progress dependency
Even after startup, progress depends on observing a new edge. If:
then no further refill is triggered and TX stalls.
Fix
This change addresses both issues:
Arm the interrupt before priming
Ensures that any threshold crossing during priming is not missed.
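To illustrate why the ordering matters, here is a minimal host-side sketch, not the driver code from this PR. It assumes a simplified model in which a threshold signal toggles when the FIFO fill level crosses a threshold and an edge is only latched while the interrupt is armed. All names (`sim_radio_t`, `sim_fifo_write`, `start_prime_then_arm`, `start_arm_then_prime`) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a radio whose threshold signal toggles when the
 * TX FIFO fill level crosses a threshold. Not the driver's real API. */
typedef struct {
    size_t fill;         /* bytes currently in the simulated TX FIFO */
    size_t threshold;    /* fill level at which the signal toggles */
    bool   irq_armed;    /* is the edge interrupt enabled? */
    bool   edge_latched; /* set when an edge occurs while armed */
} sim_radio_t;

/* Level of the simulated threshold signal. */
static bool sim_signal(const sim_radio_t *r)
{
    return r->fill >= r->threshold;
}

/* Write n bytes; an edge is latched only if the interrupt is armed
 * at the moment the signal changes. */
static void sim_fifo_write(sim_radio_t *r, size_t n)
{
    bool before = sim_signal(r);
    r->fill += n;
    if (sim_signal(r) != before && r->irq_armed) {
        r->edge_latched = true;
    }
}

/* Old ordering: prime first, arm afterwards. Returns true if the
 * threshold crossing was observed. */
static bool start_prime_then_arm(size_t prime_bytes, size_t threshold)
{
    sim_radio_t r = { .fill = 0, .threshold = threshold };
    sim_fifo_write(&r, prime_bytes); /* crossing happens while unarmed */
    r.irq_armed = true;              /* too late: the edge is already gone */
    return r.edge_latched;
}

/* Fixed ordering: arm first, then prime. */
static bool start_arm_then_prime(size_t prime_bytes, size_t threshold)
{
    sim_radio_t r = { .fill = 0, .threshold = threshold };
    r.irq_armed = true;
    sim_fifo_write(&r, prime_bytes); /* crossing is latched */
    return r.edge_latched;
}
```

In this model, priming 16 bytes against a threshold of 8 is seen only by `start_arm_then_prime`; with the old ordering the crossing goes unnoticed.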
Add level-based follow-up servicing
After each refill step:
Add startup catch-up
After priming, the current level is checked and a synthetic refill event is queued if needed.
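The catch-up step could look roughly like the sketch below. This is an illustration under assumptions rather than the code in this PR: the context struct, the signal polarity (`refill_level_active` being true when the FIFO wants more data), and the helper names are all hypothetical, and the event queue is reduced to a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical TX start-up context; field names are illustrative only. */
typedef struct {
    bool   refill_level_active;  /* sampled level: true = FIFO wants data
                                    (assumed polarity) */
    size_t frame_bytes_left;     /* frame bytes not yet written to the FIFO */
    size_t queued_refill_events; /* stand-in for the real event queue */
} tx_start_ctx_t;

/* Stand-in for posting a refill event to the TX task. */
static void queue_refill_event(tx_start_ctx_t *ctx)
{
    ctx->queued_refill_events++;
}

/* Called once after priming: if the level already indicates that a refill
 * is due and there is data left to send, queue a synthetic event instead
 * of waiting for an edge that may never arrive.
 * Returns true if an event was queued. */
static bool tx_startup_catch_up(tx_start_ctx_t *ctx)
{
    if (ctx->refill_level_active && ctx->frame_bytes_left > 0) {
        queue_refill_event(ctx);
        return true;
    }
    return false;
}
```

A synthetic event like this is only safe if the refill handler re-checks the actual state before writing, so that a redundant event is harmless.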
Together, these changes ensure that TX progress no longer depends solely on observing interrupt edges.
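As a rough illustration of the difference between edge-only and level-based servicing, the following host-side sketch models a refill handler in both styles. It is a simplified model under stated assumptions, not the PR's implementation: the FIFO capacity, threshold, chunk size, and all function names are hypothetical, and draining by the radio is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sizes chosen for illustration only. */
enum { SIM_FIFO_CAPACITY = 16, SIM_FIFO_THRESHOLD = 8, SIM_CHUNK = 4 };

typedef struct {
    size_t fifo_fill; /* bytes in the simulated TX FIFO */
    size_t remaining; /* frame bytes not yet written */
    size_t steps;     /* number of refill steps performed */
} tx_state_t;

/* Level check: data is left and the FIFO is still below the threshold. */
static bool needs_refill(const tx_state_t *s)
{
    return s->remaining > 0 && s->fifo_fill < SIM_FIFO_THRESHOLD;
}

/* Write at most one chunk, limited by remaining data and FIFO room. */
static void refill_step(tx_state_t *s)
{
    size_t room = SIM_FIFO_CAPACITY - s->fifo_fill;
    size_t n = s->remaining < SIM_CHUNK ? s->remaining : SIM_CHUNK;
    if (n > room) {
        n = room;
    }
    s->fifo_fill += n;
    s->remaining -= n;
    s->steps++;
}

/* Edge-only servicing: one step per observed edge, then wait. */
static void service_edge_only(tx_state_t *s)
{
    refill_step(s);
}

/* Level-based follow-up: after each step, re-check the level and keep
 * going while a refill is still due. */
static void service_level_based(tx_state_t *s)
{
    do {
        refill_step(s);
    } while (needs_refill(s));
}
```

Starting from an empty FIFO with 20 bytes left, `service_edge_only` leaves the fill at 4, still below the threshold, so in this model no later crossing can produce a new edge. `service_level_based` continues until the fill reaches the threshold, after which a later drain can produce an edge again.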
Results
With this change:
Related
I’m not entirely sure, but this commit may have been an attempt to address a similar issue: a50b2d1
After applying my fix I reverted that commit, and the result has been running without any regression. I also tested this by interrupting WiFi and restarting the MQTT server at random; the system consistently recovered correctly using the ESP32 SDK’s built-in MQTT reconnection logic.
Notes
This change improves robustness within the current design while keeping the structure intact.
While working on this, it became clear that the TX path depends on a mix of interrupt edges and implicit assumptions about hardware state, which makes it somewhat fragile.
A more robust approach would be to move toward a fully explicit state machine where:
I’d be happy to explore that further if you’re open to additional PRs, but wanted to propose this minimal fix first.