
Fix TX stall by correcting FIFO interrupt race and missed refill conditions #44

Open
lexbrugman wants to merge 1 commit into IndaloTech:main from lexbrugman:tx-freeze-fix

lexbrugman commented Apr 22, 2026

Description

This PR addresses a TX stall I’ve observed during longer runs on the Indalo-Tech RAMSES_ESP.

Problem

TX can occasionally stall mid-frame, leaving the system stuck in TX mode (blue LED remains on). In my setup (WiFi + MQTT enabled), this would typically occur within a few hours. When this happens, both TX and RX stop making progress.

Cause

The current TX FIFO refill logic depends on a GPIO interrupt (GDO0) to trigger further servicing. There are two related issues:

Startup ordering race
The interrupt is armed after TX FIFO priming. If priming crosses the FIFO threshold before the interrupt is enabled, the edge is missed and no refill is triggered.

Edge-only progress dependency
Even after startup, progress depends on observing a new edge. If:

  • the FIFO still requires refilling
  • but no new edge occurs

then no further refill is triggered and TX stalls.
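The startup ordering race can be illustrated with a small host-side simulation. This is only a toy model, not the driver's code: `sim_irq_t`, `sim_set_level`, `sim_prime_then_arm` and `sim_arm_then_prime` are hypothetical names, and the model assumes the event of interest is the line entering a "below threshold" state while the interrupt is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of an edge-triggered "refill needed" line. */
typedef struct {
    bool armed;           /* is the interrupt enabled? */
    bool below_threshold; /* current line level */
    int  pending;         /* refill events queued by the simulated ISR */
} sim_irq_t;

/* Update the line level. An event is queued only if the line newly
 * enters the "below threshold" state while the interrupt is armed. */
void sim_set_level(sim_irq_t *irq, bool below_threshold)
{
    assert(irq != NULL);
    bool edge = below_threshold && !irq->below_threshold;
    irq->below_threshold = below_threshold;
    if (edge && irq->armed) {
        irq->pending++;
    }
}

/* Ordering described under "Cause": the crossing happens during
 * priming, and the interrupt is armed only afterwards. */
int sim_prime_then_arm(void)
{
    sim_irq_t irq = {0};
    sim_set_level(&irq, true); /* crossing occurs, nobody is listening */
    irq.armed = true;
    return irq.pending;        /* the edge was lost */
}

/* Reordered sequence: arm first, so the same crossing is observed. */
int sim_arm_then_prime(void)
{
    sim_irq_t irq = {0};
    irq.armed = true;
    sim_set_level(&irq, true);
    return irq.pending;
}
```

In this model the first ordering ends with no queued refill event even though the line is at the "needs refill" level, which is the stall condition: nothing will ever produce another edge.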

Fix

This change addresses both issues:

Arm the interrupt before priming
Ensures that any threshold crossing during priming is not missed.

Add level-based follow-up servicing
After each refill step:

  • The current GDO0 level is checked
  • If the FIFO is still below threshold, a follow-up refill event is queued
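A rough sketch of the level-based follow-up, again as a host-side simulation rather than the actual change: the threshold and chunk sizes, the `sim_tx_t` struct and all function names are assumptions made for illustration. The FIFO does not drain in this model, so no new edge can ever occur, which isolates the effect of the follow-up check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_FIFO_THRESHOLD 32 /* hypothetical refill threshold, in bytes */
#define SIM_REFILL_CHUNK   16 /* hypothetical bytes written per refill step */

typedef struct {
    size_t fifo_level;    /* bytes currently in the simulated FIFO */
    size_t remaining;     /* frame bytes not yet written to the FIFO */
    int    queued_events; /* refill events waiting for the TX task */
} sim_tx_t;

/* Stand-in for reading the current GDO0 level. */
bool sim_below_threshold(const sim_tx_t *tx)
{
    return tx->fifo_level < SIM_FIFO_THRESHOLD;
}

/* One refill step. With level_followup set, the level is re-checked
 * afterwards and a follow-up event is queued if more data is needed. */
void sim_refill_step(sim_tx_t *tx, bool level_followup)
{
    assert(tx != NULL);
    size_t n = tx->remaining < SIM_REFILL_CHUNK ? tx->remaining
                                                : SIM_REFILL_CHUNK;
    tx->fifo_level += n;
    tx->remaining  -= n;
    if (level_followup && tx->remaining > 0 && sim_below_threshold(tx)) {
        tx->queued_events++;
    }
}

/* Process one initial event and whatever it queues, with no further
 * edges arriving. Returns the number of frame bytes left unwritten. */
size_t sim_run_without_edges(size_t frame_len, bool level_followup)
{
    sim_tx_t tx = { .fifo_level = 0, .remaining = frame_len,
                    .queued_events = 1 };
    while (tx.queued_events > 0) {
        tx.queued_events--;
        sim_refill_step(&tx, level_followup);
    }
    return tx.remaining;
}
```

For a 30-byte frame, the edge-only variant stops after one step with 14 bytes unwritten, while the follow-up variant keeps going until the frame is fully written. Once the FIFO reaches the threshold the follow-up stops queuing, since a later drain would produce a genuine edge.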

Add startup catch-up
After priming, the current level is checked and a synthetic refill event is queued if needed.

Together, these changes ensure that TX progress no longer depends solely on observing interrupt edges.
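The resulting startup sequence could look roughly like the following simulation. All names and the threshold value are hypothetical. The model assumes priming only adds bytes, so if priming ends with the FIFO still under the threshold, the line never transitions and only the catch-up check can queue a refill.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_THRESHOLD 32 /* hypothetical refill threshold, in bytes */

typedef struct {
    bool   irq_armed;     /* interrupt enabled? */
    size_t fifo_level;    /* bytes in the simulated FIFO */
    int    queued_events; /* refill events waiting for the TX task */
} sim_start_t;

/* Stand-in for reading the current line level after priming. */
bool sim_needs_refill(const sim_start_t *tx)
{
    return tx->fifo_level < SIM_THRESHOLD;
}

/* Start a transmission and return how many refill events are queued
 * once startup has finished. */
int sim_tx_start(size_t prime_bytes, bool catch_up)
{
    sim_start_t tx = {0};
    assert(prime_bytes <= 64);  /* keep the toy FIFO within a sane size */

    tx.irq_armed = true;        /* 1. arm the interrupt first */
    tx.fifo_level += prime_bytes; /* 2. prime; adding bytes creates no
                                   *    "fell below threshold" edge */
    if (catch_up && sim_needs_refill(&tx)) {
        tx.queued_events++;     /* 3. synthetic event from the level */
    }
    return tx.queued_events;
}
```

With 16 bytes primed, no event is queued unless the catch-up is enabled. With 48 bytes primed the catch-up queues nothing, because the FIFO is above the threshold and a later drain is expected to raise a real edge.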

Results

With this change:

  • I no longer observe TX stalls
  • The system runs stably for extended periods (3+ days)
  • Previously, stalls would occur reliably within a few hours

Related

I’m not entirely sure, but this commit may have been an attempt to address a similar issue: a50b2d1

After applying my fix I reverted that commit, and the system has been running without any regression since. I also tested this by interrupting WiFi and restarting the MQTT server at random times; the system consistently recovered using the ESP32 SDK’s built-in MQTT reconnection logic.

Notes

This change improves robustness within the current design while keeping the structure intact.

While working on this, it became clear that the TX path depends on a mix of interrupt edges and implicit assumptions about hardware state, which makes it somewhat fragile.

A more robust approach would be to move toward a fully explicit state machine where:

  • The ISR only signals "work available"
  • The task evaluates the actual hardware state
  • Progress is driven by level rather than edge history
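To make the idea concrete, here is a minimal host-side sketch of such a state machine. It is an illustration under assumptions, not a proposal for the actual code: the states, sizes, `sim_isr`, `sim_task_step`, `sim_radio_drain` and `sim_send_frame` are all hypothetical, and the radio is simulated by a function that removes bytes from the FIFO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_THRESHOLD 32 /* hypothetical refill threshold, in bytes */
#define SIM_CHUNK     16 /* hypothetical bytes written per refill */

typedef enum { SIM_TX_FILLING, SIM_TX_DRAINING, SIM_TX_DONE } sim_state_t;

typedef struct {
    sim_state_t   state;
    size_t        fifo_level;     /* simulated hardware state */
    size_t        remaining;      /* frame bytes not yet written */
    volatile bool work_available; /* the only thing the ISR touches */
} sim_sm_t;

/* The ISR carries no state: it only signals that the task should look. */
void sim_isr(sim_sm_t *sm) { sm->work_available = true; }

/* Simulated radio: consumes up to n bytes from the FIFO. */
void sim_radio_drain(sim_sm_t *sm, size_t n)
{
    sm->fifo_level -= (n < sm->fifo_level) ? n : sm->fifo_level;
}

/* One task step, decided from the current level rather than from which
 * edges were seen. Returns true if another step should run right away. */
bool sim_task_step(sim_sm_t *sm)
{
    assert(sm != NULL);
    sm->work_available = false;
    switch (sm->state) {
    case SIM_TX_FILLING:
        if (sm->remaining == 0) {
            sm->state = SIM_TX_DRAINING;
            return true;
        }
        if (sm->fifo_level < SIM_THRESHOLD) {
            size_t n = sm->remaining < SIM_CHUNK ? sm->remaining : SIM_CHUNK;
            sm->fifo_level += n;
            sm->remaining  -= n;
            return true;  /* re-evaluate the level immediately */
        }
        return false;     /* enough queued: wait for the next signal */
    case SIM_TX_DRAINING:
        if (sm->fifo_level == 0) {
            sm->state = SIM_TX_DONE;
        }
        return false;
    default:
        return false;
    }
}

/* Drive a whole frame through the model and return the final state. */
sim_state_t sim_send_frame(size_t frame_len)
{
    sim_sm_t sm = { SIM_TX_FILLING, 0, frame_len, true };
    for (int guard = 0; sm.state != SIM_TX_DONE && guard < 10000; guard++) {
        if (sm.work_available) {
            while (sim_task_step(&sm)) { /* run until nothing to do */ }
        }
        sim_radio_drain(&sm, 8);
        sim_isr(&sm);
    }
    return sm.state;
}
```

Because each step re-reads the FIFO level, a spurious or duplicated signal is harmless in this model: the task simply finds nothing to do and returns.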

I’d be happy to explore that further if you’re open to additional PRs, but wanted to propose this minimal fix first.

lexbrugman changed the title from "Fix TX stall by ensuring FIFO refill continues when interrupt edge is missed" to "Fix TX stall by correcting FIFO interrupt race and missed refill conditions" on Apr 22, 2026