association: give up joining after repeated max-backoff failures #166
Merged
After MARI_BACKOFF_MAX_STREAK (3) consecutive join failures while already pinned at the max backoff window, give up and rescan instead of waiting out the 5 s wall-clock guard (kept as a backstop). Measured in attempts so it adapts to slotframe size; ~2 s worst case on the huge schedule. AI-assisted: Claude Opus 4.8

Nodes that failed to join (shared-uplink collisions, or a link/fade problem)
kept retrying the same gateway until the 5 s wall-clock timeout, which left a
long formation tail: a few stragglers taking 5-6 s while the bulk joined under 3 s.
This adds an attempt-based give-up. The node counts consecutive join failures
while already pinned at the max backoff window, and after MARI_BACKOFF_MAX_STREAK
(3) of them it gives up and rescans instead of waiting out the timeout. The 5 s
timeout stays as a wall-clock backstop. Because the give-up is counted in
attempts, it adapts to slotframe size automatically (unlike the fixed 5 s
timeout); worst case is about 2 s on the huge schedule. The earlier rescan lets a straggler re-sync on a
fresh gateway/channel rather than hammering a possibly-faded link, which pairs
with the beacon/scan channel hopping already on develop.
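
A minimal sketch of the streak counting described above, for illustration only. It is not the code from this PR: apart from MARI_BACKOFF_MAX_STREAK, every name here (join_backoff_t, BACKOFF_N_MAX, join_backoff_on_failure, join_backoff_reset) is hypothetical, and the exponent-based backoff window is an assumption about how the window grows.

```c
#include <stdbool.h>
#include <stdint.h>

#define MARI_BACKOFF_MAX_STREAK 3  // value given in the PR description
#define BACKOFF_N_MAX 5            // hypothetical: largest backoff exponent

// Hypothetical per-node join state.
typedef struct {
    uint8_t backoff_n;   // current backoff exponent (assumed: window grows with n)
    uint8_t max_streak;  // consecutive failures seen while already at BACKOFF_N_MAX
} join_backoff_t;

// Call on every failed join attempt.
// Returns true when the node should give up on this gateway and rescan.
static bool join_backoff_on_failure(join_backoff_t *b) {
    if (b->backoff_n < BACKOFF_N_MAX) {
        // Window still growing: this failure does not count toward the streak.
        b->backoff_n++;
        b->max_streak = 0;
        return false;
    }
    // The node was already pinned at the max window when this attempt failed.
    b->max_streak++;
    return b->max_streak >= MARI_BACKOFF_MAX_STREAK;
}

// Call on a successful join or when starting a fresh scan.
static void join_backoff_reset(join_backoff_t *b) {
    b->backoff_n = 0;
    b->max_streak = 0;
}
```

In this sketch the caller would still keep the 5 s wall-clock check as a separate backstop; the streak counter only shortens the common case.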
Validated on a 100-node join storm, where the tail tightens noticeably: p100 drops to
~3.8 s (max 4.8 s) from ~5-6 s, with p95 ~3.0 s and a much smaller spread. A
small p95 cost buys a far more predictable worst case.
MARI_BACKOFF_MAX_STREAK is the tuning knob: set too low it risks premature
rescans under normal join-storm congestion; 3 held up well in testing.
Also included: a testbed LED-color mapping so the current nRF5340-DK gateway
lights the node green (same known-gateway convention already in board.c).