peer+lnd: add new CLI option to control if we D/C on slow pongs #9801

Roasbeef merged 2 commits into lightningnetwork:master
Conversation
Force-pushed from eeb7897 to 8c23404
tACK, this PR improves connection reliability with my peers compared to just

Some slight weirdness that I figured would be good to document:

- Pong size did not match expected size:
- Back-to-back repeated Pongs (remote is CLN):
guggero left a comment
I think the flag could be useful!
I propose adding a connection-failure threshold (N) which only disconnects after several failed ping/pongs in a row, rather than just adding a connection flag. Wdyt?
I had also considered making it a sort of peer-level healthcheck, to inherit that threshold logic, but instead went in this direction as I started to second-guess the design rationale for disconnecting in the first place. I think if we add a threshold flag, then we'd also want to add a flag to tune what the timeout value should be. When I started to run this on my node (even w/ the super prio queue), I noticed some nodes that were just persistently slow in replying. Ultimately, slow nodes do affect payment latency e2e. There's a credible design direction here where we start to factor it in at the first-hop level, but then also have the link sample the ping RTT of a peer and decide if the link is even eligible to send based on that.
ellemouton left a comment
the last commit does not change existing behaviour afaict
Force-pushed from 3c75628 to f0fcffa
The latest version inverts the original PR: the default behavior stays the same, but users now have an option to turn off the disconnect behavior.
Force-pushed from 1f498d3 to d6d25a9
guggero left a comment
Pushed up a fix for the sample config and also made the behavior of return vs. continue in the ping manager clearer.
With that, LGTM 🎉
ellemouton left a comment
just one comment - otherwise lgtm!
In this commit, we add a new CLI option to control whether we disconnect on slow pongs. Due to the existence of head-of-line blocking at various levels of abstraction (app buffer, slow processing, TCP kernel buffers, etc.), if there's a flurry of gossip messages (e.g. 1K channel updates), then even with reasonable processing latency, a peer may still not read our ping in time. To give users another option, we add a flag that allows them to disable this behavior. The default remains unchanged.
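As a rough illustration of what the commit describes, the disconnect decision could be gated on the new option like this. Only the `NoDisconnectOnPongFailure` field name comes from the PR; `handlePongFailure` and the log format are hypothetical sketches:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config mirrors the shape of the new option: when
// NoDisconnectOnPongFailure is true, a slow or mismatched pong is
// only logged instead of tearing down the connection.
type Config struct {
	NoDisconnectOnPongFailure bool
}

// handlePongFailure decides what to do when a pong times out or does
// not match. It returns true if the caller should disconnect the peer.
func handlePongFailure(cfg Config, peer string, err error,
	lastRTT time.Duration) bool {

	if cfg.NoDisconnectOnPongFailure {
		// Opt-in behavior: log and carry on.
		fmt.Printf("peer %s: pong failure (%v), last RTT %v; "+
			"not disconnecting\n", peer, err, lastRTT)
		return false
	}

	// Default behavior: disconnect, as before.
	fmt.Printf("peer %s: pong failure (%v); disconnecting\n", peer, err)
	return true
}

func main() {
	err := errors.New("pong response timeout")
	fmt.Println(handlePongFailure(Config{}, "peerA", err, 0))
	fmt.Println(handlePongFailure(
		Config{NoDisconnectOnPongFailure: true}, "peerB", err,
		120*time.Millisecond))
}
```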
```go
// If NoDisconnectOnPongFailure is true, we don't
// disconnect. Otherwise (if it's false, the default),
// we disconnect.
```
Nit: the comment can be removed; it just describes the code.
```go
// getLastRTT safely retrieves the last known RTT, returning 0 if none exists.
func (m *PingManager) getLastRTT() time.Duration {
```
Nit: use fn.Option here as well, similar to pendingPingWait below?
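For readers unfamiliar with the suggestion: lnd's `fn` package provides an `Option` type that makes "no value yet" explicit instead of overloading the zero value. Below is a minimal standalone stand-in written for this sketch, not the real `github.com/lightningnetwork/lnd/fn` package:

```go
package main

import (
	"fmt"
	"time"
)

// Option is a minimal stand-in for an optional value, shown here so
// the sketch compiles without external dependencies.
type Option[T any] struct {
	value T
	ok    bool
}

// Some wraps a present value; None represents the absence of one.
func Some[T any](v T) Option[T] { return Option[T]{value: v, ok: true} }
func None[T any]() Option[T]    { return Option[T]{} }

// UnwrapOr returns the contained value, or the fallback if none is set.
func (o Option[T]) UnwrapOr(fallback T) T {
	if o.ok {
		return o.value
	}
	return fallback
}

func main() {
	// With an Option, "no RTT measured yet" is explicit rather
	// than encoded as a zero duration.
	lastRTT := None[time.Duration]()
	fmt.Println(lastRTT.UnwrapOr(0)) // 0s

	lastRTT = Some(42 * time.Millisecond)
	fmt.Println(lastRTT.UnwrapOr(0)) // 42ms
}
```

The upside over returning a bare `0` is that callers can no longer confuse "never measured" with a genuine zero RTT.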
```go
	close(pingSent)
})
},
OnPongFailure: func(err error,
```
Nit: maybe this should also be protected by a sync.Once, since it closes a channel?
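The pattern the nit suggests: a second `close` on an already-closed channel panics in Go, so guarding the close with a `sync.Once` makes concurrent or repeated failure callbacks safe. A minimal sketch with illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

// failureSignal guards a channel close with sync.Once so that
// concurrent or repeated failure callbacks cannot close the same
// channel twice (a second close would panic).
type failureSignal struct {
	once sync.Once
	done chan struct{}
}

func newFailureSignal() *failureSignal {
	return &failureSignal{done: make(chan struct{})}
}

// signal closes the channel exactly once, no matter how many callers race.
func (s *failureSignal) signal() {
	s.once.Do(func() {
		close(s.done)
	})
}

func main() {
	s := newFailureSignal()

	// Two callbacks firing for the same failure: without the Once,
	// the second close(s.done) would panic.
	s.signal()
	s.signal()

	<-s.done // channel is closed, so this receive returns immediately
	fmt.Println("signaled once, safely")
}
```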
In this commit, we add a new CLI option to control if we D/C on slow pongs or not. Due to the existence of head-of-the-line blocking at various levels of abstraction (app buffer, slow processing, TCP kernel buffers, etc), if there's a flurry of gossip messages (eg: 1K channel updates), then even with a reasonable processing latency, a peer may still not read our ping in time.
To combat this, we change the default behavior to just logging for slow pongs, and add a new CLI option to re-enable the old behavior.
Along the way, we also add some enhanced logging, so we can tell when the last successful ping was and when the deadline was reached.