monitor: fix report_lsn deadlock when quorum nodes are permanently unavailable #1114
Open
nmcc1212 wants to merge 1 commit into hapostgres:main from
Problem
When a multi-node cluster loses several nodes, including the primary, the
remaining healthy standbys can get permanently stuck in `report_lsn` with no
automatic recovery, requiring manual `pg_autoctl drop node` intervention to
escape. This affects any topology where a replication-quorum node becomes
permanently unreachable without first reporting its LSN to the monitor.
Reported in #858 and #1113.
Root cause
`BuildCandidateList` incremented `missingNodesCount` for every quorum node
that was unhealthy and not reporting, treating a permanently dead node the
same as a transiently slow one. Because `ProceedGroupStateForMSFailover`
refuses to proceed while `missingNodesCount > 0`, the failover would block
forever once any quorum node disappeared without reporting.
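To make the failure mode concrete, here is a minimal sketch of the counting rule as described above. The `Node` struct, its fields, and both function names are simplified stand-ins invented for illustration; they are not the actual pg_auto_failover types or code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a node (not the real monitor struct). */
typedef struct Node
{
    bool replicationQuorum; /* participates in the replication quorum */
    bool healthy;           /* the monitor currently considers it reachable */
    bool reportedLSN;       /* has reported its LSN for this failover */
} Node;

/* Pre-fix rule: every unhealthy, non-reporting quorum node counts as missing. */
int
count_missing_nodes_old(const Node *nodes, size_t count)
{
    int missingNodesCount = 0;

    assert(nodes != NULL || count == 0);

    for (size_t i = 0; i < count; i++)
    {
        if (nodes[i].replicationQuorum &&
            !nodes[i].healthy &&
            !nodes[i].reportedLSN)
        {
            missingNodesCount++;
        }
    }

    return missingNodesCount;
}

/* The failover only proceeds when no node is counted as missing. */
bool
can_proceed_old(const Node *nodes, size_t count)
{
    return count_missing_nodes_old(nodes, count) == 0;
}
```

With one standby that has reported and one quorum node that died before reporting, `can_proceed_old` returns false on every evaluation, because nothing ever changes the dead node's state.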
Solution
Separate the two kinds of "missing" nodes with a new `deadMissingNodesCount`
field on `CandidateList`:

- `missingNodesCount`: nodes in `report_lsn` that haven't reported yet; we wait for these
- `deadMissingNodesCount`: unreachable quorum nodes that have not reported

After all alive nodes have reported, the pigeonhole principle determines
whether automatic failover is safe:
- `deadMissingNodesCount < number_sync_standbys` → proceed automatically.
  Every acknowledged commit required `number_sync_standbys` standbys to
  confirm it. With fewer than that many dead, at least one alive candidate
  must hold the most recent commit, so promoting the most-advanced alive node
  is data-safe.
- `deadMissingNodesCount >= number_sync_standbys` → block and emit a clear
  log message directing the operator to run `pg_autoctl drop node` for each
  unreachable node. Data-safety cannot be automatically guaranteed.
- `number_sync_standbys = 0` → always proceed; no synchronous-replication
  guarantee was ever in effect.
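The three rules above can be sketched as a single decision function. This is an illustration under stated assumptions, not the code in this PR: `FailoverDecision`, `CandidateCounts`, and `decide_failover` are hypothetical names, and the ordering (wait for alive nodes first, then the `number_sync_standbys = 0` case, then the pigeonhole comparison) is one reading of the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of the check (not an actual monitor type). */
typedef enum FailoverDecision
{
    FAILOVER_WAIT,    /* alive nodes still have to report their LSN */
    FAILOVER_PROCEED, /* promote the most advanced alive candidate */
    FAILOVER_BLOCKED  /* operator has to run pg_autoctl drop node */
} FailoverDecision;

/* Hypothetical stand-in for the two counters kept on CandidateList. */
typedef struct CandidateCounts
{
    int missingNodesCount;     /* alive, not reported yet */
    int deadMissingNodesCount; /* unreachable quorum nodes */
} CandidateCounts;

FailoverDecision
decide_failover(const CandidateCounts *counts, int number_sync_standbys)
{
    assert(counts != NULL);
    assert(number_sync_standbys >= 0);

    /* Existing guard: keep waiting for nodes that can still report. */
    if (counts->missingNodesCount > 0)
    {
        return FAILOVER_WAIT;
    }

    /* No synchronous-replication guarantee was in effect: always proceed. */
    if (number_sync_standbys == 0)
    {
        return FAILOVER_PROCEED;
    }

    /*
     * Pigeonhole check: each acknowledged commit was confirmed by
     * number_sync_standbys standbys, so with fewer dead nodes than that,
     * at least one alive node holds the latest acknowledged commit.
     */
    if (counts->deadMissingNodesCount < number_sync_standbys)
    {
        return FAILOVER_PROCEED;
    }

    return FAILOVER_BLOCKED;
}
```

The zero case is tested before the comparison because `deadMissingNodesCount >= 0` always holds, so the second rule on its own would block a cluster that never asked for synchronous guarantees.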
Changes
src/monitor/group_state_machine.c

- add `deadMissingNodesCount` to `CandidateList`
- `BuildCandidateList`: increment `deadMissingNodesCount` for unreachable quorum nodes
- `ProceedGroupStateForMSFailover`: apply pigeonhole check after the `missingNodesCount` guard; log message when automatic failover cannot proceed
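As a rough picture of the `BuildCandidateList` change, the sketch below splits non-reporting quorum nodes into the two counters. Everything here is assumed for illustration: the `Node` fields, the `MissingCounts` struct, the function name, and in particular the use of a single `healthy` flag to tell an alive node from an unreachable one. The real function works on the monitor's own node records.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a node (not the real monitor struct). */
typedef struct Node
{
    bool replicationQuorum; /* participates in the replication quorum */
    bool healthy;           /* the monitor currently considers it reachable */
    bool reportedLSN;       /* has reported its LSN for this failover */
} Node;

/* Hypothetical stand-in for the two counters kept on CandidateList. */
typedef struct MissingCounts
{
    int missingNodesCount;     /* alive, expected to report: wait for these */
    int deadMissingNodesCount; /* unreachable: handled by the pigeonhole check */
} MissingCounts;

MissingCounts
classify_missing_nodes(const Node *nodes, size_t count)
{
    MissingCounts counts = { 0, 0 };

    assert(nodes != NULL || count == 0);

    for (size_t i = 0; i < count; i++)
    {
        const Node *node = &nodes[i];

        /* Only quorum nodes that have not reported are "missing" at all. */
        if (!node->replicationQuorum || node->reportedLSN)
        {
            continue;
        }

        if (node->healthy)
        {
            counts.missingNodesCount++;
        }
        else
        {
            counts.deadMissingNodesCount++;
        }
    }

    return counts;
}
```

In this reading, a dead node no longer keeps `missingNodesCount` above zero, so the existing guard clears once every alive node has reported.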