monitor: fix report_lsn deadlock when quorum nodes are permanently unavailable#1114

Open
nmcc1212 wants to merge 1 commit into hapostgres:main from nmcc1212:patch-1

@nmcc1212

Problem

When a multi-node cluster loses several nodes at once, including the primary,
the remaining healthy standbys can get permanently stuck in report_lsn. There
is no automatic recovery: the only way out is to run pg_autoctl drop node
manually. This affects any topology where a replication-quorum node becomes
permanently unreachable without first reporting its LSN to the monitor.

Reported in #858 and #1113.

Root cause

BuildCandidateList incremented missingNodesCount for every quorum node
that was unhealthy and not reporting — treating a permanently dead node the
same as a transiently slow one. Because ProceedGroupStateForMSFailover
refuses to proceed while missingNodesCount > 0, the failover would block
forever once any quorum node disappeared without reporting.

Solution

Separate the two kinds of "missing" nodes with a new deadMissingNodesCount
field on CandidateList:

| Counter | Meaning |
| --- | --- |
| missingNodesCount | Alive nodes assigned report_lsn that haven't reported yet; we wait for these |
| deadMissingNodesCount | Quorum nodes that are unreachable (unhealthy + not reporting) and whose LSN we can never recover |
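
As a rough illustration of how the two counters differ, here is a minimal sketch. It does not use the monitor's real data structures: NodeSketch, CandidateCountsSketch, CountMissingNodes and all their fields are hypothetical names invented for this example, and "alive" is simplified to a single health flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a node; NOT the monitor's real struct. */
typedef struct NodeSketch
{
	bool inQuorum;          /* node participates in the replication quorum */
	bool isHealthy;         /* node is currently reachable (assumed flag) */
	bool hasReportedLSN;    /* node already reported its LSN for this failover */
} NodeSketch;

/* Hypothetical stand-in for the two counters described in the table. */
typedef struct CandidateCountsSketch
{
	int missingNodesCount;      /* alive, LSN still expected: wait for these */
	int deadMissingNodesCount;  /* unreachable quorum nodes: LSN never arrives */
} CandidateCountsSketch;

/* Classify every node that has not reported its LSN into one of the counters. */
static CandidateCountsSketch
CountMissingNodes(const NodeSketch *nodes, size_t nodeCount)
{
	CandidateCountsSketch counts = { 0, 0 };

	for (size_t i = 0; i < nodeCount; i++)
	{
		if (nodes[i].hasReportedLSN)
		{
			/* nothing is missing for this node */
			continue;
		}

		if (nodes[i].isHealthy)
		{
			/* transiently slow: the report should still arrive */
			counts.missingNodesCount++;
		}
		else if (nodes[i].inQuorum)
		{
			/* unhealthy and not reporting: counted separately */
			counts.deadMissingNodesCount++;
		}
	}

	return counts;
}
```

The point of the split is that only missingNodesCount represents work the monitor can still wait for; the dead counter is handled by a separate check.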

After all alive nodes have reported, the pigeonhole principle determines
whether automatic failover is safe:

  • deadMissingNodesCount < number_sync_standbys → proceed automatically.
    Every acknowledged commit was confirmed by at least number_sync_standbys
    standbys. With fewer than that many dead, at least one alive candidate
    received each acknowledged commit, so the most-advanced alive node holds
    all of them and promoting it is data-safe.

  • deadMissingNodesCount >= number_sync_standbys → block and emit a clear
    log message directing the operator to run pg_autoctl drop node for each
    unreachable node. The dead nodes may be the only ones that confirmed the
    latest commits, so data safety cannot be guaranteed automatically.

  • number_sync_standbys = 0 → always proceed, regardless of
    deadMissingNodesCount; no synchronous-replication guarantee was ever in
    effect.
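
The three rules above, together with the existing missingNodesCount guard, can be sketched as a single decision function. This is only an illustration under stated assumptions: FailoverDecisionSketch and DecideFailover are hypothetical names, not code from group_state_machine.c, and the real check lives inside ProceedGroupStateForMSFailover.

```c
#include <stdbool.h>

/* Hypothetical outcome type, introduced only for this sketch. */
typedef enum FailoverDecisionSketch
{
	FAILOVER_WAIT,      /* alive nodes still have to report their LSN */
	FAILOVER_PROCEED,   /* safe to promote the most-advanced alive node */
	FAILOVER_BLOCK      /* operator must drop the unreachable nodes */
} FailoverDecisionSketch;

/*
 * Hypothetical helper mirroring the rules in the description; the parameter
 * numberSyncStandbys stands for the number_sync_standbys setting.
 */
static FailoverDecisionSketch
DecideFailover(int missingNodesCount,
			   int deadMissingNodesCount,
			   int numberSyncStandbys)
{
	/* existing guard: keep waiting while alive nodes have not reported */
	if (missingNodesCount > 0)
	{
		return FAILOVER_WAIT;
	}

	/* no synchronous-replication guarantee was in effect: always proceed */
	if (numberSyncStandbys == 0)
	{
		return FAILOVER_PROCEED;
	}

	/* pigeonhole: some alive node must have confirmed every acked commit */
	if (deadMissingNodesCount < numberSyncStandbys)
	{
		return FAILOVER_PROCEED;
	}

	/* the dead nodes could be the only ones holding the latest commits */
	return FAILOVER_BLOCK;
}
```

For example, with number_sync_standbys = 1 and one dead quorum node the sketch returns FAILOVER_BLOCK, while with number_sync_standbys = 2 and the same single dead node it returns FAILOVER_PROCEED.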

Changes

  • src/monitor/group_state_machine.c
    • Add deadMissingNodesCount to CandidateList
    • BuildCandidateList: increment deadMissingNodesCount for unreachable quorum nodes
    • ProceedGroupStateForMSFailover: apply the pigeonhole check after the
      missingNodesCount guard; log a message when automatic failover
      cannot proceed
