monitor: fix report_lsn deadlock when quorum nodes are permanently unavailable by nmcc1212 · Pull Request #1114 · hapostgres/pg_auto_failover

nmcc1212 · 2026-02-24T17:03:53Z

Problem

When a multi-node cluster loses several nodes at once, including the
primary, the remaining healthy standbys can get stuck in report_lsn
permanently. There is no automatic recovery: the only way out is to run
pg_autoctl drop node manually. This affects any topology where a
replication-quorum node becomes permanently unreachable without first
reporting its LSN to the monitor.

Reported in #858 and #1113.

Root cause

BuildCandidateList incremented missingNodesCount for every quorum node
that was unhealthy and not reporting — treating a permanently dead node the
same as a transiently slow one. Because ProceedGroupStateForMSFailover
refuses to proceed while missingNodesCount > 0, the failover would block
forever once any quorum node disappeared without reporting.
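
To make the pre-fix behavior concrete, here is a minimal sketch of the counting logic described above. It is not the monitor's actual code: the PreFixNodeSketch type and both function names are hypothetical, and the sketch assumes that any quorum node that has not reported its LSN counts as missing, whatever its health.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node record; not the monitor's real node type. */
typedef struct PreFixNodeSketch
{
    bool replicationQuorum; /* node participates in the replication quorum */
    bool isHealthy;         /* monitor health checks currently pass */
    bool hasReportedLSN;    /* node already reported its LSN for this failover */
} PreFixNodeSketch;

/*
 * Pre-fix counting (assumed shape): a dead node and a slow node both land
 * in the same counter, because health is never consulted.
 */
int
prefix_missing_nodes_count(const PreFixNodeSketch *nodes, size_t nodeCount)
{
    int missingNodesCount = 0;

    assert(nodes != NULL || nodeCount == 0);

    for (size_t i = 0; i < nodeCount; i++)
    {
        if (nodes[i].replicationQuorum && !nodes[i].hasReportedLSN)
        {
            missingNodesCount++;
        }
    }

    return missingNodesCount;
}

/* The failover only proceeds once nothing is counted as missing. */
bool
prefix_failover_can_proceed(const PreFixNodeSketch *nodes, size_t nodeCount)
{
    return prefix_missing_nodes_count(nodes, nodeCount) == 0;
}
```

With one standby that has reported and one quorum node that is down and will never report, prefix_failover_can_proceed keeps returning false no matter how long the monitor waits, which is the deadlock.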

Solution

Separate the two kinds of "missing" nodes with a new deadMissingNodesCount
field on CandidateList:

| Counter | Meaning |
| --- | --- |
| `missingNodesCount` | Alive nodes assigned `report_lsn` that haven't reported yet; we wait for these |
| `deadMissingNodesCount` | Quorum nodes that are unreachable (unhealthy + not reporting) and whose LSN we can never recover |
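
The split between the two counters can be sketched roughly as follows. This is an illustration, not the patch itself: QuorumNodeSketch, CandidateCountsSketch and count_missing_nodes_sketch are hypothetical names, the struct holds only the two counters rather than the real CandidateList fields, and the sketch assumes that only quorum nodes are considered for either counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node record; the real monitor tracks much more. */
typedef struct QuorumNodeSketch
{
    bool replicationQuorum; /* node participates in the replication quorum */
    bool isHealthy;         /* monitor health checks currently pass */
    bool hasReportedLSN;    /* node already reported its LSN for this failover */
} QuorumNodeSketch;

/* Only the two counters discussed here; not the real CandidateList layout. */
typedef struct CandidateCountsSketch
{
    int missingNodesCount;     /* alive, not reported yet: worth waiting for */
    int deadMissingNodesCount; /* unhealthy and not reporting: LSN is lost */
} CandidateCountsSketch;

CandidateCountsSketch
count_missing_nodes_sketch(const QuorumNodeSketch *nodes, size_t nodeCount)
{
    CandidateCountsSketch counts = { 0, 0 };

    assert(nodes != NULL || nodeCount == 0);

    for (size_t i = 0; i < nodeCount; i++)
    {
        /* Nodes outside the quorum, or that already reported, are not missing. */
        if (!nodes[i].replicationQuorum || nodes[i].hasReportedLSN)
        {
            continue;
        }

        if (nodes[i].isHealthy)
        {
            counts.missingNodesCount++;
        }
        else
        {
            counts.deadMissingNodesCount++;
        }
    }

    return counts;
}
```

Keeping the counters separate lets the caller wait only on nodes that can still answer, and reason about the unreachable ones afterwards.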

After all alive nodes have reported, the pigeonhole principle determines
whether automatic failover is safe:

- deadMissingNodesCount < number_sync_standbys → proceed automatically.
  Every acknowledged commit required number_sync_standbys standbys to
  confirm it. With fewer than that many dead, at least one alive candidate
  must hold the most recent commit, so promoting the most-advanced alive node
  is data-safe.
- deadMissingNodesCount >= number_sync_standbys → block and emit a clear
  log message directing the operator to run pg_autoctl drop node for each
  unreachable node. Data-safety cannot be automatically guaranteed.
- number_sync_standbys = 0 → always proceed; no synchronous-replication
  guarantee was ever in effect.
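
The three cases above, together with the existing missingNodesCount guard, can be written as one small decision function. This is a sketch under stated assumptions rather than the code in the patch: the enum, its values and the function name are hypothetical, and the real ProceedGroupStateForMSFailover works on the monitor's own data structures.

```c
#include <assert.h>

/* Hypothetical outcome type, introduced only for this illustration. */
typedef enum FailoverDecisionSketch
{
    FAILOVER_SKETCH_WAIT,    /* alive nodes still have to report their LSN */
    FAILOVER_SKETCH_PROCEED, /* safe to promote the most advanced alive node */
    FAILOVER_SKETCH_BLOCK    /* operator must run pg_autoctl drop node */
} FailoverDecisionSketch;

FailoverDecisionSketch
failover_decision_sketch(int missingNodesCount,
                         int deadMissingNodesCount,
                         int numberSyncStandbys)
{
    assert(missingNodesCount >= 0);
    assert(deadMissingNodesCount >= 0);
    assert(numberSyncStandbys >= 0);

    /* Existing guard: keep waiting while alive nodes have not reported. */
    if (missingNodesCount > 0)
    {
        return FAILOVER_SKETCH_WAIT;
    }

    /* No synchronous-replication guarantee was in effect: always proceed. */
    if (numberSyncStandbys == 0)
    {
        return FAILOVER_SKETCH_PROCEED;
    }

    /* Pigeonhole check: fewer dead nodes than required acknowledgements. */
    if (deadMissingNodesCount < numberSyncStandbys)
    {
        return FAILOVER_SKETCH_PROCEED;
    }

    return FAILOVER_SKETCH_BLOCK;
}
```

The number_sync_standbys = 0 case needs its own branch: with no dead nodes the comparison 0 < 0 is false, so the pigeonhole check alone would block a failover that the rules above say should always proceed.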

Changes

src/monitor/group_state_machine.c
- Add deadMissingNodesCount to CandidateList
- BuildCandidateList: increment deadMissingNodesCount for unreachable quorum nodes
- ProceedGroupStateForMSFailover: apply pigeonhole check after the
  missingNodesCount guard; log a message when automatic failover
  cannot proceed

monitor: fix report_lsn deadlock when quorum nodes are permanently un…

2fa4f49

…reachable Fixes hapostgres#1113

nmcc1212 wants to merge 1 commit into hapostgres:main from nmcc1212:patch-1
