promote-to-primary action can break async-replication if primary unit on secondary cluster takes over sync role first #1665

@ggouzi

Description

On a dual-cluster setup with asynchronous replication (2+2 units), after running promote-to-primary scope=unit on the primary cluster, the other unit of that cluster is stuck in the Replica role and never becomes Sync Standby.

Summary of the setup:

| Member       | IP address   | Site      | Patroni Status                     |
|--------------|--------------|-----------|------------------------------------|
| postgresql-0 | 10.10.128.24 | Primary   | Sync Standby (primary cluster)     |
| postgresql-1 | 10.10.128.23 | Primary   | Leader (primary cluster)           |
| postgresql-0 | 10.10.118.24 | Secondary | Standby Leader (secondary cluster) |
| postgresql-1 | 10.10.118.23 | Secondary | Replica (secondary cluster)        |

Steps to reproduce

  1. On the primary cluster, run the promote-to-primary action on the Sync Standby unit (postgresql/0) to make it the primary unit
juju run postgresql/0 -- promote-to-primary scope=unit
  2. Observe the patronictl output

Expected behavior

sudo patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql (7633847608209757521) -+-----------+----+-----------+
| Member       | Host         | Role         | State     | TL | Lag in MB |
+--------------+--------------+--------------+-----------+----+-----------+
| postgresql-0 | 10.52.128.24 | Leader       | running   |  4 |           |
| postgresql-1 | 10.52.128.23 | Sync Standby | streaming |  4 |         0 |
+--------------+--------------+--------------+-----------+----+-----------+

Actual behavior

sudo patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql (7633847608209757521) --------+----+-----------+
| Member       | Host         | Role    | State     | TL | Lag in MB |
+--------------+--------------+---------+-----------+----+-----------+
| postgresql-0 | 10.52.128.24 | Leader  | running   |  4 |           |
| postgresql-1 | 10.52.128.23 | Replica | streaming |  4 |         0 |
+--------------+--------------+---------+-----------+----+-----------+
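
The difference between the expected and actual output comes down to whether any member carries the Sync Standby role. A minimal sketch of an automated check is shown here; the function name has_sync_standby is hypothetical, and the sketch assumes the table layout shown in this issue, with Role as the third column.

```shell
# Reads `patronictl list` output on stdin.
# Returns 0 if some member has the "Sync Standby" role, 1 otherwise.
# Assumption: Role is the third column, as in the tables in this issue.
has_sync_standby() {
    awk -F'|' '
        # Table rows start with "|"; with FS="|" the Role column is field 4.
        /^[|]/ {
            role = $4
            gsub(/^ +| +$/, "", role)   # trim the column padding
            if (role == "Sync Standby") found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Example usage on a unit of the primary cluster:
#   sudo patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list \
#       | has_sync_standby || echo "no Sync Standby in this cluster"
```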

Versions

Operating system: Ubuntu 24.04
Juju CLI: 3.6.14
Juju agent: 3.6.14
Charm revision: 1047
LXD: 5.21/stable

Log output

Juju debug log:

Additional context

After the promotion, the pg_stat_replication view lists 2 nodes named postgresql-0. The first one, in sync state, is the primary unit of the secondary cluster (10.10.118.24). The second one, in potential state, is the replica unit of the main cluster (10.10.128.24), the one we aimed to fail over to with the promote-to-primary action.

postgres=# SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;
 application_name | client_addr  |   state   | sync_state 
------------------+--------------+-----------+------------
 postgresql-0     | 10.10.118.24 | streaming | sync
 postgresql-0     | 10.10.128.24 | streaming | potential
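
Both rows above carry the same application_name. PostgreSQL matches synchronous_standby_names against application_name, so one possible reading (an assumption, not a confirmed root cause) is that the two postgresql-0 units cannot be told apart when the sync standby is chosen. A small sketch to spot such collisions follows; the function name duplicate_app_names is hypothetical, and it expects unaligned psql output with "|" as the field separator.

```shell
# Reads rows of "application_name|client_addr|sync_state" on stdin, e.g. from:
#   psql -At -F'|' -c "SELECT application_name, client_addr, sync_state FROM pg_stat_replication"
# Prints each application_name that is used by more than one client address.
duplicate_app_names() {
    awk -F'|' '
        # Count every distinct (name, address) pair once per name.
        NF >= 2 && !seen[$1 SUBSEP $2]++ { count[$1]++ }
        END { for (name in count) if (count[name] > 1) print name }
    ' | sort
}
```

Fed with the two rows above, the function prints postgresql-0; with distinct names per standby it prints nothing.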

Workaround:

  • Raise synchronous_node_count from 1 to 2 by editing the Patroni configuration directly, since juju config does not allow a value above 1.
sudo charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml edit-config
  • Restart Patroni on the main unit of the secondary cluster (10.10.118.24)
sudo snap restart charmed-postgresql.patroni
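
The edit-config step opens an interactive editor. As a sketch only, the same change can be applied non-interactively with the --set and --force options of patronictl edit-config. The function name set_sync_node_count and the PATRONI_CONF variable are hypothetical; the configuration path and the charmed-postgresql.patronictl command are the ones used in this issue.

```shell
# Path taken from the commands in this issue.
PATRONI_CONF=/var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml

# Sets synchronous_node_count in Patroni's dynamic configuration.
# --set applies a single key=value change; --force skips the confirmation prompt.
# The target value defaults to 2, as in the workaround above.
set_sync_node_count() {
    count=${1:-2}
    sudo charmed-postgresql.patronictl -c "$PATRONI_CONF" \
        edit-config --set "synchronous_node_count=$count" --force
}

# Example usage:
#   set_sync_node_count 2
```

Because this bypasses juju config, the charm may overwrite the value later; that is an assumption worth verifying before relying on the workaround.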

Now the correct unit (10.10.128.24, the new Leader) has taken over the sync state:

postgres=# SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;
 application_name | client_addr  |   state   | sync_state 
------------------+--------------+-----------+------------
 postgresql-0     | 10.10.128.24 | streaming | sync
 postgresql-1     | 10.10.118.23 | streaming | async
(2 rows)

And patronictl now reports the expected roles:

sudo patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql (7633847608209757521) -+-----------+----+-----------+
| Member       | Host         | Role         | State     | TL | Lag in MB |
+--------------+--------------+--------------+-----------+----+-----------+
| postgresql-0 | 10.10.128.24 | Leader       | running   |  4 |           |
| postgresql-1 | 10.10.128.23 | Sync Standby | streaming |  4 |         0 |
+--------------+--------------+--------------+-----------+----+-----------+

Labels: bug
