On a dual-cluster setup with asynchronous replication (2+2), after running promote-to-primary scope=unit, the secondary unit stays stuck in the Replica role and does not move to Sync Standby.
Summary of the setup:
| Member | IP address | Site | Patroni Status |
| --- | --- | --- | --- |
| postgresql-0 | 10.10.128.24 | Primary | Sync Standby (primary cluster) |
| postgresql-1 | 10.10.128.23 | Primary | Leader (primary cluster) |
| postgresql-0 | 10.10.118.24 | Secondary | Standby Leader (secondary cluster) |
| postgresql-1 | 10.10.118.23 | Secondary | Replica (secondary cluster) |
Steps to reproduce
- On the primary cluster, run the promote-to-primary action to switch the secondary unit to primary:
juju run postgresql/0 -- promote-to-primary scope=unit
- Observe patronictl output
Expected behavior
sudo patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql (7633847608209757521) -+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------------+--------------+--------------+-----------+----+-----------+
| postgresql-0 | 10.52.128.24 | Leader | running | 4 | |
| postgresql-1 | 10.52.128.23 | Sync Standby | streaming | 4 | 0 |
+--------------+--------------+--------------+-----------+----+-----------+
Actual behavior
sudo patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql (7633847608209757521) --------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------------+--------------+---------+-----------+----+-----------+
| postgresql-0 | 10.52.128.24 | Leader | running | 4 | |
| postgresql-1 | 10.52.128.23 | Replica | streaming | 4 | 0 |
+--------------+--------------+---------+-----------+----+-----------+
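To check for this state from a script rather than by eye, the Role column of the patronictl list output can be counted. The following is a minimal sketch: the function name `count_sync_standbys` is hypothetical, and it assumes the pipe-delimited table layout shown in the outputs above.

```shell
# Hypothetical helper: read `patronictl list` output on stdin and print
# the number of members whose Role column is exactly "Sync Standby".
count_sync_standbys() {
  awk -F'|' '
    NF >= 4 {
      # Data rows look like: | member | host | role | state | TL | lag |
      # so the role is the 4th pipe-separated field (field 1 is empty).
      role = $4
      gsub(/^[ \t]+|[ \t]+$/, "", role)   # trim surrounding whitespace
      if (role == "Sync Standby") n++
    }
    END { print n + 0 }                    # print 0 when nothing matched
  '
}
```

Piping the list command from this report into `count_sync_standbys` would print 0 for the actual output above and 1 for the expected output.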
Versions
Operating system: Ubuntu 24.04
Juju CLI: 3.6.14
Juju agent: 3.6.14
Charm revision: 1047
LXD: 5.21/stable
Log output
Juju debug log:
Additional context
After the promote, the pg_stat_replication view shows two entries named postgresql-0. The first, in sync state, is the primary unit of the secondary cluster. The second is the replica unit of the main cluster (the one we aimed to fail over to with the promote-to-primary action).
postgres=# SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;
application_name | client_addr | state | sync_state
------------------+--------------+-----------+------------
postgresql-0 | 10.10.118.24 | streaming | sync
postgresql-0 | 10.10.128.24 | streaming | potential
Workaround:
- Adjust synchronous_node_count from 1 to 2. It cannot be raised above 1 using juju config, so edit the Patroni configuration directly:
sudo charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml edit-config
- Restart patroni on the main unit of the secondary cluster (10.10.118.24)
sudo snap start charmed-postgresql.patroni
The correct unit (10.10.128.24, new Leader) has now taken over the sync state:
postgres=# SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;
application_name | client_addr | state | sync_state
------------------+--------------+-----------+------------
postgresql-0 | 10.10.128.24 | streaming | sync
postgresql-1 | 10.10.118.23 | streaming | async
(2 rows)
And patronictl list now reports the expected roles:
sudo patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql (7633847608209757521) -+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------------+--------------+--------------+-----------+----+-----------+
| postgresql-0 | 10.10.128.24 | Leader | running | 4 | |
| postgresql-1 | 10.10.128.23 | Sync Standby | streaming | 4 | 0 |
+--------------+--------------+--------------+-----------+----+-----------+
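For repeatability, the synchronous_node_count change from the workaround could probably be applied without the interactive editor. This is only a sketch: it assumes the installed patronictl supports the `--set` and `--force` options of `edit-config`, and the wrapper name `set_sync_node_count` and the `PATRONI_CONF` variable are hypothetical. The charm manages this configuration, so the change may not persist.

```shell
# Config path taken from the commands in this report.
PATRONI_CONF="/var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml"

# Hypothetical wrapper: set synchronous_node_count in Patroni's dynamic
# configuration without opening an editor.
#   --set KEY=VALUE  applies a single change
#   --force          skips the confirmation prompt
set_sync_node_count() {
  sudo charmed-postgresql.patronictl -c "$PATRONI_CONF" \
    edit-config --set "synchronous_node_count=$1" --force
}
```

Calling `set_sync_node_count 2` would correspond to the first workaround step. The Patroni restart on the secondary cluster's main unit still has to be done separately.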