Fix backport of streaming WAL archiving status feature #358

Merged
NJrslv merged 1 commit into OPENGPDB_STABLE from fix-archive-status-feature
Mar 2, 2026

NJrslv commented Feb 27, 2026

The Greenplum backport [1] of the WAL archiving status reporting feature [0] had bugs that could cause WAL loss or unnecessary archival traffic.

ProcessArchivalReport() scanned pg_xlog/ instead of pg_xlog/archive_status/, so it compared against actual WAL files rather than .ready status markers. It also ignored timeline switches [3].
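
As a rough illustration of the corrected scan (a Python model, not the C code from the patch), the sketch below lists the segments that still have a `.ready` marker under `archive_status/`. The function name `pending_segments` is hypothetical; the 24-hex-digit segment name format is the standard PostgreSQL one.

```python
import os
import re

# A WAL segment name is 24 uppercase hex digits (timeline, log, segment).
_READY_RE = re.compile(r"([0-9A-F]{24})\.ready")


def pending_segments(xlog_dir):
    """Return the sorted WAL segment names that have a .ready marker.

    Only archive_status/ is scanned; the WAL files in xlog_dir itself
    are never looked at.
    """
    status_dir = os.path.join(xlog_dir, "archive_status")
    names = []
    for entry in os.listdir(status_dir):
        match = _READY_RE.fullmatch(entry)
        if match:  # skips .done markers, history files, and so on
            names.append(match.group(1))
    return sorted(names)
```

Scanning the status directory means a WAL file that is present but already marked `.done` is never treated as pending.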

The walsender side sent archival reports during the startup, backup, and stopping phases, when it should report only during streaming/catchup and only when global timeouts are enabled. It also did not filter the reported filenames to WAL segments, causing unnecessary archival traffic.
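
The intended gating on the walsender side can be modelled roughly as follows. This is a Python sketch, not the patch itself: the enum mirrors the walsender states named above, while `should_send_archival_report` and its `timeouts_enabled` flag are hypothetical stand-ins for whatever condition the C code checks.

```python
from enum import Enum


class WalSndState(Enum):
    STARTUP = "startup"
    BACKUP = "backup"
    CATCHUP = "catchup"
    STREAMING = "streaming"
    STOPPING = "stopping"


def should_send_archival_report(state, timeouts_enabled):
    """Report only while catching up or streaming, and only with timeouts on."""
    reporting_states = (WalSndState.CATCHUP, WalSndState.STREAMING)
    return timeouts_enabled and state in reporting_states
```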

Apply fixes from V3 [2] of the upstream patch:

  • Scan archive_status/ for .ready files instead of pg_xlog/.
  • Mark ancestor-timeline segments .done only if they precede the switch point; leave divergent segments alone.
  • Skip the full directory scan when the timeline is unchanged.
  • In XLogWalRcvClose(), create .done instead of .ready when the segment is already covered by the last archival report.
  • Send archival reports only during the streaming/catchup phase.
  • Filter reported files to WAL segments only.
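
Two of the items above, the filename filter and the marker choice when a segment is closed, can be sketched in Python. This is a simplified model under stated assumptions: the report is assumed to carry the name of the last archived segment, only the same-timeline case is handled, and both function names are hypothetical rather than taken from the patch.

```python
import re

# WAL segment names are exactly 24 uppercase hex digits.
_SEGMENT_RE = re.compile(r"[0-9A-F]{24}")


def is_wal_segment(name):
    """True for plain WAL segment names, False for history/backup/partial files."""
    return _SEGMENT_RE.fullmatch(name) is not None


def status_suffix_on_close(segment, last_reported):
    """Pick the status marker for a segment the walreceiver has just closed.

    last_reported is the segment name from the most recent archival
    report, or None if no report has arrived yet (an assumption about
    what the report carries).
    """
    if last_reported is None or not is_wal_segment(last_reported):
        return ".ready"
    same_timeline = segment[:8] == last_reported[:8]
    # Fixed-width uppercase hex compares correctly as plain strings.
    covered = same_timeline and segment[8:] <= last_reported[8:]
    return ".done" if covered else ".ready"
```

Creating `.done` directly avoids leaving a `.ready` marker for a segment that the primary has already archived.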

The only new part is a test that promotes a standby (TLI 1 to TLI 2). It deliberately creates a TLI 1 WAL segment past the switch point, together with its .ready file, on a new standby and verifies that a TLI 2 archival report does not incorrectly mark it as .done.

[0] https://www.postgresql.org/message-id/5550D20D.6090703%40iki.fi
[1] 4f2db19
[2] https://www.postgresql.org/message-id/D4B53AE3-B7AF-4BE6-9CB6-44956B05DE72%40yandex-team.ru
[3] Timeline switch problem:
When a timeline switch occurs between two archival reports, the standby could erroneously delete WAL segments.
After promotion to a newer timeline, it would mark ancestor-timeline .ready files as .done even for segments
past the switch point, which belong to the divergent old-timeline branch and must not be touched.
In the example below, the switch point is segment 39. Suppose the first archival report was at segment 30 and
the next at segment 50. The prior patch would erroneously mark timeline 1 segments 40 and 41 as .done, which allows them to be deleted.

(Timeline 1)
     /
... 39 -- 40 -- 41 (40.ready and 41.ready should not be moved to .done)
     \
     (Timeline 2)
        \
        40 -- 41 -- 42 -- 43 -- ... --- 50 (current master is here)
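
The rule illustrated by the diagram can be written as a small decision function. This is a Python sketch with hypothetical names, not the patch's C code. It assumes segments are identified by a (timeline, segment number) pair and that `switch_points` maps each ancestor timeline to the segment at which it was left. Treating the switch segment itself as not safe to mark is also an assumption.

```python
def should_mark_done(ready_tli, ready_seg, report_tli, report_seg, switch_points):
    """Decide whether a .ready marker may be turned into .done.

    ready_tli/ready_seg   -- timeline and segment number of the .ready file
    report_tli/report_seg -- last archived segment named in the report
    switch_points         -- {ancestor timeline: segment where it was left}
    """
    if ready_seg > report_seg:
        return False  # not yet covered by the report
    if ready_tli == report_tli:
        return True  # same timeline, already archived upstream
    switch_seg = switch_points.get(ready_tli)
    if switch_seg is None:
        return False  # not a known ancestor timeline, leave it alone
    # Ancestor timeline: only segments before the switch are shared history.
    return ready_seg < switch_seg
```

With the numbers from the diagram (`switch_points = {1: 39}`, report at timeline 2, segment 50), timeline 1 segments 40 and 41 are left as `.ready`, while timeline 1 segment 35 and timeline 2 segment 45 may be marked `.done`.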

@NJrslv NJrslv merged commit 83be7b7 into OPENGPDB_STABLE Mar 2, 2026
42 of 45 checks passed
@NJrslv NJrslv deleted the fix-archive-status-feature branch March 2, 2026 14:31