Fix backport of streaming WAL archiving status feature #358
NJrslv merged 1 commit into OPENGPDB_STABLE on Mar 2, 2026
The Greenplum backport [1] of the WAL archiving status reporting
feature [0] had bugs that could cause WAL loss or unnecessary
archival traffic.
ProcessArchivalReport() scanned pg_xlog/ instead of
pg_xlog/archive_status/, so it compared against actual WAL files
rather than .ready status markers. It also ignored timeline
switches: after promotion from TLI1 to TLI2, it would mark
ancestor-timeline .ready files as .done even for segments past
the switch point, which belong to the divergent old-timeline branch
and must not be touched.
The walsender side sent archival reports during the startup, backup,
and stopping phases, when it should report only during the
streaming/catchup phases and only when global timeouts are enabled.
It also did not filter the reported filenames down to WAL segments,
causing unnecessary archival traffic.
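The walsender-side rules described above can be modelled in a few lines. This is only an illustrative Python sketch, not the C code from the patch: the `WalSndState` enum, the `timeouts_enabled` flag, and both function names are hypothetical stand-ins, and the filename check assumes the usual 24-character uppercase-hex WAL segment name.

```python
import re
from enum import Enum, auto


class WalSndState(Enum):
    """Hypothetical mirror of the walsender phases named in the description."""
    STARTUP = auto()
    BACKUP = auto()
    CATCHUP = auto()
    STREAMING = auto()
    STOPPING = auto()


# Assumption: a WAL segment file name is exactly 24 uppercase hex digits
# (8 for the timeline, 16 for the segment position), with no suffix.
_WAL_SEGMENT_RE = re.compile(r"[0-9A-F]{24}")


def is_wal_segment_name(name: str) -> bool:
    """True only for plain segment names, not .history or .backup files."""
    return _WAL_SEGMENT_RE.fullmatch(name) is not None


def should_send_archival_report(state: WalSndState, timeouts_enabled: bool) -> bool:
    """Report only while catching up or streaming, and only with timeouts on."""
    return timeouts_enabled and state in (WalSndState.CATCHUP, WalSndState.STREAMING)


def reportable_files(names):
    """Keep only WAL segment names from a list of archived file names."""
    return sorted(n for n in names if is_wal_segment_name(n))
```

With this model, a timeline history file such as `00000002.history` is dropped by `reportable_files`, and no report is produced while the sender is in the backup phase.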
Apply fixes from V3 [2] of the upstream patch:
- Scan archive_status/ for .ready files instead of pg_xlog/.
- Only mark ancestor-timeline segments .done if they are before the
switch point; leave divergent segments alone.
- Skip the full directory scan when the timeline is unchanged.
- In XLogWalRcvClose(), create .done instead of .ready when the
segment is already covered by the last archival report.
- Send archival reports only during the streaming/catchup phase.
- Filter reported files to WAL segments only.
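The standby-side decisions in the list above can be sketched as follows. This is a simplified Python model under stated assumptions, not the patch's C code: all function names are hypothetical, a report is modelled as "everything up to `report_segno` on `report_tli` is archived", `switch_segno` maps each ancestor timeline to the segment containing its switch point, the boundary segment itself is treated as safe to mark, and `segs_per_xlogid` defaults to 256 (16 MB segments), which may differ in a given build.

```python
from typing import Dict, Tuple


def parse_segment(name: str, segs_per_xlogid: int = 256) -> Tuple[int, int]:
    """Split a WAL file name (optionally with a .ready suffix) into
    (timeline, segment number). Only the first 24 characters are read."""
    tli = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return tli, log * segs_per_xlogid + seg


def should_mark_done(ready_name: str, report_tli: int, report_segno: int,
                     switch_segno: Dict[int, int]) -> bool:
    """Decide whether a .ready marker may be turned into .done."""
    tli, segno = parse_segment(ready_name)
    if segno > report_segno:
        return False              # not yet covered by the report
    if tli == report_tli:
        return True               # same timeline as the report
    if tli not in switch_segno:
        return False              # not a known ancestor: leave alone
    # Ancestor timeline: segments past the switch point are divergent.
    return segno <= switch_segno[tli]


def status_suffix_on_close(segno: int, last_reported_segno: int) -> str:
    """Status file to create when the receiver finishes a segment."""
    return ".done" if segno <= last_reported_segno else ".ready"
```

For instance, with a report at segment 50 on TLI 2 and a TLI 1 switch point in segment 39, `000000010000000000000028.ready` (TLI 1, segment 40) is left untouched, while `000000010000000000000023.ready` (TLI 1, segment 35) may be marked .done.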
The only new addition is a test that promotes a standby (TLI 1 to
TLI 2). It deliberately creates a TLI 1 WAL segment and a matching
.ready file past the switch point on a new standby, then verifies that
the TLI 2 archival report does not incorrectly mark it as .done.
[0] https://www.postgresql.org/message-id/5550D20D.6090703%40iki.fi
[1] 4f2db19
[2] https://www.postgresql.org/message-id/D4B53AE3-B7AF-4BE6-9CB6-44956B05DE72%40yandex-team.ru
x4m
approved these changes
Mar 2, 2026
The Greenplum backport [1] of the WAL archiving status reporting feature [0] had bugs that could cause WAL loss or unnecessary archival traffic.
ProcessArchivalReport() scanned pg_xlog/ instead of pg_xlog/archive_status/, so it compared against actual WAL files rather than .ready status markers. It also ignored timeline switches [3].
The walsender side sent archival reports during the startup, backup, and stopping phases, when it should report only during the streaming/catchup phases and only when global timeouts are enabled. It also did not filter the reported filenames down to WAL segments, causing unnecessary archival traffic.
Apply the fixes from V3 [2] of the upstream patch.
The only new addition is a test that promotes a standby (TLI 1 to TLI 2). It deliberately creates a TLI 1 WAL segment and a matching .ready file past the switch point on a new standby, then verifies that the TLI 2 archival report does not incorrectly mark it as .done.
[0] https://www.postgresql.org/message-id/5550D20D.6090703%40iki.fi
[1] 4f2db19
[2] https://www.postgresql.org/message-id/D4B53AE3-B7AF-4BE6-9CB6-44956B05DE72%40yandex-team.ru
[3] Timeline switches problem:
When the timeline switches between two archival reports, the standby could erroneously delete WAL segments.
After promotion to a newer timeline, it would mark ancestor-timeline .ready files as .done even for segments
past the switch point, which belong to the divergent old-timeline branch and must not be touched.
For example, suppose the switch point is in segment 39, the first archival report was at segment 30, and
the next at segment 50. The prior patch would erroneously mark old-timeline segments 40 and 41 as .done, after which they would be deleted.
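The numbers in this example can be checked with a small Python sketch. It works on bare segment numbers only, and it assumes (hypothetically) that the old timeline has .ready markers for segments 31 through 41 and that the segment containing the switch point may still be marked; neither function is taken from the patch.

```python
def marked_done_before_fix(ready_segs, report_segno):
    """Prior behaviour: every old-timeline marker up to the report is marked."""
    return {s for s in ready_segs if s <= report_segno}


def marked_done_after_fix(ready_segs, report_segno, switch_segno):
    """Fixed behaviour: old-timeline markers past the switch point are kept."""
    return {s for s in ready_segs if s <= min(report_segno, switch_segno)}


# Assumed old-timeline .ready markers left after the first report at segment 30.
ready_on_old_timeline = set(range(31, 42))

wrongly_marked = (
    marked_done_before_fix(ready_on_old_timeline, report_segno=50)
    - marked_done_after_fix(ready_on_old_timeline, report_segno=50, switch_segno=39)
)
print(sorted(wrongly_marked))  # [40, 41]
```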