Surface corrupt WAL frames instead of dropping trailing bytes #86
LucaCappelletti94 wants to merge 2 commits into
Code Review
This pull request introduces a new ReplicationError::TrailingBytes error variant to detect and reject unconsumed trailing bytes in both parse_wal_message and parse_wal_message_bytes, ensuring that each replication frame contains exactly one message. The feedback suggests renaming the expected and actual fields of the new error variant to consumed and total to make the API more self-documenting and intuitive, along with updating the corresponding Display implementation, parser logic, and unit tests.
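A minimal sketch of the variant shape the review suggests, assuming a struct-style variant with consumed and total fields. The enum here is a hypothetical stand-in, not the crate's actual definition, and the Display wording is illustrative only.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical stand-in for the crate's error enum; only the variant
/// discussed in this review is shown.
#[derive(Debug, PartialEq, Eq)]
pub enum ReplicationError {
    /// The parser consumed `consumed` bytes out of a `total`-byte buffer,
    /// so `total - consumed` bytes were left unread.
    TrailingBytes { consumed: usize, total: usize },
}

impl fmt::Display for ReplicationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReplicationError::TrailingBytes { consumed, total } => write!(
                f,
                "trailing bytes after WAL message: consumed {} of {} bytes ({} left over)",
                consumed,
                total,
                // saturating_sub keeps the message sane even if a caller
                // constructs the variant with consumed > total.
                total.saturating_sub(*consumed)
            ),
        }
    }
}

impl Error for ReplicationError {}
```

With these names a caller can read the leftover length straight off the error, without having to know which field was "expected" and which was "actual".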
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #86      +/-   ##
==========================================
+ Coverage   95.51%   95.58%   +0.07%
==========================================
  Files          16       16
  Lines       16274    16364      +90
==========================================
+ Hits        15544    15642      +98
+ Misses        730      722       -8
|
|
Renamed to |
|
Hi @LucaCappelletti94, thank you for your PR. My main concern is that in normal operation each XLogData frame carries exactly one logical message (the pipeline strips the 25-byte header before parsing), so "no trailing bytes" is effectively an invariant, not a condition we expect to hit. Could you help us measure how much performance degrades if we add this code? And may I ask why you need this logic in the hot path? |
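For readers unfamiliar with the 25-byte figure: in the PostgreSQL streaming replication protocol an XLogData frame starts with the tag byte 'w' followed by three 8-byte big-endian fields (WAL start, WAL end, send time). The sketch below shows that split under those assumptions. The struct and function names are hypothetical and are not taken from this crate's stream.rs.

```rust
use std::convert::TryInto;

/// 1 tag byte ('w') + WAL start (8) + WAL end (8) + send time (8) = 25.
const XLOG_DATA_HEADER_LEN: usize = 1 + 8 + 8 + 8;

/// Hypothetical holder for the three header fields.
#[derive(Debug, PartialEq, Eq)]
pub struct XLogDataHeader {
    pub wal_start: u64,
    pub wal_end: u64,
    /// Microseconds since 2000-01-01, as sent by the server.
    pub send_time_micros: i64,
}

/// Splits a frame into its header and the logical-message payload.
/// Returns None if the frame is shorter than the header or is not tagged 'w'.
pub fn split_xlog_data(frame: &[u8]) -> Option<(XLogDataHeader, &[u8])> {
    if frame.len() < XLOG_DATA_HEADER_LEN || frame[0] != b'w' {
        return None;
    }
    let wal_start = u64::from_be_bytes(frame[1..9].try_into().ok()?);
    let wal_end = u64::from_be_bytes(frame[9..17].try_into().ok()?);
    let send_time_micros = i64::from_be_bytes(frame[17..25].try_into().ok()?);
    let header = XLogDataHeader { wal_start, wal_end, send_time_micros };
    // Everything after the header is what the message parser sees.
    Some((header, &frame[XLOG_DATA_HEADER_LEN..]))
}
```

The payload slice returned here is the buffer the invariant is about: in correct operation it holds exactly one logical message.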
|
I benchmarked
On why it is worth having: a downstream consumer that parses frames directly gets back only the message with no consumed-length, so today a corrupted, misframed, or injected stream is silently truncated to its first message. The checked error is what lets it notice instead of dropping it. |
158b594 to 26664c2
|
I will first run some benchmarks and load tests to see how much performance degrades with this PR; that might take a few days. |
Here are my measurements of
The control moves 2.6 percent on code the PR never touches, so anything within roughly 3 percent is drift. Every parse row is in that band, and where one path leans positive its sibling leans negative, which is layout jitter, not a real cost. |
|
I ran benchmarks and real load tests against a database, and the performance regression from PR86 is larger than I expected when compared with the main branch.
Perf: main vs PR86 (feat/parse-wal-message-strict)
Medians. bench: main=5, PR86=5. loadtest: main=5, PR86=5.
1. Microbenchmarks (ns, lower is better)
2. Load test DML TPS (higher is better)
3. Efficiency: 1% CPU = N TPS
4. Summary
|
|
Hi @LucaCappelletti94, I understand that some clients want rigorous checking of every packet frame from replication. How about using an env var/flag to decide whether rigorous checking is enabled? Clients that turn it on would know they lose some performance, and the env var/flag would default to false. In the future, any other checks on replication packet frames could rely on the same env var/flag to decide whether they run. |
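One way the opt-in proposed above could look, as a rough sketch only. FrameChecks, from_env_value, and check_consumed are hypothetical names invented for illustration, and the accepted flag values ("1" or "true") are an assumption rather than anything agreed in this thread.

```rust
/// Hypothetical switch for optional frame validation. Defaults to off,
/// matching the proposal that strict checking is opt-in.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct FrameChecks {
    pub strict: bool,
}

impl FrameChecks {
    /// Interprets the raw value of an environment variable or CLI flag.
    /// Only "1" or "true" (case-insensitive) enable strict checks; a missing
    /// or unrecognized value leaves them off.
    pub fn from_env_value(value: Option<&str>) -> Self {
        let strict = matches!(
            value.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
            Some("1") | Some("true")
        );
        FrameChecks { strict }
    }
}

/// Reports leftover bytes as Err((consumed, total)), but only when strict
/// checking is enabled. With checks off this is a single branch on a bool.
pub fn check_consumed(
    checks: FrameChecks,
    consumed: usize,
    total: usize,
) -> Result<(), (usize, usize)> {
    if checks.strict && consumed != total {
        return Err((consumed, total));
    }
    Ok(())
}
```

A caller could build the switch once at startup, for example from std::env::var with a variable name of its choosing, and pass it by value so the hot path never touches the environment.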
parse_wal_message and parse_wal_message_bytes now return a new ReplicationError::TrailingBytes when the input buffer carries bytes beyond a single WAL message, instead of parsing one message and silently dropping the remainder. Each replication frame carries exactly one logical message, so leftover bytes signal a corrupt or misframed input that should surface rather than pass unnoticed.
This is safe for the crate's own pipeline. The stream feeds exactly one message per XLogData frame (the 25-byte header is stripped in stream.rs before the parser runs), so the check never fires during correct operation. The old tolerance also had no legitimate use from the public API, since both methods return only the parsed message and no consumed-length, leaving a caller no way to notice or act on trailing bytes.
Note this changes the behavior of existing public methods rather than adding new ones, so it is not purely additive. A downstream caller that passed an over-long buffer and relied on getting the first message back will now get an error. On 0.x that is a MINOR bump.
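The behavior change can be illustrated with a toy parser. This is a sketch under loud assumptions: the wire format below (one length byte followed by that many payload bytes) is invented for the example and is not pgoutput, and parse_lenient and parse_strict are hypothetical stand-ins for the old and new behavior of the crate's parsers.

```rust
/// Hypothetical error type for the toy format.
#[derive(Debug, PartialEq, Eq)]
pub enum ToyError {
    /// The buffer ended before a full message was read.
    Truncated,
    /// A full message was read but bytes remain after it.
    TrailingBytes { consumed: usize, total: usize },
}

/// Parses one toy message and reports how many bytes it used.
fn parse_one(buf: &[u8]) -> Result<(&[u8], usize), ToyError> {
    let (&len, rest) = buf.split_first().ok_or(ToyError::Truncated)?;
    let len = len as usize;
    if rest.len() < len {
        return Err(ToyError::Truncated);
    }
    Ok((&rest[..len], 1 + len))
}

/// Old behavior: return the first message and silently ignore the rest.
pub fn parse_lenient(buf: &[u8]) -> Result<&[u8], ToyError> {
    parse_one(buf).map(|(msg, _consumed)| msg)
}

/// New behavior: leftover bytes are an error instead of being dropped.
pub fn parse_strict(buf: &[u8]) -> Result<&[u8], ToyError> {
    let (msg, consumed) = parse_one(buf)?;
    if consumed != buf.len() {
        return Err(ToyError::TrailingBytes { consumed, total: buf.len() });
    }
    Ok(msg)
}
```

For a buffer holding exactly one message both functions agree. For a buffer holding two concatenated messages, parse_lenient returns the first and hides the second, while parse_strict returns TrailingBytes, which is the case a downstream caller would need to handle after upgrading.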