Problem
6-hour replication soak test (TestSoakReplication) consistently produces data divergence (385 value mismatches, 122 missing keys, 25 extra keys in the latest 6h run). Root cause is a dual-timeline blind window during FULLRESYNC:
- badger MVCC: db.View() reads a point-in-time snapshot at T_view
- RDB generation completes, and only then is the replication offset snapshotOffset captured
- Writes committed after T_view but before snapshotOffset is captured are:
  - NOT in the RDB (committed after View started)
  - NOT in the backlog range replayed after snapshotOffset (committed before offset capture)
Result: roughly 100 ms to 2 s of writes are dropped on each FULLRESYNC. A subsequent FULLRESYNC recovers them, so replicas converge eventually, but there is no linearizable guarantee.
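The blind window can be reproduced with a toy model. This is a minimal sketch, not the project's code: the master type, its in-memory map, and the per-command offset counter are all assumptions made for illustration (a real replication offset is typically a byte count, and the real snapshot comes from badger).

```go
package main

// write is one replicated command. Offset is the (hypothetical) master
// replication offset assigned when the write was appended to the backlog.
type write struct {
	Key, Value string
	Offset     int64
}

// master is a toy stand-in for the real store plus replication backlog.
type master struct {
	data    map[string]string
	backlog []write
	offset  int64
}

func newMaster() *master { return &master{data: map[string]string{}} }

// set commits a write and appends it to the backlog.
func (m *master) set(k, v string) {
	m.offset++
	m.data[k] = v
	m.backlog = append(m.backlog, write{Key: k, Value: v, Offset: m.offset})
}

// snapshot copies the data as of now, mimicking a point-in-time view.
func (m *master) snapshot() map[string]string {
	out := make(map[string]string, len(m.data))
	for k, v := range m.data {
		out[k] = v
	}
	return out
}

// resync builds a replica from an RDB image plus every backlog entry
// whose offset is strictly greater than snapshotOffset.
func resync(rdb map[string]string, backlog []write, snapshotOffset int64) map[string]string {
	replica := make(map[string]string, len(rdb))
	for k, v := range rdb {
		replica[k] = v
	}
	for _, w := range backlog {
		if w.Offset > snapshotOffset {
			replica[w.Key] = w.Value
		}
	}
	return replica
}

// blindWindowDemo returns the replica state when the offset is captured
// after RDB generation (late) versus at the moment of the view (early).
func blindWindowDemo() (late, early map[string]string) {
	m := newMaster()
	m.set("a", "1")
	offsetAtView := m.offset // offset at T_view
	rdb := m.snapshot()

	m.set("b", "2") // committed while the RDB is being generated
	m.set("a", "3")
	offsetAfterRDB := m.offset // what the current flow records
	m.set("c", "4")

	late = resync(rdb, m.backlog, offsetAfterRDB)
	early = resync(rdb, m.backlog, offsetAtView)
	return late, early
}
```

With the late capture the replica is missing "b" and holds a stale value for "a", which matches the missing-key and value-mismatch categories reported by the soak run. With the offset bound to the view, the replica matches the master.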
The soak test has been reclassified from a correctness test to a stability-only test (compareDatasets now logs divergence as informational). This debt needs to be paid.
Requirements
The mapping must guarantee:
- RDB snapshot binds an exact replication offset: every key-value pair in the RDB is stamped with the master replication offset at which it was observed
- Post-snapshot writes are fully replayable: all writes committed between snapshot offset and backlog offset are present in the backlog
- No duplicate replay: writes already in the RDB are not re-applied from backlog
- TestSoakReplication data consistency assertions can be restored: once the boundary exists, the soak test can re-enable hard t.Errorf for dataset comparison
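The replay-side guarantees above (full coverage, no duplicates) can be expressed as a single check on the replica path. The sketch below is illustrative only: entry and planReplay are hypothetical names, offsets are modeled as contiguous per-command counters, and an empty backlog is treated as "nothing to replay".

```go
package main

import "fmt"

// entry is one backlog record. Offset is the master replication offset
// reached once this command has been applied (hypothetical model).
type entry struct {
	Offset int64
	Cmd    string
}

// planReplay returns the backlog entries a replica should apply after
// loading an RDB bound to snapshotOffset. The backlog is assumed to be
// sorted ascending by Offset with no internal gaps.
func planReplay(backlog []entry, snapshotOffset int64) ([]entry, error) {
	if len(backlog) == 0 {
		return nil, nil
	}
	// Coverage: the first retained entry must not start past the first
	// offset the replica still needs, otherwise writes would be lost.
	if first := backlog[0].Offset; first > snapshotOffset+1 {
		return nil, fmt.Errorf("backlog starts at offset %d, replica needs %d",
			first, snapshotOffset+1)
	}
	// No duplicate replay: skip everything the RDB already contains.
	var out []entry
	for _, e := range backlog {
		if e.Offset > snapshotOffset {
			out = append(out, e)
		}
	}
	return out, nil
}
```

An error from planReplay corresponds to the case where a partial resync cannot be trusted and a fresh FULLRESYNC would be needed.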
Approach
The strict fix requires tracking badger commit timestamps alongside replication offsets. Sketch:
- Add a commit-seq → repl-offset mapping table in badger (small, periodically flushed)
- GenerateRDB writes the current replOffset into the RDB metadata before starting the view transaction
- On CONTINUE/PSYNC, the slave compares its replOffset against the snapshot offset in the RDB
- Backlog head is trimmed only after confirming snapshot offset is covered
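One possible shape for the commit-seq to repl-offset mapping and the trim guard is sketched below. It is kept in memory for brevity, whereas the approach above stores the table in badger; seqMap, offsetAt, and canTrim are hypothetical names, and how the snapshot's commit sequence is obtained from badger is deliberately left out.

```go
package main

import "sort"

// seqOffset pairs a store commit sequence with the replication offset
// that was current when that commit landed.
type seqOffset struct {
	CommitSeq  uint64
	ReplOffset int64
}

// seqMap holds pairs sorted ascending by CommitSeq.
type seqMap struct {
	entries []seqOffset
}

// record appends a pair. Callers are assumed to record commits in
// increasing CommitSeq order, so the slice stays sorted.
func (m *seqMap) record(commitSeq uint64, replOffset int64) {
	m.entries = append(m.entries, seqOffset{CommitSeq: commitSeq, ReplOffset: replOffset})
}

// offsetAt returns the replication offset of the latest recorded commit
// visible to a snapshot taken at readSeq (CommitSeq <= readSeq).
// ok is false when no recorded commit is visible.
func (m *seqMap) offsetAt(readSeq uint64) (offset int64, ok bool) {
	// Index of the first entry that is NOT visible to the snapshot.
	i := sort.Search(len(m.entries), func(i int) bool {
		return m.entries[i].CommitSeq > readSeq
	})
	if i == 0 {
		return 0, false
	}
	return m.entries[i-1].ReplOffset, true
}

// canTrim reports whether backlog entries up to and including trimTo
// can be dropped while a snapshot bound to snapshotOffset is in flight.
// Entries after snapshotOffset must stay available for replay.
func canTrim(trimTo, snapshotOffset int64) bool {
	return trimTo <= snapshotOffset
}
```

Looking the offset up from the snapshot's own sequence, rather than reading a live counter before or after the view, is what would bind the RDB to one exact offset in this model.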
References
- internal/store/define.go: key prefixes
- internal/replication/psync.go: FULLRESYNC handshake and executeReplicatedCommand
- internal/replication/rdb.go: GenerateRDB and SendRDB
- docs/failures/snapshot-inconsistency.md: original failure postmortem
- cmd/integration/soak_replication_test.go: compareDatasets (now informational)
- AGENTS.md » "FULLRESYNC Semantics (Known Limitation)" section
- AGENTS.md » "Snapshot Offset Fix History" section
Acceptance Criteria
- TestRegressionSnapshotFullresyncOffset passes after any replication changes
- TestSoakReplication can be reverted to hard dataset comparison assertions without flaky failures
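For the second criterion, a hard comparison could count the same three categories the soak run already reports. This is a sketch under assumptions: diffDatasets and datasetDiff are hypothetical names, datasets are modeled as plain string maps, and the real compareDatasets may be structured differently.

```go
package main

// datasetDiff counts the divergence categories reported by the soak test.
type datasetDiff struct {
	ValueMismatches int // key on both sides, values differ
	MissingKeys     int // key on master, absent on replica
	ExtraKeys       int // key on replica, absent on master
}

// diffDatasets compares a replica dataset against the master dataset.
func diffDatasets(master, replica map[string]string) datasetDiff {
	var d datasetDiff
	for k, mv := range master {
		rv, ok := replica[k]
		switch {
		case !ok:
			d.MissingKeys++
		case rv != mv:
			d.ValueMismatches++
		}
	}
	for k := range replica {
		if _, ok := master[k]; !ok {
			d.ExtraKeys++
		}
	}
	return d
}

// clean reports whether the two datasets were identical.
func (d datasetDiff) clean() bool {
	return d == (datasetDiff{})
}
```

Inside the soak test, a non-clean result would then be reported with t.Errorf rather than logged as informational.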