Implement linearizable FULLRESYNC boundary with commit-seq ↔ repl-offset mapping #3

Description

@lbp0200

Problem

The 6-hour replication soak test (TestSoakReplication) consistently produces data divergence: the latest 6h run showed 385 value mismatches, 122 missing keys, and 25 extra keys. The root cause is a dual-timeline blind window during FULLRESYNC:

  • badger MVCC db.View() reads a point-in-time snapshot taken at T_view
  • RDB generation completes, and only then is the replication offset snapshotOffset captured
  • Writes committed after T_view but before snapshotOffset is captured are:
    • NOT in the RDB (committed after the View started)
    • NOT in the backlog (committed before the offset capture, so replay starts past them)

Result: roughly 100ms to 2s of writes are lost on each FULLRESYNC and stay lost until a later FULLRESYNC picks them up. Replicas therefore converge eventually, but there is no linearizable guarantee.
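The blind window can be reproduced with a toy model. The sketch below is not code from this repository: the `write` type and `lostWrites` function are hypothetical, and it assumes each write carries a commit sequence (standing in for a badger commit timestamp) and the replication offset reached after that write.

```go
package main

// write is one committed command in a toy model of the master's timeline.
type write struct {
	seq    int   // commit order on the master (stand-in for a badger commit timestamp)
	offset int64 // replication offset reached after this write was appended
}

// lostWrites returns the writes a replica never receives when the RDB is cut
// at viewSeq but the FULLRESYNC offset is captured later, at snapshotOffset.
func lostWrites(log []write, viewSeq int, snapshotOffset int64) []write {
	var lost []write
	for _, w := range log {
		inRDB := w.seq <= viewSeq              // visible to the snapshot
		inBacklog := w.offset > snapshotOffset // replayed after the RDB is loaded
		if !inRDB && !inBacklog {
			lost = append(lost, w)
		}
	}
	return lost
}
```

With five 10-byte writes (offsets 10 through 50), a view at seq 2 and an offset captured at 40, writes 3 and 4 fall into the gap. Capturing the offset that corresponds to the view itself (20 in this model) leaves nothing lost.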

The soak test has been reclassified from a correctness test to a stability-only test (compareDatasets now logs divergence as informational). This debt needs to be paid.

Requirements

The mapping must guarantee:

  1. RDB snapshot binds an exact replication offset — every key-value pair in the RDB is stamped with the master replication offset at which it was observed
  2. Post-snapshot writes are fully replayable — all writes committed between snapshot offset and backlog offset are present in the backlog
  3. No duplicate replay — writes already in the RDB are not re-applied from backlog
  4. TestSoakReplication data consistency assertions can be restored — once the boundary exists, the soak test can re-enable hard t.Errorf for dataset comparison
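Requirements 2 and 3 reduce to a single boundary check once the RDB carries its offset. The following is a minimal sketch under stated assumptions: `replayStart` is a hypothetical helper, offsets count bytes of the replication stream, and the backlog is assumed to hold the range [backlogStart, backlogEnd].

```go
package main

import "fmt"

// replayStart decides where backlog replay must begin after a replica has
// loaded an RDB stamped with snapshotOffset.
func replayStart(snapshotOffset, backlogStart, backlogEnd int64) (int64, error) {
	if snapshotOffset < backlogStart {
		// Bytes between snapshotOffset and backlogStart were already trimmed,
		// so post-snapshot writes are not fully replayable (requirement 2).
		return 0, fmt.Errorf("backlog starts at %d, after snapshot offset %d",
			backlogStart, snapshotOffset)
	}
	if snapshotOffset > backlogEnd {
		return 0, fmt.Errorf("snapshot offset %d is beyond backlog end %d",
			snapshotOffset, backlogEnd)
	}
	// Starting exactly at snapshotOffset skips everything already contained
	// in the RDB, so nothing is applied twice (requirement 3).
	return snapshotOffset, nil
}
```

An error here means the replica cannot be brought up to date from this RDB plus the backlog, which is the condition the last bullet of the approach is meant to rule out.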

Approach

The strict fix requires tracking badger commit timestamps alongside replication offsets. Sketch:

  • Add a commit-seq → repl-offset mapping table in badger (small, periodically flushed)
  • GenerateRDB writes the current replOffset into the RDB metadata before starting the view transaction
  • On CONTINUE/PSYNC, the slave compares its replOffset against the snapshot offset in the RDB
  • Backlog head is trimmed only after confirming snapshot offset is covered
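One possible shape for the mapping table is sketched below as an in-memory stand-in. Everything here is an assumption for illustration: `boundaryMap`, `Record`, and `OffsetAt` are hypothetical names, persistence in badger is left out, and `Record` is assumed to be called in ascending commit order under whatever lock already orders commits.

```go
package main

import (
	"sort"
	"sync"
)

// seqOffset records that every commit with sequence <= Seq is covered by the
// replication stream up to Offset.
type seqOffset struct {
	Seq    uint64
	Offset int64
}

// boundaryMap is an in-memory stand-in for the commit-seq to repl-offset table.
type boundaryMap struct {
	mu      sync.Mutex
	entries []seqOffset // ascending by Seq
	offset  int64       // current replication offset
}

// Record advances the replication offset by cmdLen and ties the new offset to
// the commit sequence seq in one critical section. It returns the new offset.
func (m *boundaryMap) Record(seq uint64, cmdLen int64) int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.offset += cmdLen
	m.entries = append(m.entries, seqOffset{Seq: seq, Offset: m.offset})
	return m.offset
}

// OffsetAt returns the offset of the last commit with Seq <= readSeq, which is
// the offset an RDB read at readSeq should be stamped with. The boolean is
// false when no commit at or before readSeq has been recorded.
func (m *boundaryMap) OffsetAt(readSeq uint64) (int64, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	i := sort.Search(len(m.entries), func(i int) bool {
		return m.entries[i].Seq > readSeq
	})
	if i == 0 {
		return 0, false
	}
	return m.entries[i-1].Offset, true
}
```

The point of the design is that the offset is looked up from the snapshot's own read sequence rather than sampled from a counter at some other moment, so the RDB contents and the stamped offset describe the same instant. A real table would also need pruning of entries older than the backlog head.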

References

  • internal/store/define.go: key prefixes
  • internal/replication/psync.go: FULLRESYNC handshake and executeReplicatedCommand
  • internal/replication/rdb.go: GenerateRDB and SendRDB
  • docs/failures/snapshot-inconsistency.md: original failure postmortem
  • cmd/integration/soak_replication_test.go: compareDatasets (now informational)
  • AGENTS.md » "FULLRESYNC Semantics (Known Limitation)" section
  • AGENTS.md » "Snapshot Offset Fix History" section

Acceptance Criteria

  • TestRegressionSnapshotFullresyncOffset passes after any replication changes
  • FULLRESYNC does not lose writes in a controlled test (write N keys, trigger FULLRESYNC, verify all N present on slave)
  • TestSoakReplication can be reverted to hard dataset comparison assertions without flaky failures
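For the controlled test and the restored soak assertions, the comparison can report the same three categories the soak run already counts. This is a hedged sketch, not the existing compareDatasets: `datasetDiff` and `diffDatasets` are hypothetical, and both datasets are assumed to have been dumped into string maps first.

```go
package main

import "sort"

// datasetDiff lists the keys on which a replica disagrees with the master.
type datasetDiff struct {
	Mismatched []string // present on both sides with different values
	Missing    []string // present on the master only
	Extra      []string // present on the replica only
}

// Empty reports whether the two datasets were identical.
func (d datasetDiff) Empty() bool {
	return len(d.Mismatched) == 0 && len(d.Missing) == 0 && len(d.Extra) == 0
}

// diffDatasets compares a replica dump against the master dump.
func diffDatasets(master, replica map[string]string) datasetDiff {
	var d datasetDiff
	for k, mv := range master {
		rv, ok := replica[k]
		switch {
		case !ok:
			d.Missing = append(d.Missing, k)
		case rv != mv:
			d.Mismatched = append(d.Mismatched, k)
		}
	}
	for k := range replica {
		if _, ok := master[k]; !ok {
			d.Extra = append(d.Extra, k)
		}
	}
	// Map iteration order is random; sort so failure output is stable.
	sort.Strings(d.Mismatched)
	sort.Strings(d.Missing)
	sort.Strings(d.Extra)
	return d
}
```

In a test, a non-empty result would be reported with t.Errorf, including the three key lists, instead of being logged as informational.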

Metadata

Labels: enhancement
