Description
PR #89 adds a 'tombsweeper' that cleans stale deletion markers after a configured period.
That PR's description contains the following important warning:
WARNING: This is DISABLED by default, because enabling this functionality requires careful consideration. When enabled, you MUST make sure that no instance that has been offline for longer than the sweeper retention_days will ever reconnect. If this does happen, old entries that have been deleted may be resurrected causing anything from undesired results to database state corruption. Be especially careful about development or testing systems that only come online occasionally.
We could reduce the risk of corruption and resurrected entries after an extended downtime by rejecting databases that have not been synced for too long.
Basically this would prevent a client that has not synced for longer than the deletion marker retention interval from syncing to the store. To rejoin, it would have to discard its current LMDB contents and reload them from snapshots in the store.
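The proposed check could look something like the sketch below. This is illustrative only; the function and parameter names (`may_join`, `retention_days`, `last_synced`) are assumptions, not the project's actual API.

```python
from datetime import datetime, timedelta

def may_join(last_synced: datetime, retention_days: int, now: datetime) -> bool:
    """Reject a client whose last successful sync is older than the
    deletion-marker retention interval; it must reload from snapshots."""
    return now - last_synced <= timedelta(days=retention_days)

now = datetime(2024, 6, 30)
assert may_join(datetime(2024, 6, 20), 14, now)      # within interval: may sync
assert not may_join(datetime(2024, 6, 1), 14, now)   # too stale: must reload
```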
Implementation considerations
The fact that a snapshot from just within the retention interval has been loaded does not guarantee that it contains all changes up until that timestamp. We need an extra safety buffer, say 20% of the retention interval.
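With such a buffer, the acceptance window shrinks accordingly. A minimal sketch, assuming the buffer is expressed as a fraction of the retention interval (the names here are illustrative):

```python
from datetime import timedelta

def effective_cutoff(retention_days: int, buffer_fraction: float = 0.2) -> timedelta:
    """Shrink the acceptance window by a safety buffer, because a snapshot
    taken just inside the retention interval may not contain all changes
    up to its own timestamp."""
    return timedelta(days=retention_days) * (1.0 - buffer_fraction)

# With retention_days=10 and a 20% buffer, clients older than 8 days
# would be rejected rather than clients older than 10 days.
assert effective_cutoff(10) == timedelta(days=8)
```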
Consider what would happen if the whole installation is down for a while, for example because it is a dev or demo environment, or a sloppily managed personal deployment. In this case all instances may be considered outdated. Several potential solutions:
- Provide a commandline flag to force a join ignoring the interval.
- Temporarily disable the sweeper in the config, which would also disable the check? This would not work if we record sweep clean cutoffs (see below).
- Automatically purge the local LMDB data and load from store (would require opt-in config due to destructive nature)
- Check the timestamp of the latest snapshot, and use that to decide if it is still safe to join. This could allow one instance to join after an extended downtime, but perhaps not the rest?
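The last option above could be sketched as follows: compare the client's last sync against the newest snapshot's timestamp rather than wall-clock time, so that a fully-down installation (where nothing has been swept or written) does not lock everyone out. This is a hypothetical heuristic, not the project's actual logic.

```python
from datetime import datetime, timedelta

def safe_to_join(last_synced: datetime, latest_snapshot: datetime,
                 retention_days: int) -> bool:
    """If the store's newest snapshot is not more than retention_days
    ahead of the client's last sync, the client has not missed a full
    retention interval of activity and may still join."""
    return latest_snapshot - last_synced <= timedelta(days=retention_days)

# Whole installation down for months, retention 14 days: the newest
# snapshot is just as old as the client, so the first instance may join.
assert safe_to_join(datetime(2024, 6, 1), datetime(2024, 6, 1), 14)
# A client far behind an active store must reload instead.
assert not safe_to_join(datetime(2024, 1, 1), datetime(2024, 6, 1), 14)
```

Note the caveat from the bullet above still applies: once that first instance joins and produces newer snapshots, the remaining instances may no longer pass this check.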
Remaining issues
- The fact that a snapshot from just within the retention interval has been loaded does not guarantee that it contains all changes up until that timestamp, nor does an extra safety buffer.
- Changes to the retention interval over time are not taken into account, unless we explicitly record clean cutoffs in the LMDB. Perhaps we should.
- You cannot tell with what retention interval the snapshots in the store have been created.
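Recording clean cutoffs explicitly, as suggested above, would make the check independent of the currently configured retention interval. A rough sketch of the idea, with an in-memory list standing in for whatever would actually be persisted in the LMDB (the schema here is an assumption):

```python
from datetime import datetime

# Stand-in for cutoffs persisted alongside the data each time the
# tombsweeper runs; the real store would write these to the LMDB.
sweep_cutoffs: list = []

def record_sweep(cutoff: datetime) -> None:
    """Record that all deletion markers older than `cutoff` were purged."""
    sweep_cutoffs.append(cutoff)

def client_is_safe(last_synced: datetime) -> bool:
    """A joining client is safe only if its last sync is newer than every
    recorded cutoff, regardless of later retention-interval changes."""
    return all(last_synced >= cutoff for cutoff in sweep_cutoffs)

record_sweep(datetime(2024, 1, 1))
record_sweep(datetime(2024, 3, 1))
assert client_is_safe(datetime(2024, 4, 1))       # synced after every sweep
assert not client_is_safe(datetime(2024, 2, 1))   # a sweep happened since
```

This would also address the last point above, since a cutoff log describes what was actually swept rather than what interval happened to be configured.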