
Add tombsweeper state to prevent joins after extended downtime #90

@wojas

Description


PR #89 adds a 'tombsweeper' that cleans stale deletion markers after a configured period.

That PR's description contains the following important warning:

WARNING: This is DISABLED by default, because enabling this functionality requires careful consideration. When enabled, you MUST make sure that no instance that has been offline for longer than the sweeper retention_days will ever reconnect. If this does happen, old entries that have been deleted may be resurrected causing anything from undesired results to database state corruption. Be especially careful about development or testing systems that only come online occasionally.

We could reduce the risk of corruption and resurrected entries after an extended downtime by rejecting databases that have not been synced for too long.

Basically, this would prevent a client that has not synced for longer than the deletion marker retention interval from syncing to the store. To rejoin, instead of keeping its current LMDB contents, it would have to reload them from snapshots in the store.

Implementation considerations

The fact that a snapshot from just within the retention interval has been loaded does not guarantee that this contains all changes up until that timestamp. We need an extra safety buffer, say 20% of the retention interval.

Consider what would happen if the whole installation is down for a while, for example because it is a dev or demo environment, or a sloppily managed personal deployment. In this case all instances may be considered outdated. Several potential solutions:

  • Provide a commandline flag to force a join ignoring the interval.
  • Temporarily disable the sweeper in the config to disable the check? Would not work if we record sweep clean cutoffs (see below).
  • Automatically purge the local LMDB data and load from store (would require opt-in config due to destructive nature)
  • Check the timestamp of the latest snapshot, and use that to decide if it is still safe to join. This could allow one instance to join after an extended downtime, but perhaps not the rest?

Remaining issues

  • The fact that a snapshot from just within the retention interval has been loaded does not guarantee that it contains all changes up until that timestamp, nor does an extra safety interval.
  • Changes to the retention interval over time are not taken into account, unless we explicitly record clean cutoffs in the LMDB. Perhaps we should.
  • You cannot tell with what retention interval the snapshots in the store have been created.
