Make the cluster-join "Ready dqlite" timeout configurable (joins fail when the dqlite DB is large) #754

@mashanz

Summary

microceph cluster join fails with Error: Failed to join cluster: Ready dqlite: context deadline exceeded on clusters whose microcluster dqlite database has grown large. The joining node cannot download + replay the dqlite state and reach "ready" within the framework's hardcoded ready timeout, so the join times out and rolls back.

Request: expose a way to increase the join/ready timeout (CLI flag, env var, or config) for clusters with a large dqlite DB and/or slow disks — and, ideally, also let operators bound the DB size via dqlite snapshot params (see "Related" below).

Environment

  • MicroCeph 19.2.3 (snap rev 1701, squid/stable), Ceph 19.2.3 Squid
  • Ubuntu 24.04, kernel 6.8
  • Existing 3-node cluster (mon+osd), adding a 4th storage node
  • 1 GbE management network; OSDs on SATA SSDs

What happens

On the joining node:

  $ sudo microceph cluster join <token> --microceph-ip <addr> 
  Error: Failed to join cluster: Ready dqlite: context deadline exceeded

Joining-node daemon log (snap.microceph.daemon): PreInit → ~31s of silence → PreRemove (force=true) (rollback).

Leader-side daemon log during the attempt:

  level=error msg="Received error sending heartbeat to cluster member"
    error="Database is still starting" target="<joiner>:7443"
  level=warning msg="Failed to get status of cluster member ... /core/1.0/ready ... connect: connection refused"

The joiner's dqlite never finishes starting within the window.

Root cause

The microcluster dqlite DB is ~126 MB on every member (raft log segments):

  $ sudo du -sh /var/snap/microceph/common/state/database/
  126M    .../database/
  # ~25 raft segment files of 4–8 MB each (open-* and <index>-<index>)

On join, the new member must receive + apply this state and reach "ready" within a hardcoded deadline:

  • microcluster/internal/db/db.go wraps the ready-wait in a fixed context.WithTimeout(...) (currently 120*time.Second on main; the version vendored in MicroCeph 19.2.3 behaves as ~30s in our testing — PreInit→rollback in ~31s). There is no flag/env/config to change it.

The DB is large because microcluster never sets dqlite snapshot params:

  • microcluster/internal/db/dqlite.go: both dqlite.New(...) calls omit WithSnapshotParams, so go-dqlite defaults apply (threshold=1024, trailing=8192). Canonical's own k8s-dqlite docs call these defaults "too large for small clusters". MicroCeph writes large config/OSD entries to raft, so 8192 trailing entries ≈ 126 MB.
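As a rough check on those numbers, the arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch only: it assumes the whole 126 MiB directory is trailing log entries of uniform size, which ignores snapshot files and the extra entries that accumulate between snapshots. The function names are made up for illustration.

```go
package main

import "fmt"

// avgEntryBytes estimates the mean on-disk size of one raft log entry,
// assuming the database directory is dominated by the trailing log.
func avgEntryBytes(dbBytes, trailing uint64) uint64 {
	if trailing == 0 {
		return 0
	}
	return dbBytes / trailing
}

// projectedLogBytes estimates the retained log size for another trailing value,
// assuming the average entry size stays the same.
func projectedLogBytes(entryBytes, trailing uint64) uint64 {
	return entryBytes * trailing
}

func main() {
	const mib = 1 << 20
	dbBytes := uint64(126 * mib) // size reported by du -sh in this issue

	entry := avgEntryBytes(dbBytes, 8192)
	fmt.Printf("average entry: ~%.2f KiB\n", float64(entry)/1024)
	// prints: average entry: ~15.75 KiB

	projected := projectedLogBytes(entry, 1024)
	fmt.Printf("projected log at trailing=1024: ~%.2f MiB\n", float64(projected)/mib)
	// prints: projected log at trailing=1024: ~15.75 MiB
}
```

Under these assumptions, cutting trailing from 8192 to 1024 would shrink the retained log by roughly 8x.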

Why there's no workaround today

  • The join/ready timeout is a compiled-in constant — not exposed on microceph cluster join (--help has only --microceph-ip, --debug, --verbose, --state-dir), not in ceph config, not env-driven.
  • The dqlite trailing window is not exposed either (no WithSnapshotParams, no tuning.yaml).
  • Result: once replaying the dqlite DB on a joiner takes longer than the compiled-in timeout, the cluster cannot add a node, and operators have no remedy.

Request

  1. Make the join/ready timeout configurable — e.g. microceph cluster join --timeout, a daemon config key, or an env var — so large-DB / slow-disk joins can complete.

  2. (Complementary) expose dqlite snapshot params (threshold/trailing) so operators can bound the DB size, the same way k8s-dqlite already does via tuning.yaml:

    snapshot:
      trailing: 1024
      threshold: 512
    

    (Note k8s-dqlite's caveat: set both — setting trailing alone forces threshold=0, snapshotting every transaction.)

Related / precedent
