Make the cluster-join "Ready dqlite" timeout configurable (joins fail when the dqlite DB is large)

 ### Summary
  
  `microceph cluster join` fails with `Error: Failed to join cluster: Ready dqlite: context deadline exceeded` on clusters whose microcluster **dqlite database has grown large**. The joining node cannot download and replay the dqlite state and reach "ready" within the framework's **hardcoded** ready timeout, so the join times out and rolls back.
  
  **Request:** expose a way to **increase the join/ready timeout** (CLI flag, env var, or config) for clusters with a large dqlite DB and/or slow disks — and, ideally, also let operators **bound the DB size** via dqlite snapshot params (see "Related" below).
  
  ### Environment
      
  - MicroCeph **19.2.3** (snap rev **1701**, `squid/stable`), Ceph 19.2.3 Squid
  - Ubuntu 24.04, kernel 6.8
  - Existing 3-node cluster (mon+osd), adding a **4th** storage node
  - 1 GbE management network; OSDs on SATA SSDs

  ### What happens

  On the joining node:

      $ sudo microceph cluster join <token> --microceph-ip <addr> 
      Error: Failed to join cluster: Ready dqlite: context deadline exceeded

  Joining-node daemon log (`snap.microceph.daemon`): `PreInit` → ~31s of silence → `PreRemove (force=true)` (rollback).

  Leader-side daemon log during the attempt:
  
      level=error msg="Received error sending heartbeat to cluster member"
        error="Database is still starting" target="<joiner>:7443"
      level=warning msg="Failed to get status of cluster member ... /core/1.0/ready ... connect: connection refused"

  The joiner's dqlite never finishes starting within the window.
  
  ### Root cause
         
  The microcluster **dqlite DB is ~126 MB** on every member (raft log segments):
           
      $ sudo du -sh /var/snap/microceph/common/state/database/
      126M    .../database/
      # ~25 raft segment files of 4–8 MB each (open-* and <index>-<index>)
  
  On join, the new member must receive and apply this state and reach "ready" within a **hardcoded** deadline:
  
  - `microcluster/internal/db/db.go` wraps the ready-wait in a fixed `context.WithTimeout(...)` (currently `120*time.Second` on `main`; the version vendored in MicroCeph 19.2.3 appears to use roughly 30s, since we measured ~31s from PreInit to rollback). There is **no flag, env var, or config key** to change it.

  The DB is large because microcluster never sets dqlite snapshot params:
  
  - `microcluster/internal/db/dqlite.go`: both `dqlite.New(...)` calls omit `WithSnapshotParams`, so go-dqlite defaults apply (`threshold=1024`, `trailing=8192`). Canonical's own k8s-dqlite docs call these defaults ["too large for small clusters"](https://documentation.ubuntu.com/canonical-kubernetes/release-1.32/snap/reference/troubleshooting/). MicroCeph writes large config/OSD entries to raft, so 8192 trailing entries ≈ 126 MB.
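
The size relationship above can be checked with back-of-the-envelope arithmetic. This is a rough sketch under two assumptions: the `126M` from `du -sh` is read as MiB, and on-disk size scales linearly with the number of retained entries (it ignores entries accumulated between snapshots). The helper names are hypothetical, and the 1024 value is only an illustrative smaller window.

```go
package main

import "fmt"

const MiB = 1 << 20

// bytesPerEntry estimates the average raft entry size from an observed
// on-disk size and the number of trailing entries retained.
func bytesPerEntry(dbBytes, trailing uint64) uint64 {
	return dbBytes / trailing
}

// estimatedDBBytes projects the on-disk size for a different trailing value,
// assuming size grows linearly with the number of retained entries.
func estimatedDBBytes(perEntry, trailing uint64) uint64 {
	return perEntry * trailing
}

func main() {
	observed := uint64(126 * MiB) // du -sh output, read as MiB (assumption)

	perEntry := bytesPerEntry(observed, 8192)
	fmt.Printf("average entry: %d bytes\n", perEntry)

	projected := estimatedDBBytes(perEntry, 1024)
	fmt.Printf("trailing=1024: ~%.2f MiB\n", float64(projected)/MiB)
}
```

This works out to about 16128 bytes (roughly 15.75 KiB) per entry, so a trailing window of 1024 would retain on the order of 15.75 MiB instead of 126 MiB.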
  
  ### Why there's no workaround today

  - The join/ready timeout is a compiled-in constant — not exposed on `microceph cluster join` (`--help` has only `--microceph-ip`, `--debug`, `--verbose`, `--state-dir`), not in `ceph config`, not env-driven.
  - The dqlite trailing window is not exposed either (no `WithSnapshotParams`, no `tuning.yaml`).
  - Result: a cluster with a legitimately large dqlite DB can never add a node, with no operator-facing remedy.
  
  ### Request
  
  1. **Make the join/ready timeout configurable** — e.g. `microceph cluster join --timeout`, a daemon config key, or an env var — so large-DB / slow-disk joins can complete.
  2. **(Complementary) expose dqlite snapshot params** (threshold/trailing) so operators can bound the DB size, the same way k8s-dqlite already does via `tuning.yaml`:

         snapshot:
           trailing: 1024
           threshold: 512

     (Note k8s-dqlite's caveat: set **both** — setting `trailing` alone forces `threshold=0`, snapshotting every transaction.)
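
If snapshot params were exposed, the "set both" caveat could be enforced at config-load time rather than left to documentation. The following is a sketch only: `SnapshotTuning`, `Resolve`, and the error values are hypothetical names, and the defaults are the go-dqlite values cited earlier in this issue. Rejecting an explicit zero threshold is an extra safety choice of this sketch, not something k8s-dqlite is known to do.

```go
package main

import (
	"errors"
	"fmt"
)

// Defaults this issue attributes to go-dqlite.
const (
	defaultThreshold uint64 = 1024
	defaultTrailing  uint64 = 8192
)

// SnapshotTuning is a hypothetical config struct mirroring the two keys above.
// Pointers distinguish "unset" from an explicit zero.
type SnapshotTuning struct {
	Trailing  *uint64
	Threshold *uint64
}

var (
	errPartial       = errors.New("snapshot: set both trailing and threshold, or neither")
	errZeroThreshold = errors.New("snapshot: threshold must be greater than zero")
)

// Resolve returns the effective (threshold, trailing) pair. It refuses a
// half-specified config so that a lone trailing value can never silently
// turn into threshold=0, i.e. a snapshot on every transaction.
func (t SnapshotTuning) Resolve() (threshold, trailing uint64, err error) {
	switch {
	case t.Trailing == nil && t.Threshold == nil:
		return defaultThreshold, defaultTrailing, nil
	case t.Trailing == nil || t.Threshold == nil:
		return 0, 0, errPartial
	case *t.Threshold == 0:
		return 0, 0, errZeroThreshold
	}
	return *t.Threshold, *t.Trailing, nil
}

func main() {
	trailing, threshold := uint64(1024), uint64(512)

	th, tr, err := SnapshotTuning{Trailing: &trailing, Threshold: &threshold}.Resolve()
	fmt.Println(th, tr, err)

	_, _, err = SnapshotTuning{Trailing: &trailing}.Resolve()
	fmt.Println(err)
}
```

The resolved pair could then be handed to whatever option the database layer uses for snapshot params; that wiring is left out because it depends on how microcluster constructs its dqlite instance.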

  ### Related / precedent
  
  - k8s-dqlite solved the DB-size side via `tuning.yaml` (link above).
  - canonical/microceph #476, #444, #473 (node-join failures) — same `Ready dqlite: context deadline exceeded` symptom; PR #710 fixed an address-selection variant but not the large-DB/timeout case.
