feat(db): make dqlite "ready" timeout configurable via DQLITE_READY_TIMEOUT #748

Draft
mashanz wants to merge 1 commit into canonical:v4 from mashanz:configurable-dqlite-ready-timeout


mashanz commented Jun 3, 2026

Problem

DqliteDB.Open() waits a hardcoded 120s for dqlite to become ready (internal/db/db.go). On clusters whose dqlite database has grown large, a joining node cannot download + replay the state and reach Ready within that window, so cluster join fails:

Error: Failed to join cluster: Ready dqlite: context deadline exceeded

…while the leader logs Received error sending heartbeat … error="Database is still starting".

Concretely, we hit this on a MicroCeph (Squid) cluster: the microcluster dqlite DB had grown to ~126 MB (default raft snapshot trailing=8192 × MicroCeph's large config/OSD raft entries), and a 4th node could never join because the sync simply exceeds the timeout. (MicroCeph's dqlite.New(...) doesn't set WithSnapshotParams, so the trailing window uses go-dqlite defaults, which Canonical's own k8s-dqlite troubleshooting docs describe as "too large for small clusters".)

Change

Make the ready timeout overridable via a DQLITE_READY_TIMEOUT environment variable (a Go duration string, e.g. 30m), matching the existing env-var convention in internal/sys (DQLITE_SOCKET, STATE_DIR, …).

  • Default unchanged (120s).
  • Invalid / non-positive values are ignored with a warning via the existing db.log().Warn(...) path.

This is a minimal, backward-compatible lever: operators with a legitimately large dqlite DB (or a slow disk/link) can let a join complete instead of failing outright.

Notes

  • Happy to make this a typed config option instead of an env var if you'd prefer — env var was chosen to match the existing internal/sys pattern and require no API change.
  • A complementary fix would be to expose dqlite snapshot params (threshold/trailing) to bound DB growth, the way k8s-dqlite does via tuning.yaml. Glad to follow up.
  • I could not run the full CGO build locally (no libdqlite/dqlite.h toolchain); gofmt -l is clean and go build ./internal/sys/ is OK. db.go uses only symbols already imported in that file.

Related

canonical/microceph#476, #444, #473 — node-join failures with the same Ready dqlite: context deadline exceeded symptom.

…IMEOUT

Open() waits a hardcoded 2 minutes for dqlite to become ready. On clusters
whose dqlite database has grown large (e.g. the default raft snapshot trailing
of 8192 entries combined with large application entries), a joining node cannot
download and replay the state and reach ready within that window, so
`cluster join` fails with "Ready dqlite: context deadline exceeded" while the
leader logs "Database is still starting".

Allow operators to raise the wait via the DQLITE_READY_TIMEOUT environment
variable (a Go duration string, e.g. "30m"), matching the existing env-var
convention in internal/sys (DQLITE_SOCKET, STATE_DIR, ...). The default is
unchanged (120s); an invalid or non-positive value is ignored with a warning.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hans <hans@dalang.io>

Copilot AI left a comment


Pull request overview

This PR makes the dqlite “ready” wait in DqliteDB.Open() configurable via a DQLITE_READY_TIMEOUT environment variable, allowing operators to extend the startup/join window for large or slow-to-sync dqlite databases while keeping the default behavior unchanged.

Changes:

  • Add DQLITE_READY_TIMEOUT to internal/sys environment variable constants.
  • Parse DQLITE_READY_TIMEOUT as a Go duration string and use it to set the db.dqlite.Ready(...) context timeout (default remains 120s), warning and ignoring invalid/non-positive values.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| internal/sys/environment.go | Adds the DQLITE_READY_TIMEOUT environment variable constant and documentation comment. |
| internal/db/db.go | Reads/parses DQLITE_READY_TIMEOUT and uses it to control the dqlite readiness timeout with warnings on bad input. |


Comment thread internal/db/db.go
// raised via the DQLITE_READY_TIMEOUT environment variable, which helps when
// a joining node must sync a large dqlite database over a slow link or disk
// and would otherwise exceed the default.
readyTimeout := 120 * time.Second
Comment thread internal/db/db.go
Comment on lines +39 to +41
		db.log().Warn("Ignoring invalid DQLITE_READY_TIMEOUT", slog.String("value", v), slog.String("error", err.Error()))
	case parsed <= 0:
		db.log().Warn("Ignoring non-positive DQLITE_READY_TIMEOUT", slog.String("value", v))
Comment thread internal/db/db.go
Comment on lines +34 to +38
readyTimeout := 120 * time.Second
if v := os.Getenv(sys.DqliteReadyTimeout); v != "" {
	parsed, err := time.ParseDuration(v)
	switch {
	case err != nil:

sabaini commented Jun 3, 2026

Thanks @mashanz
I'm curious about your statement "MicroCeph's large config/OSD raft entries" -- what kind of large entries do you see? Just want to make sure we're not papering over some underlying issue.


3 participants