feat(db): make dqlite "ready" timeout configurable via DQLITE_READY_TIMEOUT #748
Draft
mashanz wants to merge 1 commit into
…IMEOUT

Open() waits a hardcoded 2 minutes for dqlite to become ready. On clusters whose dqlite database has grown large (e.g. the default raft snapshot trailing of 8192 entries combined with large application entries), a joining node cannot download and replay the state and reach ready within that window, so `cluster join` fails with "Ready dqlite: context deadline exceeded" while the leader logs "Database is still starting".

Allow operators to raise the wait via the DQLITE_READY_TIMEOUT environment variable (a Go duration string, e.g. "30m"), matching the existing env-var convention in internal/sys (DQLITE_SOCKET, STATE_DIR, ...). The default is unchanged (120s); an invalid or non-positive value is ignored with a warning.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hans <hans@dalang.io>
Pull request overview
This PR makes the dqlite “ready” wait in DqliteDB.Open() configurable via a DQLITE_READY_TIMEOUT environment variable, allowing operators to extend the startup/join window for large or slow-to-sync dqlite databases while keeping the default behavior unchanged.
Changes:
- Add `DQLITE_READY_TIMEOUT` to `internal/sys` environment variable constants.
- Parse `DQLITE_READY_TIMEOUT` as a Go duration string and use it to set the `db.dqlite.Ready(...)` context timeout (default remains 120s), warning and ignoring invalid/non-positive values.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| internal/sys/environment.go | Adds the DQLITE_READY_TIMEOUT environment variable constant and documentation comment. |
| internal/db/db.go | Reads/parses DQLITE_READY_TIMEOUT and uses it to control the dqlite readiness timeout with warnings on bad input. |
    // raised via the DQLITE_READY_TIMEOUT environment variable, which helps when
    // a joining node must sync a large dqlite database over a slow link or disk
    // and would otherwise exceed the default.
    readyTimeout := 120 * time.Second
Comment on lines +39 to +41

            db.log().Warn("Ignoring invalid DQLITE_READY_TIMEOUT", slog.String("value", v), slog.String("error", err.Error()))
        case parsed <= 0:
            db.log().Warn("Ignoring non-positive DQLITE_READY_TIMEOUT", slog.String("value", v))
Comment on lines +34 to +38

    readyTimeout := 120 * time.Second
    if v := os.Getenv(sys.DqliteReadyTimeout); v != "" {
        parsed, err := time.ParseDuration(v)
        switch {
        case err != nil:
Thanks @mashanz
Problem

`DqliteDB.Open()` waits a hardcoded 120s for dqlite to become ready (`internal/db/db.go`). On clusters whose dqlite database has grown large, a joining node cannot download and replay the state and reach `Ready` within that window, so `cluster join` fails: … while the leader logs `Received error sending heartbeat … error="Database is still starting"`.

Concretely, we hit this on a MicroCeph (Squid) cluster: the microcluster dqlite DB had grown to ~126 MB (default raft snapshot `trailing=8192` x MicroCeph's large config/OSD raft entries), and a 4th node could never join; the sync simply exceeds the timeout. (MicroCeph's `dqlite.New(...)` doesn't set `WithSnapshotParams`, so the trailing window uses go-dqlite defaults, which Canonical's own k8s-dqlite troubleshooting docs describe as "too large for small clusters".)

Change

Make the ready timeout overridable via a `DQLITE_READY_TIMEOUT` environment variable (a Go duration string, e.g. `30m`), matching the existing env-var convention in `internal/sys` (`DQLITE_SOCKET`, `STATE_DIR`, …).

- … `db.log().Warn(...)` path.

A minimal, backward-compatible lever so operators with a legitimately large dqlite DB (or a slow disk/link) can let a join complete instead of failing outright.

Notes

- … `internal/sys` pattern and require no API change.
- … `threshold`/`trailing`) to bound DB growth, the way k8s-dqlite does via `tuning.yaml`. Glad to follow up.
- … `libdqlite`/`dqlite.h` toolchain); `gofmt -l` is clean and `go build ./internal/sys/` is OK. `db.go` uses only symbols already imported in that file.

Related

canonical/microceph#476, #444, #473: node-join failures with the same `Ready dqlite: context deadline exceeded` symptom.