ring: treat CAS no-change as success during instance registration#1000
Open
Krishnachaitanyakc wants to merge 1 commit into
Open
ring: treat CAS no-change as success during instance registration#1000Krishnachaitanyakc wants to merge 1 commit into
Krishnachaitanyakc wants to merge 1 commit into
Conversation
When the KV store returns "no change detected" during instance registration, the ring entry already matches what the lifecycler tried to write. This is benign — but the current code treats it as a fatal error, which causes a permanent CrashLoopBackOff when the stored timestamp is ahead of the current time (e.g., due to clock corruption). In this scenario, the merge function never sees forward progress because time.Now() < stored_timestamp, so every CAS retry exhausts and the error surfaces as: register instance in the ring: failed to CAS-update key ...: no change detected The ruler (or any other ring member) then fails to start on every restart. Handle this gracefully: if the CAS error contains "no change detected" during registration, log a warning and proceed. The instance is already registered with the desired state; there is nothing to fix. Fixes grafana/loki#21733
|
|
1 similar comment
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When the memberlist KV store returns "no change detected" during instance registration, the ring entry already matches what the lifecycler tried to write. The current code treats this as fatal, causing a permanent CrashLoopBackOff.
This happens when a ring entry has a corrupted
last_heartbeat_attimestamp far in the future. The merge function compares timestamps and never sees forward progress (time.Now() < stored_timestamp), so every CAS retry returns "no change detected." The ruler (or any ring member) then fails to start on every restart with:The fix: treat "no change detected" as a warning during registration, not a fatal error. The instance is already registered with the desired state.
Fixes grafana/loki#21733
Note: after this dskit change is merged and released, Loki will need a vendor bump to pick it up.
Note
Medium Risk
It changes startup failure semantics for ring registration and relies on substring matching of KV errors, which could mask unrelated "no change detected" failures if that text appears elsewhere.
Overview
BasicLifecycler.registerInstanceno longer fails startup when the ring KV CAS returns "no change detected". That case is treated as successful registration: the code logs a warning and continues updating local state and metrics, instead of returning an error.This avoids a restart loop when memberlist merge sees no forward progress (e.g. a
last_heartbeat_atstored ahead of wall clock) while the instance is already in the desired ring state (grafana/loki#21733). The change adds astringsimport for error substring matching only on this path; other CAS errors still propagate.Reviewed by Cursor Bugbot for commit 963b0c8. Bugbot is set up for automated code reviews on this repo. Configure here.