
ring: treat CAS no-change as success during instance registration #1000

Open
Krishnachaitanyakc wants to merge 1 commit into grafana:main from Krishnachaitanyakc:fix/loki-21733-cas-no-change-registration

Conversation

@Krishnachaitanyakc Krishnachaitanyakc commented Jun 7, 2026

When the memberlist KV store returns "no change detected" during instance registration, the ring entry already matches what the lifecycler tried to write. The current code treats this as fatal, causing a permanent CrashLoopBackOff.

This happens when a ring entry has a corrupted last_heartbeat_at timestamp far in the future. The merge function compares timestamps and never sees forward progress (time.Now() < stored_timestamp), so every CAS retry returns "no change detected." The ruler (or any ring member) then fails to start on every restart with:

register instance in the ring: failed to CAS-update key rulers/rulers: no change detected

The fix: treat "no change detected" as a warning during registration, not a fatal error. The instance is already registered with the desired state.

Fixes grafana/loki#21733

Note: after this dskit change is merged and released, Loki will need a vendor bump to pick it up.


Note

Medium Risk
It changes startup failure semantics for ring registration and relies on substring matching of KV errors, which could mask unrelated "no change detected" failures if that text appears elsewhere.

Overview
BasicLifecycler.registerInstance no longer fails startup when the ring KV CAS returns "no change detected". That case is treated as successful registration: the code logs a warning and continues updating local state and metrics, instead of returning an error.

This avoids a restart loop when memberlist merge sees no forward progress (e.g. a last_heartbeat_at stored ahead of wall clock) while the instance is already in the desired ring state (grafana/loki#21733). The change adds a strings import for error substring matching only on this path; other CAS errors still propagate.

Reviewed by Cursor Bugbot for commit 963b0c8.

When the KV store returns "no change detected" during instance
registration, the ring entry already matches what the lifecycler tried
to write. This is benign — but the current code treats it as a fatal
error, which causes a permanent CrashLoopBackOff when the stored
timestamp is ahead of the current time (e.g., due to clock corruption).

In this scenario, the merge function never sees forward progress because
time.Now() < stored_timestamp, so the CAS retries are exhausted and the
error surfaces as:

  register instance in the ring: failed to CAS-update key ...: no change detected

The ruler (or any other ring member) then fails to start on every
restart.

Handle this gracefully: if the CAS error contains "no change detected"
during registration, log a warning and proceed. The instance is already
registered with the desired state; there is nothing to fix.

Fixes grafana/loki#21733

cla-assistant Bot commented Jun 7, 2026

CLA assistant check
All committers have signed the CLA.

cla-assistant Bot commented Jun 7, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Development

Successfully merging this pull request may close these issues.

Ruler enters permanent CrashLoopBackOff when ring entry has a future last_heartbeat_at - CAS no-change treated as fatal

1 participant