Ruler enters permanent CrashLoopBackOff when ring entry has a future `last_heartbeat_at` - CAS no-change treated as fatal

**Describe the bug**

Our `loki-ruler-0` entered a permanent CrashLoopBackOff. The pod starts, runs for ~10 seconds, then exits with `Error`. No Loki configuration was changed prior to the incident.

The ring entry for `loki-ruler-0` in the distributed KV store (`rulers/rulers`) has a corrupted `last_heartbeat_at` timestamp far in the future (`2034-10-29 10:37:51 UTC`), while `registered_at` is correct (`2026-04-27`). The origin of this corruption is unknown — the most likely trigger is an abrupt pod termination that prevented a clean deregistration, possibly causing an integer overflow or unit mismatch during ring entry serialization. However, this is a hypothesis; no core dump or pre-crash ring snapshot was available.

Regardless of the root cause of the corruption, **Loki's behavior in the presence of such an entry is the bug**: the ruler treats a CAS no-change result as a fatal error and crashes, with no ability to recover or overwrite the corrupted entry.

On every subsequent restart, the ruler attempts to register itself via a CAS (Compare-And-Swap) operation. Because the corrupted entry is byte-for-byte identical to what the ruler wants to write, the KV store returns `no change detected`, the ruler module transitions to a `Failed` state, and the pod exits.

```
register instance in the ring: failed to CAS-update key rulers/rulers: no change detected
```

We tried to forget the impacted instance through the `/ruler/ring` endpoint from a live Loki ruler pod. The HTTP call returned successfully, but had no lasting effect: the corrupted entry reappeared immediately, re-propagated by the other cluster members before the ruler pod could re-register. Deleting and recreating the pod, as well as deleting the PVC, also had no effect.


**To Reproduce**

We were unable to reliably reproduce this in a controlled environment. The corruption occurred in production and we do not know the exact sequence that produced the corrupted `last_heartbeat_at`. However, the second half of the failure (the crash loop itself) is deterministic once the ring is in the bad state.

The following log line appears several times (probably due to retries):

```json
{"caller":"basic_lifecycler.go:322","instance":"loki-ruler-0","last_heartbeat_at":"2034-10-29 10:37:51 +0000 UTC","level":"info","msg":"instance found in the ring","registered_at":"2026-04-27 13:02:15 +0000 UTC","ring":"ruler","state":"ACTIVE","tokens":128,"ts":"2026-04-29T14:26:57.540728444Z"}
```

Then:

```json
{"caller":"loki.go:631","error":"starting module ruler: invalid service state: Failed, expected: Running, failure: unable to start ruler subservices: not healthy, 0 terminated, 1 failed: [register instance in the ring: failed to CAS-update key rulers/rulers: no change detected]","level":"error","module":"ruler","msg":"module failed","ts":"2026-04-29T14:26:57.540994801Z"}
```

And finally:

```json
{"caller":"log.go:223","err":"failed services\ngithub.com/grafana/loki/v3/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:673\nmain.main\n\t/src/loki/cmd/loki/main.go:149\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:290\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1447","level":"error","msg":"error running loki","ts":"2026-04-29T14:26:57.96926475Z"}
```

**Expected behavior**

Either:
- The ruler should detect a stale/corrupted ring entry (e.g., `last_heartbeat_at` in the future) and overwrite it unconditionally on startup, or
- The CAS logic should not treat an identical write as a fatal error — a no-op CAS on startup should be handled gracefully (e.g., treated as a successful registration), or
- Forgetting an instance in the ring should write a cluster-wide tombstone that survives gossip re-propagation, so that the forgotten entry cannot be reintroduced by other members.

**Workaround**

Changing the ruler ring key prefix forces the ruler to register under a new path, bypassing the corrupted entry entirely:

```
-ruler.ring.prefix=<new-prefix>/
```

This resolves the CrashLoopBackOff immediately but orphans the old corrupted key in the KV store.

**Environment**

- Loki version: `3.7.1`
- Infrastructure: Kubernetes (EKS)
- Deployment tool: Helm (StatefulSet, ruler deployed as separate component)
- Ring backend: memberlist (gossip, in-memory, ~74 nodes)

<details>
<summary>Relevant configuration (anonymized)</summary>

```yaml
target: ruler

ruler:
  enable_sharding: true
  sharding_strategy: default
  sharding_algo: by-group
  evaluation:
    mode: remote
    query_frontend:
      address: dns:///loki-ruler-query-frontend.loki.svc.cluster.local.:9095
  ring:
    kvstore:
      store: memberlist
      prefix: rulers/
    heartbeat_period: 5s
    heartbeat_timeout: 1m0s
    num_tokens: 128
    instance_id: loki-ruler-0
    instance_interface_names: [eth0, lo]

memberlist:
  randomize_node_name: true
  gossip_interval: 200ms
  gossip_nodes: 3
  pull_push_interval: 30s
  obsolete_entries_timeout: 30s
  left_ingesters_timeout: 5m0s
  leave_timeout: 20s
  broadcast_timeout_for_local_updates_on_shutdown: 10s
  join_members: loki-memberlist.loki.svc.cluster.local
  cluster_label: <cluster-label>
  advertise_addr: <pod-ip>
  bind_port: 7946
```

</details>

Ruler enters permanent CrashLoopBackOff when ring entry has a future `last_heartbeat_at` - CAS no-change treated as fatal #21733
