Describe the bug
Our loki-ruler-0 entered a permanent CrashLoopBackOff. The pod starts, runs for ~10 seconds, then exits with Error. No Loki configuration was changed prior to the incident.
The ring entry for loki-ruler-0 in the distributed KV store (rulers/rulers) has a corrupted last_heartbeat_at timestamp far in the future (2034-10-29 10:37:51 UTC), while registered_at is correct (2026-04-27). The origin of this corruption is unknown — the most likely trigger is an abrupt pod termination that prevented a clean deregistration, possibly causing an integer overflow or unit mismatch during ring entry serialization. However, this is a hypothesis; no core dump or pre-crash ring snapshot was available.
Regardless of the root cause of the corruption, Loki's behavior in the presence of such an entry is the bug: the ruler treats a CAS no-change result as a fatal error and crashes, with no ability to recover or overwrite the corrupted entry.
On every subsequent restart, the ruler attempts to register itself via a CAS (Compare-And-Swap) operation. Because the corrupted entry is byte-for-byte identical to what the ruler wants to write, the KV store returns no change detected, the ruler module transitions to a Failed state, and the pod exits.
register instance in the ring: failed to CAS-update key rulers/rulers: no change detected
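The failure mode described above can be modeled with a toy in-memory store. This is a minimal sketch, not Loki's or the memberlist client's code: the `entry`, `kv`, `cas`, and `registerStrict` names are hypothetical, and the rule "reject a write identical to the stored value" is an assumption taken from the error message.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoChange mirrors the "no change detected" message from the report.
var errNoChange = errors.New("no change detected")

// entry is a simplified stand-in for one instance's ring record (hypothetical).
type entry struct {
	State         string
	HeartbeatUnix int64 // last_heartbeat_at as Unix seconds
}

// kv is a toy single-key store; it is NOT the memberlist KV client.
type kv struct{ current entry }

// cas applies f to the stored value and rejects a result identical to it.
func (s *kv) cas(f func(entry) entry) error {
	next := f(s.current)
	if next == s.current {
		return errNoChange
	}
	s.current = next
	return nil
}

// registerStrict reuses the entry it finds and propagates every CAS error,
// including the no-op case, which is what makes startup fail in this model.
func registerStrict(s *kv) error {
	err := s.cas(func(e entry) entry { return e })
	if err != nil {
		return fmt.Errorf("register instance in the ring: %w", err)
	}
	return nil
}

func main() {
	corrupted := time.Date(2034, 10, 29, 10, 37, 51, 0, time.UTC).Unix()
	s := &kv{current: entry{State: "ACTIVE", HeartbeatUnix: corrupted}}
	fmt.Println(registerStrict(s))
}
```

In this model every restart takes the same path, which matches the deterministic crash loop we observe once the entry is in place.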
We tried to forget the impacted instance through the /ruler/ring endpoint from a live Loki ruler pod. The HTTP call returned successfully, but had no lasting effect: the corrupted entry reappeared immediately, re-propagated by the other cluster members before the ruler pod could re-register. Deleting and recreating the pod, as well as deleting the PVC, also had no effect.
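One possible explanation for the forget not sticking, sketched under an assumption we have not verified: if gossip merges resolve conflicts by keeping the record with the newer timestamp, a tombstone stamped with the current time can never win against a heartbeat dated 2034. The `record` type and `merge` function below are hypothetical and only illustrate that rule.

```go
package main

import (
	"fmt"
	"time"
)

// record is a simplified ring entry; Timestamp stands in for last_heartbeat_at.
type record struct {
	State     string
	Timestamp int64 // Unix seconds
}

// merge keeps whichever record carries the newer timestamp.
// ASSUMPTION: this last-timestamp-wins rule is a toy model, not Loki's code.
func merge(local, incoming record) (record, bool) {
	if incoming.Timestamp > local.Timestamp {
		return incoming, true
	}
	return local, false
}

func main() {
	corrupted := record{
		State:     "ACTIVE",
		Timestamp: time.Date(2034, 10, 29, 10, 37, 51, 0, time.UTC).Unix(),
	}
	// A forget issued at the time of the incident, stamped with "now".
	forget := record{
		State:     "LEFT",
		Timestamp: time.Date(2026, 4, 29, 14, 26, 57, 0, time.UTC).Unix(),
	}
	got, changed := merge(corrupted, forget)
	fmt.Println(got.State, changed) // prints "ACTIVE false"
}
```

If the real merge behaves like this, any node still holding the corrupted record would keep it and gossip it back, which is consistent with what we saw.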
To Reproduce
We were unable to reliably reproduce this in a controlled environment. The corruption occurred in production and we do not know the exact sequence that produced the corrupted last_heartbeat_at. However, the second half of the failure (the crash loop itself) is deterministic once the ring is in the bad state:
The following log line appears several times (probably due to retries):
{"caller":"basic_lifecycler.go:322","instance":"loki-ruler-0","last_heartbeat_at":"2034-10-29 10:37:51 +0000 UTC","level":"info","msg":"instance found in the ring","registered_at":"2026-04-27 13:02:15 +0000 UTC","ring":"ruler","state":"ACTIVE","tokens":128,"ts":"2026-04-29T14:26:57.540728444Z"}
Then:
{"caller":"loki.go:631","error":"starting module ruler: invalid service state: Failed, expected: Running, failure: unable to start ruler subservices: not healthy, 0 terminated, 1 failed: [register instance in the ring: failed to CAS-update key rulers/rulers: no change detected]","level":"error","module":"ruler","msg":"module failed","ts":"2026-04-29T14:26:57.540994801Z"}
And finally:
{"caller":"log.go:223","err":"failed services\ngithub.com/grafana/loki/v3/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:673\nmain.main\n\t/src/loki/cmd/loki/main.go:149\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:290\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1447","level":"error","msg":"error running loki","ts":"2026-04-29T14:26:57.96926475Z"}
Expected behavior
Either:
- The ruler should detect a stale/corrupted ring entry (e.g., last_heartbeat_at in the future) and overwrite it unconditionally on startup, or
- The CAS logic should not treat an identical write as a fatal error — a no-op CAS on startup should be handled gracefully (e.g., treated as a successful registration), or
- Forgetting an instance in the ring should be resilient enough to perform a cluster-wide tombstone that survives gossip re-propagation.
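The first two options could look roughly like the sketch below. It is illustrative only: `heartbeatInFuture`, `tolerateNoChange`, and the package-level variables are hypothetical names, the one-minute skew allowance is an arbitrary choice, and `errNoChange` is a local stand-in for whatever error the KV client actually returns.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoChange is a local stand-in for the KV client's "no change" error.
var errNoChange = errors.New("no change detected")

// Timestamps taken from the logs in this report.
var (
	reportedHeartbeat = time.Date(2034, 10, 29, 10, 37, 51, 0, time.UTC)
	reportedNow       = time.Date(2026, 4, 29, 14, 26, 57, 0, time.UTC)
)

// heartbeatInFuture reports whether a stored heartbeat is later than now
// by more than the allowed clock skew, which marks the entry as suspect.
func heartbeatInFuture(lastHeartbeat, now time.Time, skew time.Duration) bool {
	return lastHeartbeat.After(now.Add(skew))
}

// tolerateNoChange treats a no-op CAS as a successful registration and
// passes every other error through unchanged.
func tolerateNoChange(err error) error {
	if errors.Is(err, errNoChange) {
		return nil
	}
	return err
}

func main() {
	// Option 1: flag the entry so startup can overwrite it.
	fmt.Println(heartbeatInFuture(reportedHeartbeat, reportedNow, time.Minute))

	// Option 2: a wrapped no-op CAS error no longer fails startup.
	wrapped := fmt.Errorf("failed to CAS-update key rulers/rulers: %w", errNoChange)
	fmt.Println(tolerateNoChange(wrapped))
}
```

Either check alone would have let the ruler start in our case. The first also repairs the bad timestamp, while the second leaves it in the ring.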
Workaround
Changing the ruler ring key prefix forces the ruler to register under a new path, bypassing the corrupted entry entirely:
-ruler.ring.prefix=<new-prefix>/
This resolves the CrashLoopBackOff immediately but orphans the old corrupted key in the KV store.
Environment
- Loki version: 3.7.1
- Infrastructure: Kubernetes (EKS)
- Deployment tool: Helm (StatefulSet, ruler deployed as separate component)
- Ring backend: memberlist (gossip, in-memory, ~74 nodes)
Relevant configuration (anonymized)
target: ruler
ruler:
  enable_sharding: true
  sharding_strategy: default
  sharding_algo: by-group
  evaluation:
    mode: remote
    query_frontend:
      address: dns:///loki-ruler-query-frontend.loki.svc.cluster.local.:9095
  ring:
    kvstore:
      store: memberlist
      prefix: rulers/
    heartbeat_period: 5s
    heartbeat_timeout: 1m0s
    num_tokens: 128
    instance_id: loki-ruler-0
    instance_interface_names: [eth0, lo]
memberlist:
  randomize_node_name: true
  gossip_interval: 200ms
  gossip_nodes: 3
  pull_push_interval: 30s
  obsolete_entries_timeout: 30s
  left_ingesters_timeout: 5m0s
  leave_timeout: 20s
  broadcast_timeout_for_local_updates_on_shutdown: 10s
  join_members: loki-memberlist.loki.svc.cluster.local
  cluster_label: <cluster-label>
  advertise_addr: <pod-ip>
  bind_port: 7946