Ruler enters permanent CrashLoopBackOff when ring entry has a future last_heartbeat_at - CAS no-change treated as fatal #21733

@thomas-gouveia

Describe the bug

Our loki-ruler-0 entered a permanent CrashLoopBackOff. The pod starts, runs for ~10 seconds, then exits with Error. No Loki configuration was changed prior to the incident.

The ring entry for loki-ruler-0 in the distributed KV store (rulers/rulers) has a corrupted last_heartbeat_at timestamp far in the future (2034-10-29 10:37:51 UTC), while registered_at is correct (2026-04-27). The origin of this corruption is unknown — the most likely trigger is an abrupt pod termination that prevented a clean deregistration, possibly causing an integer overflow or unit mismatch during ring entry serialization. However, this is a hypothesis; no core dump or pre-crash ring snapshot was available.

Regardless of the root cause of the corruption, Loki's behavior in the presence of such an entry is the bug: the ruler treats a CAS no-change result as a fatal error and crashes, with no ability to recover or overwrite the corrupted entry.

On every subsequent restart, the ruler attempts to register itself via a CAS (Compare-And-Swap) operation. Because the corrupted entry is byte-for-byte identical to what the ruler wants to write, the KV store returns "no change detected", the ruler module transitions to the Failed state, and the pod exits with:

register instance in the ring: failed to CAS-update key rulers/rulers: no change detected
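
To make the failure mode concrete, here is a minimal, self-contained sketch of the behavior described above. It is a toy model, not Loki or dskit code: the `store`, `entry`, `cas`, and `registerStrict` names are hypothetical, the Unix-seconds heartbeat field is an assumption, and the real memberlist change detection is likely more involved than a plain equality check.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoChange is a hypothetical stand-in for the store's "no change detected" error.
var errNoChange = errors.New("no change detected")

// entry is a simplified, hypothetical ring entry (not the real ring type).
type entry struct {
	State             string
	LastHeartbeatUnix int64 // assumed to be Unix seconds
}

// store is a toy in-memory KV store holding one ring entry per key.
type store struct {
	data map[string]entry
}

// cas applies update to the current value and rejects the write when the
// result equals what is already stored.
func (s *store) cas(key string, update func(cur entry) entry) error {
	cur := s.data[key]
	next := update(cur)
	if next == cur {
		return fmt.Errorf("failed to CAS-update key %s: %w", key, errNoChange)
	}
	s.data[key] = next
	return nil
}

// registerStrict mirrors the reported behavior: any CAS error is fatal.
func registerStrict(s *store, key string, want entry) error {
	if err := s.cas(key, func(entry) entry { return want }); err != nil {
		return fmt.Errorf("register instance in the ring: %w", err)
	}
	return nil
}

func main() {
	// 2034-10-29 10:37:51 UTC expressed as Unix seconds.
	corrupted := entry{State: "ACTIVE", LastHeartbeatUnix: 2045731071}
	s := &store{data: map[string]entry{"rulers/rulers": corrupted}}

	// The instance tries to write exactly what is already stored.
	fmt.Println(registerStrict(s, "rulers/rulers", corrupted))
}
```

In this model, every restart that writes the same entry hits the same error, which is why the loop never resolves on its own.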

We tried to forget the impacted instance through the /ruler/ring endpoint from a live Loki ruler pod. The HTTP call returned successfully, but had no lasting effect: the corrupted entry reappeared immediately, re-propagated by the other cluster members before the ruler pod could re-register. Deleting and recreating the pod, as well as deleting the PVC, also had no effect.

To Reproduce

We were unable to reliably reproduce this in a controlled environment. The corruption occurred in production and we do not know the exact sequence that produced the corrupted last_heartbeat_at. However, the second half of the failure (the crash loop itself) is deterministic once the ring is in the bad state:

The following log line appears several times (probably due to retries):

{"caller":"basic_lifecycler.go:322","instance":"loki-ruler-0","last_heartbeat_at":"2034-10-29 10:37:51 +0000 UTC","level":"info","msg":"instance found in the ring","registered_at":"2026-04-27 13:02:15 +0000 UTC","ring":"ruler","state":"ACTIVE","tokens":128,"ts":"2026-04-29T14:26:57.540728444Z"}

Then:

{"caller":"loki.go:631","error":"starting module ruler: invalid service state: Failed, expected: Running, failure: unable to start ruler subservices: not healthy, 0 terminated, 1 failed: [register instance in the ring: failed to CAS-update key rulers/rulers: no change detected]","level":"error","module":"ruler","msg":"module failed","ts":"2026-04-29T14:26:57.540994801Z"}

And finally:

{"caller":"log.go:223","err":"failed services\ngithub.com/grafana/loki/v3/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:673\nmain.main\n\t/src/loki/cmd/loki/main.go:149\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:290\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1447","level":"error","msg":"error running loki","ts":"2026-04-29T14:26:57.96926475Z"}

Expected behavior

Either:

  • The ruler should detect a stale/corrupted ring entry (e.g., last_heartbeat_at in the future) and overwrite it unconditionally on startup, or
  • The CAS logic should not treat an identical write as a fatal error — a no-op CAS on startup should be handled gracefully (e.g., treated as a successful registration), or
  • Forgetting an instance in the ring should be resilient enough to perform a cluster-wide tombstone that survives gossip re-propagation.
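
The first two options could look roughly like the sketch below. This is only an illustration under stated assumptions: `heartbeatInFuture`, `tolerateNoChange`, `errNoChange`, and the five-minute `maxClockSkewSeconds` tolerance are hypothetical names and values, not existing Loki or dskit APIs or settings.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoChange is a hypothetical sentinel for a CAS that changed nothing.
var errNoChange = errors.New("no change detected")

// maxClockSkewSeconds is an assumed tolerance, not a Loki setting.
const maxClockSkewSeconds int64 = 300

// heartbeatInFuture reports whether a stored heartbeat (Unix seconds) lies
// further ahead of now than the allowed clock skew.
func heartbeatInFuture(lastHeartbeatUnix, nowUnix int64) bool {
	return lastHeartbeatUnix > nowUnix+maxClockSkewSeconds
}

// tolerateNoChange treats a no-op CAS as a successful registration and
// passes every other error through unchanged.
func tolerateNoChange(err error) error {
	if errors.Is(err, errNoChange) {
		return nil
	}
	return err
}

func main() {
	// Timestamps taken from the log lines in this report.
	now := time.Date(2026, 4, 29, 14, 26, 57, 0, time.UTC).Unix()
	hb := time.Date(2034, 10, 29, 10, 37, 51, 0, time.UTC).Unix()
	fmt.Println(heartbeatInFuture(hb, now)) // true

	casErr := fmt.Errorf("failed to CAS-update key rulers/rulers: %w", errNoChange)
	fmt.Println(tolerateNoChange(casErr)) // <nil>
}
```

A startup path could use the first check to decide that the stored entry must be rewritten with a fresh heartbeat, and the second to avoid failing the module when the write turns out to be a no-op.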

Workaround

Changing the ruler ring key prefix forces the ruler to register under a new path, bypassing the corrupted entry entirely:

-ruler.ring.prefix=<new-prefix>/

This resolves the CrashLoopBackOff immediately but orphans the old corrupted key in the KV store.

Environment

  • Loki version: 3.7.1
  • Infrastructure: Kubernetes (EKS)
  • Deployment tool: Helm (StatefulSet, ruler deployed as separate component)
  • Ring backend: memberlist (gossip, in-memory, ~74 nodes)

Relevant configuration (anonymized)

target: ruler

ruler:
  enable_sharding: true
  sharding_strategy: default
  sharding_algo: by-group
  evaluation:
    mode: remote
    query_frontend:
      address: dns:///loki-ruler-query-frontend.loki.svc.cluster.local.:9095
  ring:
    kvstore:
      store: memberlist
      prefix: rulers/
    heartbeat_period: 5s
    heartbeat_timeout: 1m0s
    num_tokens: 128
    instance_id: loki-ruler-0
    instance_interface_names: [eth0, lo]

memberlist:
  randomize_node_name: true
  gossip_interval: 200ms
  gossip_nodes: 3
  pull_push_interval: 30s
  obsolete_entries_timeout: 30s
  left_ingesters_timeout: 5m0s
  leave_timeout: 20s
  broadcast_timeout_for_local_updates_on_shutdown: 10s
  join_members: loki-memberlist.loki.svc.cluster.local
  cluster_label: <cluster-label>
  advertise_addr: <pod-ip>
  bind_port: 7946
