Consul Fails to Query Service Health - consul_up is down ~40% of time #255

@nikashnarula

Description

What did you do?
Hello, I am new to Consul and trying to understand why the consul_up metric continuously fluctuates between up and down, even though all services are running well (all Consul nodes are healthy and all pods are running). We have an alert that triggers when the 5-minute average of consul_up drops below 90%: (avg_over_time(consul_up{job="consul-exporter"}[5m]) * 100) < 90.
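For context, the alert described above corresponds to a Prometheus alerting rule roughly like the following (a sketch; the group name, alert name, and labels are illustrative, only the expr is from our setup):

```yaml
# Sketch of the alerting rule described above; names are made up.
groups:
  - name: consul-exporter
    rules:
      - alert: ConsulExporterConsulDown
        # Fires when consul_up, averaged over the last 5 minutes,
        # was up less than 90% of the time.
        expr: (avg_over_time(consul_up{job="consul-exporter"}[5m]) * 100) < 90
        labels:
          severity: warning
        annotations:
          summary: "consul_exporter failed to query Consul more than 10% of the time over the last 5 minutes"
```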

What did you expect to see?
We expect consul_up to hold a constant value of 1.

What did you see instead? Under which circumstances?
Instead, we see consul_up continuously fluctuating between 1 (up) and 0 (down). As a result, our alert fires frequently even though all Consul health checks are passing (a Consul support engineer verified this).
I have attached images and log files illustrating the issue.
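One way to narrow this down (a suggestion on our part, not something from the attached logs) is to compare Prometheus's own `up` metric for the exporter target against `consul_up`, which consul_exporter sets to 0 whenever its last query of the Consul API failed:

```promql
# Scrape health of the exporter target itself
# (1 = Prometheus scraped the exporter successfully).
up{job="consul-exporter"}

# Health of the exporter's query to Consul
# (1 = the exporter's last query of Consul succeeded).
consul_up{job="consul-exporter"}
```

If `up` stays at 1 while `consul_up` flaps, the exporter is reachable but its queries to Consul are intermittently failing (e.g. timeouts or transient API errors), which would point at the exporter-to-Consul path rather than at Prometheus scraping.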

Attachments: consul_nodes_health, consul_uptime_graph, consul_uptime_value

Environment
Prod

  • System information:

    Linux 5.8.0-1041-aws x86_64

  • consul_exporter version:

    0.7.1

  • Consul version:

    Consul v1.8.0
    Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible
    agents)

  • Prometheus version:

    prometheus, version 2.28.1 (branch: HEAD, revision: b0944590a1c9a6b35dc5a696869f75f422b107a1)

  • Prometheus configuration file:

    prometheus_config.txt

  • Logs:
    prometheus_consul_exporter_logs.txt
    prometheus_logs.txt
