
Fix excessive health registrations and missing container handling in health-sync #310

Open
dflook wants to merge 3 commits into hashicorp:main from dflook:health-sync-rewrite

Conversation


@dflook dflook commented Feb 3, 2026

This PR rewrites syncChecks and related functions in the health-sync command to fix two bugs:

  1. health-sync sends excessive catalog registrations for dataplane container health #309
  2. ECS Health Sync Sending Traffic to Missing Unhealthy Containers #300

The previous implementation handled missing containers and present containers in separate code paths with different logic, and had no change detection for dataplane health updates.
This rewrite simplifies the logic into clear phases, making it easier to verify correctness.

Current behavior

The syncChecks function has these characteristics:

  • handleHealthForDataplaneContainer is called at the end of every sync cycle, regardless of whether the health status has changed
  • Missing containers are marked critical individually, but are not included when calculating overall dataplane health -
    only containers present in task metadata affect the aggregate health calculation
  • The dataplane container can be updated twice in a single sync cycle (once in the missing containers loop if missing, and again at the end)
  • Status tracking uses a map of containers that stores a mix of ECS health status and Consul health status values. These are both strings, but not semantically comparable.
  • The code paths for missing vs present containers have different logic
  • Many special cases for the dataplane container are scattered throughout the code

At scale (thousands of services), the unconditional dataplane health updates caused:

  • Consul leader CPU usage of 3000%+
  • Catalog registration latency increasing to 750ms+ (expected ~10-50ms)
  • Cluster marked as unhealthy by autopilot

Changes proposed in this PR:

syncChecks is now structured as three distinct phases:

  1. Gather state - fetch container health from ECS task metadata
  2. Compute checks - transform container health into Consul check statuses
  3. Update Consul - send updates only for checks whose status has changed
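Using hypothetical names and simplified types, the three phases can be sketched end-to-end. The real implementation reads the ECS task metadata endpoint and re-registers services through the Consul catalog API, but the shape is the same:

```go
package main

import "fmt"

// checkStatus mirrors the PR's distinct type for Consul statuses,
// keeping them separate from raw ECS health strings.
type checkStatus string

const (
	statusPassing  checkStatus = "passing"
	statusCritical checkStatus = "critical"
)

// syncOnce runs one sync cycle over the three phases. The phase 1
// result is passed in to keep the sketch self-contained; it returns
// the number of updates sent so the change detection is observable.
func syncOnce(containerHealth map[string]string, previous map[string]checkStatus) int {
	// Phase 2: compute desired Consul check statuses from container health.
	desired := map[string]checkStatus{}
	for name, health := range containerHealth {
		status := statusCritical
		if health == "HEALTHY" {
			status = statusPassing
		}
		desired["service:"+name] = status
	}

	// Phase 3: send an update only when a check's status has changed.
	updates := 0
	for checkID, status := range desired {
		if previous[checkID] != status {
			updates++ // stands in for updateConsulHealthStatus(checkID, status)
			previous[checkID] = status
		}
	}
	return updates
}

func main() {
	previous := map[string]checkStatus{}
	health := map[string]string{"app": "HEALTHY"} // phase 1 result (sketched)
	fmt.Println(syncOnce(health, previous))       // first cycle sends an update
	fmt.Println(syncOnce(health, previous))       // steady state: nothing to send
}
```

In steady state every cycle hits phase 3 with an unchanged map and sends nothing, which is what eliminates the per-cycle catalog registrations.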

Gather state

getContainerHealthStatuses replaces findContainersToSync. Where the old function returned two separate slices (found containers and missing containers), the new function returns a single map of container name to ECS health status.
Missing containers are explicitly marked as UNHEALTHY in this map, rather than being handled in a separate code path.
This directly fixes #300 - missing containers now affect the overall health calculation because they're represented in the same data structure as present containers.
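A minimal sketch of this gather phase, assuming a simplified signature (the real function reads the ECS task metadata endpoint rather than taking the reported health as a parameter):

```go
package main

import "fmt"

// getContainerHealthStatuses sketched: given the containers we expect
// and the health actually reported by ECS task metadata, return a
// single map in which missing containers are explicitly UNHEALTHY.
func getContainerHealthStatuses(expected []string, reported map[string]string) map[string]string {
	statuses := map[string]string{}
	for _, name := range expected {
		if health, ok := reported[name]; ok {
			statuses[name] = health
		} else {
			// Missing container: same data structure, worst status, so it
			// participates in the aggregate dataplane health calculation.
			statuses[name] = "UNHEALTHY"
		}
	}
	return statuses
}

func main() {
	expected := []string{"app", "log-router"}
	reported := map[string]string{"app": "HEALTHY"} // log-router absent from metadata
	fmt.Println(getContainerHealthStatuses(expected, reported))
}
```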

Compute checks

computeCheckStatuses is new. It takes the container health map and produces a map of Consul check ID to Consul check status. This is where the dataplane container's special handling now lives: it maps to both the service check and the proxy check (for non-gateways), and its status is the aggregate of all container health via computeOverallDataplaneHealth.

By computing all the check IDs and their desired statuses upfront, we eliminate the need for handleHealthForDataplaneContainer. That function existed to handle the "update two checks" logic for the dataplane container, but now computeCheckStatuses simply produces both check IDs in its output map. This means the other special-case behaviour for the dataplane container isn't needed.
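A sketch of the compute phase under the same assumptions; the check ID format and the isGateway flag here are illustrative, not the PR's exact signatures:

```go
package main

import "fmt"

type checkStatus string

const (
	statusPassing  checkStatus = "passing"
	statusCritical checkStatus = "critical"
)

// computeOverallDataplaneHealth aggregates: the dataplane's checks pass
// only if every container (present or missing) is healthy.
func computeOverallDataplaneHealth(containerHealth map[string]string) checkStatus {
	for _, health := range containerHealth {
		if health != "HEALTHY" {
			return statusCritical
		}
	}
	return statusPassing
}

// computeCheckStatuses maps the aggregate onto both check IDs the
// dataplane container is responsible for: the service check, and the
// '-sidecar-proxy' check for non-gateways.
func computeCheckStatuses(containerHealth map[string]string, serviceName string, isGateway bool) map[string]checkStatus {
	overall := computeOverallDataplaneHealth(containerHealth)
	checks := map[string]checkStatus{serviceName: overall}
	if !isGateway {
		checks[serviceName+"-sidecar-proxy"] = overall
	}
	return checks
}

func main() {
	health := map[string]string{"app": "HEALTHY", "log-router": "UNHEALTHY"}
	fmt.Println(computeCheckStatuses(health, "my-service", false))
}
```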

Update Consul

The update loop is now trivial: iterate over the computed check statuses, compare each to the previous status, and only call updateConsulHealthStatus if it changed. This fixes #309 - dataplane health is no longer updated unconditionally.

updateConsulHealthStatus now accepts the consul check status directly rather than an ECS health status, since the conversion happens in phase 2. This avoids mixing of health status types, and keeps conversion in one place.
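The type separation can be sketched as a single conversion helper (name hypothetical) plus an update function that only ever sees Consul statuses; the real updateConsulHealthStatus re-registers via Catalog().Register(), while this stand-in just reports what it would send:

```go
package main

import "fmt"

// checkStatus is a distinct type: a string variable holding an ECS
// value like "HEALTHY" cannot be passed where a checkStatus is
// expected without an explicit checkStatus(...) conversion.
type checkStatus string

const (
	statusPassing  checkStatus = "passing"
	statusCritical checkStatus = "critical"
)

// ecsToConsulStatus is the one place ECS health is translated to a
// Consul status (a hypothetical helper; in the PR this conversion
// happens during phase 2).
func ecsToConsulStatus(ecsHealth string) checkStatus {
	if ecsHealth == "HEALTHY" {
		return statusPassing
	}
	// UNKNOWN, UNHEALTHY, and missing containers all map to critical.
	return statusCritical
}

// updateConsulHealthStatus accepts a Consul status directly; it never
// sees raw ECS strings.
func updateConsulHealthStatus(checkID string, status checkStatus) string {
	return fmt.Sprintf("register %s as %s", checkID, status)
}

func main() {
	fmt.Println(updateConsulHealthStatus("svc-sidecar-proxy", ecsToConsulStatus("UNHEALTHY")))
}
```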

Changes to setChecksCritical

setChecksCritical (used during graceful shutdown) is updated to use computeCheckStatuses for consistency. It creates a container status map with all containers marked unhealthy, calls computeCheckStatuses to get the check IDs, then updates them all to critical. This replaces its previous approach of iterating containers with special-case handling for the dataplane container.
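A sketch of this approach, reusing a compute helper of the shape described above (signatures hypothetical):

```go
package main

import "fmt"

type checkStatus string

const (
	statusPassing  checkStatus = "passing"
	statusCritical checkStatus = "critical"
)

// computeCheckStatuses: check ID -> status, with everything critical
// unless every container is healthy (illustrative signature).
func computeCheckStatuses(containerHealth map[string]string, serviceName string, isGateway bool) map[string]checkStatus {
	overall := statusPassing
	for _, health := range containerHealth {
		if health != "HEALTHY" {
			overall = statusCritical
			break
		}
	}
	checks := map[string]checkStatus{serviceName: overall}
	if !isGateway {
		checks[serviceName+"-sidecar-proxy"] = overall
	}
	return checks
}

// setChecksCritical, as described: mark every container unhealthy,
// reuse computeCheckStatuses to enumerate the check IDs, and return
// them all as critical for the graceful-shutdown update.
func setChecksCritical(containers []string, serviceName string, isGateway bool) map[string]checkStatus {
	allUnhealthy := make(map[string]string, len(containers))
	for _, name := range containers {
		allUnhealthy[name] = "UNHEALTHY"
	}
	return computeCheckStatuses(allUnhealthy, serviceName, isGateway)
}

func main() {
	fmt.Println(setChecksCritical([]string{"app", "log-router"}, "my-service", false))
}
```

Because shutdown is expressed as "all containers unhealthy", the check IDs cannot drift from the ones syncChecks manages during normal operation.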

How I've tested this PR:

  • Added unit tests for getContainerHealthStatuses, computeOverallDataplaneHealth, and computeCheckStatuses
  • Added integration tests (TestSyncChecks_ChangeDetection, TestSyncChecks_Gateway_ChangeDetection) that verify:
    • Checks are updated correctly when container health changes
    • Checks are not updated when health remains the same
    • Missing containers are treated as unhealthy
    • Recovery from unhealthy to healthy works correctly
    • The check output string is correct

This change has been deployed and validated in an environment with thousands of registered services:

  • Consul server CPU usage dropped from 3000%+ to <20%
  • Network bandwidth reduced to ~10% of previous levels
  • Catalog registration latency returned to expected levels
  • Services don't start receiving traffic until all containers (including missing ones) are healthy.

The original syncChecks comment references "Consul TTL checks", but the implementation updates health
via Catalog().Register(). This PR preserves that approach. I'm not very familiar with Consul internals, but it seems
like this is referring to the Agent TTL API. If there's historical context on why catalog registration is used rather
than the Agent TTL API, it may be worth documenting.

Checklist:

  • Tests added
  • CHANGELOG entry added

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

    Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.

… container handling

This rewrites syncChecks and setChecksCritical to fix two bugs:

1. Dataplane health checks were sent to Consul on every sync cycle regardless
   of whether status changed, causing excessive load on Consul servers (hashicorp#309)
2. Missing containers were not considered when evaluating overall dataplane
   health, causing traffic to be routed before services were ready (hashicorp#300)

The new implementation structures syncChecks as three phases:
1. Gather state - fetch container health from ECS task metadata
2. Compute checks - transform container health into Consul check statuses
3. Update Consul - send updates only for checks whose status changed

New helper functions:
- getContainerHealthStatuses: replaces findContainersToSync, marks missing
  containers as UNHEALTHY in the same map as present containers
- computeOverallDataplaneHealth: computes aggregate health
- computeCheckStatuses: maps container health to Consul check IDs/statuses,
  handling the dataplane container's special case (service + proxy checks)

setChecksCritical is updated to use computeCheckStatuses for consistency.

handleHealthForDataplaneContainer is removed as its logic is now in
computeCheckStatuses.
Add integration tests that verify:
- Checks are updated correctly when container health changes
- Checks are not updated when health remains the same (no spurious updates)
- Missing containers are treated as unhealthy
- Recovery from unhealthy to healthy works correctly
- Gateway services work correctly (no proxy check)

Tests use log output parsing to verify that only expected checks were updated,
ensuring the change detection logic works correctly.
The health-sync rewrite changed the output message from referencing an ECS health check
that may or may not be accurate, to a generic Consul health status:

Before (original): ECS health status is "HEALTHY" for container "..."
After (rewrite):   Consul health status is "passing" for check "..."

The check output message will be improved in the future with distinct messages for missing containers, aggregate health, and graceful shutdown. For now, this just reverts to the original output messaging in a way that can be built on later.

It uses a distinct checkStatus struct, which should prevent mix-ups with ECS health status strings.
@dflook dflook requested a review from a team as a code owner February 3, 2026 14:49

Contributor

kswap commented Feb 26, 2026

FYI, we will be reviewing this PR in the coming week

@anandmukul93
Contributor

@dflook
Thanks for the PR. However, I have a few concerns:
For service + sidecar we register 3 effective checks: service + sidecar + dataplane.
Are we combining them into one health check for the service? Wouldn't that be valid, as the dataplane drives xDS updates, Envoy serves as the proxy, and the container should be up anyway?

Have we reduced the number of checks here?

Author

dflook commented Mar 16, 2026

The intention is that the number of registered checks is the same as before. The only differences in functionality should be:

  • missing containers now cause the aggregate checks to be critical. These checks would already have been critical if the missing containers had been present but unhealthy.
  • we no longer re-register the dataplane service if there is no change in its health check

I'm not sure about the reasoning behind the original design - the two aggregate checks are the dataplane container check and the '-sidecar-proxy' check, and they always have the same status (but are registered for different services in Consul)



Development

Successfully merging this pull request may close these issues.

  • health-sync sends excessive catalog registrations for dataplane container health
  • ECS Health Sync Sending Traffic to Missing Unhealthy Containers
