
Fix excessive health registrations and missing container handling in health-sync #310

Open
dflook wants to merge 3 commits into hashicorp:main from dflook:health-sync-rewrite

Conversation


@dflook dflook commented Feb 3, 2026

This PR rewrites syncChecks and related functions in the health-sync command to fix two bugs:

  1. health-sync sends excessive catalog registrations for dataplane container health #309
  2. ECS Health Sync Sending Traffic to Missing Unhealthy Containers #300

The previous implementation handled missing containers and present containers in separate code paths with different logic, and had no change detection for dataplane health updates.
This rewrite simplifies the logic into clear phases, making it easier to verify correctness.

Current behavior

The syncChecks function has these characteristics:

  • handleHealthForDataplaneContainer is called at the end of every sync cycle, regardless of whether the health status has changed
  • Missing containers are marked critical individually, but are not included when calculating overall dataplane health -
    only containers present in task metadata affect the aggregate health calculation
  • The dataplane container can be updated twice in a single sync cycle (once in the missing containers loop if missing, and again at the end)
  • Status tracking uses a map of containers that stores a mix of ECS health status and Consul health status values. These are both strings, but not semantically comparable.
  • The code paths for missing vs present containers have different logic
  • Many special cases for the dataplane container are scattered throughout the code

At scale (thousands of services), the unconditional dataplane health updates caused:

  • Consul leader CPU usage of 3000%+
  • Catalog registration latency increasing to 750ms+ (expected ~10-50ms)
  • Cluster marked as unhealthy by autopilot

Changes proposed in this PR:

syncChecks is now structured as three distinct phases:

  1. Gather state - fetch container health from ECS task metadata
  2. Compute checks - transform container health into Consul check statuses
  3. Update Consul - send updates only for checks whose status has changed
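Using hypothetical names and simplified types, the three phases can be sketched end-to-end. The real implementation reads the ECS task metadata endpoint and re-registers services through the Consul catalog API, but the shape is the same:

```go
package main

import "fmt"

// checkStatus mirrors the PR's distinct type for Consul statuses,
// keeping them separate from raw ECS health strings.
type checkStatus string

const (
	statusPassing  checkStatus = "passing"
	statusCritical checkStatus = "critical"
)

// syncOnce runs one sync cycle over the three phases. The phase 1
// result is passed in to keep the sketch self-contained; it returns
// the number of updates sent so the change detection is observable.
func syncOnce(containerHealth map[string]string, previous map[string]checkStatus) int {
	// Phase 2: compute desired Consul check statuses from container health.
	desired := map[string]checkStatus{}
	for name, health := range containerHealth {
		status := statusCritical
		if health == "HEALTHY" {
			status = statusPassing
		}
		desired["service:"+name] = status
	}

	// Phase 3: send an update only when a check's status has changed.
	updates := 0
	for checkID, status := range desired {
		if previous[checkID] != status {
			updates++ // stands in for updateConsulHealthStatus(checkID, status)
			previous[checkID] = status
		}
	}
	return updates
}

func main() {
	previous := map[string]checkStatus{}
	health := map[string]string{"app": "HEALTHY"} // phase 1 result (sketched)
	fmt.Println(syncOnce(health, previous))       // first cycle sends an update
	fmt.Println(syncOnce(health, previous))       // steady state: nothing to send
}
```

In steady state every cycle hits phase 3 with an unchanged map and sends nothing, which is what eliminates the per-cycle catalog registrations.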

Gather state

getContainerHealthStatuses replaces findContainersToSync. Where the old function returned two separate slices (found containers and missing containers), the new function returns a single map of container name to ECS health status.
Missing containers are explicitly marked as UNHEALTHY in this map, rather than being handled in a separate code path.
This directly fixes #300 - missing containers now affect the overall health calculation because they're represented in the same data structure as present containers.
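A minimal sketch of this gather phase, assuming a simplified signature (the real function reads the ECS task metadata endpoint rather than taking the reported health as a parameter):

```go
package main

import "fmt"

// getContainerHealthStatuses sketched: given the containers we expect
// and the health actually reported by ECS task metadata, return a
// single map in which missing containers are explicitly UNHEALTHY.
func getContainerHealthStatuses(expected []string, reported map[string]string) map[string]string {
	statuses := map[string]string{}
	for _, name := range expected {
		if health, ok := reported[name]; ok {
			statuses[name] = health
		} else {
			// Missing container: same data structure, worst status, so it
			// participates in the aggregate dataplane health calculation.
			statuses[name] = "UNHEALTHY"
		}
	}
	return statuses
}

func main() {
	expected := []string{"app", "log-router"}
	reported := map[string]string{"app": "HEALTHY"} // log-router absent from metadata
	fmt.Println(getContainerHealthStatuses(expected, reported))
}
```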

Compute checks

computeCheckStatuses is new. It takes the container health map and produces a map of Consul check ID to Consul check status. This is where the dataplane container's special handling now lives: it maps to both the service check and the proxy check (for non-gateways), and its status is the aggregate of all container health via computeOverallDataplaneHealth.

By computing all the check IDs and their desired statuses upfront, we eliminate the need for handleHealthForDataplaneContainer. That function existed to handle the "update two checks" logic for the dataplane container, but now computeCheckStatuses simply produces both check IDs in its output map. This means the other special-case behaviour for the dataplane container isn't needed.
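A sketch of the compute phase under the same assumptions; the check ID format and the isGateway flag here are illustrative, not the PR's exact signatures:

```go
package main

import "fmt"

type checkStatus string

const (
	statusPassing  checkStatus = "passing"
	statusCritical checkStatus = "critical"
)

// computeOverallDataplaneHealth aggregates: the dataplane's checks pass
// only if every container (present or missing) is healthy.
func computeOverallDataplaneHealth(containerHealth map[string]string) checkStatus {
	for _, health := range containerHealth {
		if health != "HEALTHY" {
			return statusCritical
		}
	}
	return statusPassing
}

// computeCheckStatuses maps the aggregate onto both check IDs the
// dataplane container is responsible for: the service check, and the
// '-sidecar-proxy' check for non-gateways.
func computeCheckStatuses(containerHealth map[string]string, serviceName string, isGateway bool) map[string]checkStatus {
	overall := computeOverallDataplaneHealth(containerHealth)
	checks := map[string]checkStatus{serviceName: overall}
	if !isGateway {
		checks[serviceName+"-sidecar-proxy"] = overall
	}
	return checks
}

func main() {
	health := map[string]string{"app": "HEALTHY", "log-router": "UNHEALTHY"}
	fmt.Println(computeCheckStatuses(health, "my-service", false))
}
```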

Update Consul

The update loop is now trivial: iterate over the computed check statuses, compare each to the previous status, and only call updateConsulHealthStatus if it changed. This fixes #309 - dataplane health is no longer updated unconditionally.

updateConsulHealthStatus now accepts the consul check status directly rather than an ECS health status, since the conversion happens in phase 2. This avoids mixing of health status types, and keeps conversion in one place.
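The type separation can be sketched as a single conversion helper (name hypothetical) plus an update function that only ever sees Consul statuses; the real updateConsulHealthStatus re-registers via Catalog().Register(), while this stand-in just reports what it would send:

```go
package main

import "fmt"

// checkStatus is a distinct type: a string variable holding an ECS
// value like "HEALTHY" cannot be passed where a checkStatus is
// expected without an explicit checkStatus(...) conversion.
type checkStatus string

const (
	statusPassing  checkStatus = "passing"
	statusCritical checkStatus = "critical"
)

// ecsToConsulStatus is the one place ECS health is translated to a
// Consul status (a hypothetical helper; in the PR this conversion
// happens during phase 2).
func ecsToConsulStatus(ecsHealth string) checkStatus {
	if ecsHealth == "HEALTHY" {
		return statusPassing
	}
	// UNKNOWN, UNHEALTHY, and missing containers all map to critical.
	return statusCritical
}

// updateConsulHealthStatus accepts a Consul status directly; it never
// sees raw ECS strings.
func updateConsulHealthStatus(checkID string, status checkStatus) string {
	return fmt.Sprintf("register %s as %s", checkID, status)
}

func main() {
	fmt.Println(updateConsulHealthStatus("svc-sidecar-proxy", ecsToConsulStatus("UNHEALTHY")))
}
```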

Changes to setChecksCritical

setChecksCritical (used during graceful shutdown) is updated to use computeCheckStatuses for consistency. It creates a container status map with all containers marked unhealthy, calls computeCheckStatuses to get the check IDs, then updates them all to critical. This replaces its previous approach of iterating containers with special-case handling for the dataplane container.
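A sketch of this approach, reusing a compute helper of the shape described above (signatures hypothetical):

```go
package main

import "fmt"

type checkStatus string

const (
	statusPassing  checkStatus = "passing"
	statusCritical checkStatus = "critical"
)

// computeCheckStatuses: check ID -> status, with everything critical
// unless every container is healthy (illustrative signature).
func computeCheckStatuses(containerHealth map[string]string, serviceName string, isGateway bool) map[string]checkStatus {
	overall := statusPassing
	for _, health := range containerHealth {
		if health != "HEALTHY" {
			overall = statusCritical
			break
		}
	}
	checks := map[string]checkStatus{serviceName: overall}
	if !isGateway {
		checks[serviceName+"-sidecar-proxy"] = overall
	}
	return checks
}

// setChecksCritical, as described: mark every container unhealthy,
// reuse computeCheckStatuses to enumerate the check IDs, and return
// them all as critical for the graceful-shutdown update.
func setChecksCritical(containers []string, serviceName string, isGateway bool) map[string]checkStatus {
	allUnhealthy := make(map[string]string, len(containers))
	for _, name := range containers {
		allUnhealthy[name] = "UNHEALTHY"
	}
	return computeCheckStatuses(allUnhealthy, serviceName, isGateway)
}

func main() {
	fmt.Println(setChecksCritical([]string{"app", "log-router"}, "my-service", false))
}
```

Because shutdown is expressed as "all containers unhealthy", the check IDs cannot drift from the ones syncChecks manages during normal operation.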

How I've tested this PR:

  • Added unit tests for getContainerHealthStatuses, computeOverallDataplaneHealth, and computeCheckStatuses
  • Added integration tests (TestSyncChecks_ChangeDetection, TestSyncChecks_Gateway_ChangeDetection) that verify:
    • Checks are updated correctly when container health changes
    • Checks are not updated when health remains the same
    • Missing containers are treated as unhealthy
    • Recovery from unhealthy to healthy works correctly
    • The check output string is correct

This change has been deployed and validated in an environment with thousands of registered services:

  • Consul server CPU usage dropped from 3000%+ to <20%
  • Network bandwidth reduced to ~10% of previous levels
  • Catalog registration latency returned to expected levels
  • Services don't start receiving traffic until all containers (including missing ones) are healthy.

The original syncChecks comment references "Consul TTL checks", but the implementation updates health
via Catalog().Register(). This PR preserves that approach. I'm not very familiar with Consul internals, but it seems
like this is referring to the Agent TTL API. If there's historical context on why catalog registration is used rather
than the Agent TTL API, it may be worth documenting.

Checklist:

  • Tests added
  • CHANGELOG entry added

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

    Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.

… container handling

This rewrites syncChecks and setChecksCritical to fix two bugs:

1. Dataplane health checks were sent to Consul on every sync cycle regardless
   of whether status changed, causing excessive load on Consul servers (hashicorp#309)
2. Missing containers were not considered when evaluating overall dataplane
   health, causing traffic to be routed before services were ready (hashicorp#300)

The new implementation structures syncChecks as three phases:
1. Gather state - fetch container health from ECS task metadata
2. Compute checks - transform container health into Consul check statuses
3. Update Consul - send updates only for checks whose status changed

New helper functions:
- getContainerHealthStatuses: replaces findContainersToSync, marks missing
  containers as UNHEALTHY in the same map as present containers
- computeOverallDataplaneHealth: computes aggregate health
- computeCheckStatuses: maps container health to Consul check IDs/statuses,
  handling the dataplane container's special case (service + proxy checks)

setChecksCritical is updated to use computeCheckStatuses for consistency.

handleHealthForDataplaneContainer is removed as its logic is now in
computeCheckStatuses.
Add integration tests that verify:
- Checks are updated correctly when container health changes
- Checks are not updated when health remains the same (no spurious updates)
- Missing containers are treated as unhealthy
- Recovery from unhealthy to healthy works correctly
- Gateway services work correctly (no proxy check)

Tests use log output parsing to verify that only expected checks were updated,
ensuring the change detection logic works correctly.
The health-sync rewrite changed the output message from referencing an ECS health check
that may or may not be accurate, to a generic Consul health status:

Before (original): ECS health status is "HEALTHY" for container "..."
After (rewrite):   Consul health status is "passing" for check "..."

The check output message will be improved in the future with distinct messages for missing containers, aggregate health, and graceful shutdown. For now, this just reverts to the original output messaging in a way that can be built on later.

It uses a distinct checkStatus struct, which should prevent mix-ups with ECS health status strings.
@dflook dflook requested a review from a team as a code owner February 3, 2026 14:49

Contributor

kswap commented Feb 26, 2026

FYI, we will be reviewing this PR in the coming week

@anandmukul93
Contributor

@dflook
Thanks for the PR. However, I have a few concerns:
For service + sidecar we register 3 effective checks: service + sidecar + dataplane.
Are we combining them into one health check for the service? Wouldn't that be valid, as the dataplane drives xDS updates, Envoy serves as the proxy, and the container should be up anyway?

Have we reduced the number of checks here?

Author

dflook commented Mar 16, 2026

The intention is that the number of registered checks is the same as before. The only differences in functionality should be:

  • missing containers now cause the aggregate checks to be critical. These checks would already have been critical if the missing containers had been present but unhealthy.
  • we no longer re-register the dataplane service if there is no change in its health check

I'm not sure about the reasoning behind the original design - the two aggregate checks are the dataplane container check and the '-sidecar-proxy' check, and they always have the same status (but are registered for different services in Consul)



Development

Successfully merging this pull request may close these issues.

  • health-sync sends excessive catalog registrations for dataplane container health
  • ECS Health Sync Sending Traffic to Missing Unhealthy Containers
