Fix excessive health registrations and missing container handling in health-sync#310
dflook wants to merge 3 commits into hashicorp:main
Conversation
Fix excessive health registrations and missing container handling

This rewrites `syncChecks` and `setChecksCritical` to fix two bugs:

1. Dataplane health checks were sent to Consul on every sync cycle regardless of whether the status changed, causing excessive load on Consul servers (hashicorp#309)
2. Missing containers were not considered when evaluating overall dataplane health, causing traffic to be routed before services were ready (hashicorp#300)

The new implementation structures `syncChecks` as three phases:

1. Gather state - fetch container health from ECS task metadata
2. Compute checks - transform container health into Consul check statuses
3. Update Consul - send updates only for checks whose status changed

New helper functions:

- `getContainerHealthStatuses`: replaces `findContainersToSync`, marks missing containers as UNHEALTHY in the same map as present containers
- `computeOverallDataplaneHealth`: computes aggregate health
- `computeCheckStatuses`: maps container health to Consul check IDs/statuses, handling the dataplane container's special case (service + proxy checks)

`setChecksCritical` is updated to use `computeCheckStatuses` for consistency. `handleHealthForDataplaneContainer` is removed, as its logic is now in `computeCheckStatuses`.
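The three-phase structure described in this commit can be sketched roughly as follows. This is a minimal illustration with hypothetical simplified names and string statuses, not the actual health-sync implementation:

```go
package main

import "fmt"

// Hypothetical simplified constants; the real code uses ECS and Consul
// API values rather than bare strings.
const (
	ecsHealthy    = "HEALTHY"
	checkPassing  = "passing"
	checkCritical = "critical"
)

// toCheckStatus maps an ECS container health value to a Consul check status.
func toCheckStatus(ecsStatus string) string {
	if ecsStatus == ecsHealthy {
		return checkPassing
	}
	return checkCritical
}

// syncOnce sketches one sync cycle: phase 2 computes desired check statuses
// from gathered container health, and phase 3 returns only the checks whose
// status differs from the previous snapshot.
func syncOnce(containerHealth, previous map[string]string) map[string]string {
	// Phase 2: compute desired check statuses.
	desired := make(map[string]string, len(containerHealth))
	for name, ecsStatus := range containerHealth {
		desired["service:"+name] = toCheckStatus(ecsStatus)
	}
	// Phase 3: only emit updates for checks whose status changed.
	updates := make(map[string]string)
	for checkID, status := range desired {
		if previous[checkID] != status {
			updates[checkID] = status
		}
	}
	return updates
}

func main() {
	prev := map[string]string{"service:app": checkPassing}
	// Phase 1 (gathering from ECS task metadata) is stubbed with a literal map.
	updates := syncOnce(map[string]string{"app": "HEALTHY", "db": "UNHEALTHY"}, prev)
	fmt.Println(updates) // "app" is unchanged, so only the "db" check is updated
}
```

The change-detection step in phase 3 is what removes the per-cycle unconditional updates described in hashicorp#309.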
Add integration tests that verify:

- Checks are updated correctly when container health changes
- Checks are not updated when health remains the same (no spurious updates)
- Missing containers are treated as unhealthy
- Recovery from unhealthy to healthy works correctly
- Gateway services work correctly (no proxy check)

Tests use log output parsing to verify that only the expected checks were updated, ensuring the change detection logic works correctly.
The health-sync rewrite changed the output message from referencing an ECS health check that may or may not be accurate to a generic Consul health status:

- Before (original): `ECS health status is "HEALTHY" for container "..."`
- After (rewrite): `Consul health status is "passing" for check "..."`

The check output message will be improved in the future with distinct messages for missing containers, aggregate health, and graceful shutdown, but for now this just reverts to the original output messaging in a way that can be built on. It makes use of a distinct `checkStatus` struct, which should prevent mix-ups with ECS health status strings.
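The idea of a distinct status type can be sketched like this. The names below (`checkStatus`, `ecsToCheckStatus`) are hypothetical and only illustrate how a named type keeps the two status vocabularies from being mixed:

```go
package main

import "fmt"

// checkStatus is a distinct type for Consul check statuses, so an ECS
// health string ("HEALTHY") cannot be used where a Consul status
// ("passing") is expected without an explicit conversion.
type checkStatus string

const (
	statusPassing  checkStatus = "passing"
	statusCritical checkStatus = "critical"
)

// ecsToCheckStatus is the single conversion point between the two vocabularies.
func ecsToCheckStatus(ecsHealth string) checkStatus {
	if ecsHealth == "HEALTHY" {
		return statusPassing
	}
	return statusCritical
}

func main() {
	s := ecsToCheckStatus("HEALTHY")
	fmt.Printf("Consul health status is %q for check %q\n", s, "frontend-service")
}
```

Because `checkStatus` is a named type, passing a raw ECS string where a `checkStatus` is expected is a compile-time error rather than a silent mix-up.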
FYI, we will be reviewing this PR in the coming week.
@dflook have we reduced the number of checks here?
The intention is that the number of registered checks is the same as before; the only differences in functionality should be the bug fixes described in the PR.
I'm not sure about the reasoning behind the original design. The two aggregate checks are the dataplane container check and the `-sidecar-proxy` check, and they always have the same status (but are registered for different services in Consul).
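The point about the two aggregate checks can be sketched as follows. The function name and check ID scheme here are hypothetical, only illustrating that one aggregate status is written under two check IDs:

```go
package main

import "fmt"

// dataplaneCheckStatuses writes one aggregate status to two check IDs:
// the service check and the sidecar-proxy check. They always carry the
// same status but are registered against different Consul services.
func dataplaneCheckStatuses(serviceName, aggregate string) map[string]string {
	return map[string]string{
		"service:" + serviceName:                    aggregate,
		"service:" + serviceName + "-sidecar-proxy": aggregate,
	}
}

func main() {
	checks := dataplaneCheckStatuses("frontend", "passing")
	fmt.Println(checks["service:frontend"] == checks["service:frontend-sidecar-proxy"]) // true
}
```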
This PR rewrites `syncChecks` and related functions in the health-sync command to fix two bugs. The previous implementation handled missing containers and present containers in separate code paths with different logic, and had no change detection for dataplane health updates.
This rewrite simplifies the logic into clear phases, making it easier to verify correctness.
Current behavior
The `syncChecks` function has these characteristics:

- `handleHealthForDataplaneContainer` is called at the end of every sync cycle, regardless of whether the health status has changed
- only containers present in task metadata affect the aggregate health calculation
At scale (thousands of services), the unconditional dataplane health updates caused excessive load on the Consul servers (#309).
Changes proposed in this PR:
`syncChecks` is now structured as three distinct phases.

Gather state
`getContainerHealthStatuses` replaces `findContainersToSync`. Where the old function returned two separate slices (found containers and missing containers), the new function returns a single map of container name to ECS health status. Missing containers are explicitly marked as `UNHEALTHY` in this map, rather than being handled in a separate code path.

This directly fixes #300: missing containers now affect the overall health calculation because they're represented in the same data structure as present containers.
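A minimal sketch of this approach, with hypothetical names (`containerHealthStatuses`, `overallHealth`) standing in for the real helpers:

```go
package main

import "fmt"

const (
	healthy   = "HEALTHY"
	unhealthy = "UNHEALTHY"
)

// containerHealthStatuses builds one map covering every expected container.
// Containers absent from task metadata are marked UNHEALTHY in the same map,
// so they flow through the aggregate calculation like any other container.
func containerHealthStatuses(expected []string, metadata map[string]string) map[string]string {
	statuses := make(map[string]string, len(expected))
	for _, name := range expected {
		if status, ok := metadata[name]; ok {
			statuses[name] = status
		} else {
			statuses[name] = unhealthy // missing container
		}
	}
	return statuses
}

// overallHealth is healthy only if every container is healthy.
func overallHealth(statuses map[string]string) string {
	for _, s := range statuses {
		if s != healthy {
			return unhealthy
		}
	}
	return healthy
}

func main() {
	// "db" is expected but absent from task metadata.
	statuses := containerHealthStatuses(
		[]string{"app", "db"},
		map[string]string{"app": healthy},
	)
	fmt.Println(statuses["db"])          // UNHEALTHY
	fmt.Println(overallHealth(statuses)) // UNHEALTHY
}
```

Because the missing container is an ordinary entry in the map, the aggregate calculation needs no separate code path for it.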
Compute checks
`computeCheckStatuses` is new. It takes the container health map and produces a map of Consul check ID to Consul check status. This is where the dataplane container's special handling now lives: it maps to both the service check and the proxy check (for non-gateways), and its status is the aggregate of all container health via `computeOverallDataplaneHealth`.

By computing all the check IDs and their desired statuses upfront, we eliminate the need for `handleHealthForDataplaneContainer`. That function existed to handle the "update two checks" logic for the dataplane container, but now `computeCheckStatuses` simply produces both check IDs in its output map. This means no other special-case behaviour for the dataplane container is needed.

Update Consul
The update loop is now trivial: iterate over the computed check statuses, compare each to the previous status, and only call `updateConsulHealthStatus` if it changed. This fixes #309: dataplane health is no longer updated unconditionally.

`updateConsulHealthStatus` now accepts the Consul check status directly rather than an ECS health status, since the conversion happens in phase 2. This avoids mixing health status types and keeps the conversion in one place.

Changes to setChecksCritical
`setChecksCritical` (used during graceful shutdown) is updated to use `computeCheckStatuses` for consistency. It creates a container status map with all containers marked unhealthy, calls `computeCheckStatuses` to get the check IDs, then updates them all to critical. This replaces its previous approach of iterating over containers with special-case handling for the dataplane container.

How I've tested this PR:
- Unit tests for `getContainerHealthStatuses`, `computeOverallDataplaneHealth`, and `computeCheckStatuses`
- Integration tests (`TestSyncChecks_ChangeDetection`, `TestSyncChecks_Gateway_ChangeDetection`) that verify the change detection behaviour

This change has been deployed and validated in an environment with thousands of registered services.
The original `syncChecks` comment references "Consul TTL checks", but the implementation updates health via `Catalog().Register()`. This PR preserves that approach. I'm not very familiar with Consul internals, but it seems like this is referring to the Agent TTL API. If there's historical context on why catalog registration is used rather than the Agent TTL API, it may be worth documenting.
Checklist:
PCI review checklist
I have documented a clear reason for, and description of, the change I am making.
If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.
If applicable, I've documented the impact of any changes to security controls.
Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.