Fix health-sync logic #320

Open

kswap wants to merge 3 commits into main from ft_ecs_health_check_fix

Conversation


@kswap (Contributor) commented Mar 12, 2026

Changes proposed in this PR:

  • Fix the consul-dataplane delta check to eliminate redundant catalog writes.
  • Double-write bug: when consul-dataplane was absent from task metadata, both the missing-containers loop and the aggregation block called handleHealthForDataplaneContainer with an unhealthy status on the same tick.
  • Inconsistent status formats: both the ECS format (healthy/unhealthy) and the Consul format (Critical/Passing) were in use. Fixed to use one format consistently.
  • If a container listed in healthSyncContainers is missing from ECS metadata entirely, it is now treated as unhealthy (critical), not as absent/ignored.
  • Added tests for all of these cases.
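The fixed behavior described above can be sketched as a single pass that treats missing containers as critical and only writes on a status change. This is an illustrative, self-contained sketch; `syncHealth` and the maps it takes are hypothetical stand-ins, not the actual consul-ecs code:

```go
package main

import "fmt"

// Consul's status vocabulary, used consistently instead of mixing in
// ECS's healthy/unhealthy strings.
const (
	Passing  = "passing"
	Critical = "critical"
)

// syncHealth compares the desired status of every container in
// healthSyncContainers against the last status written to Consul and
// returns only the writes that are actually needed. A container missing
// from ECS task metadata is treated as critical, not skipped, and each
// container is considered exactly once per tick (no double write).
func syncHealth(healthSyncContainers []string, ecsStatuses, currentStatuses map[string]string) map[string]string {
	writes := make(map[string]string)
	for _, name := range healthSyncContainers {
		status, found := ecsStatuses[name]
		if !found {
			status = Critical // missing from task metadata => critical
		}
		// Delta check: only write when the status actually changed.
		if currentStatuses[name] != status {
			writes[name] = status
		}
	}
	return writes
}

func main() {
	writes := syncHealth(
		[]string{"app", "consul-dataplane"},
		map[string]string{"app": Passing}, // dataplane absent from metadata
		map[string]string{"app": Passing, "consul-dataplane": Critical},
	)
	// The dataplane is already critical in Consul, so no redundant write.
	fmt.Println(writes)
}
```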

GitHub issue 1: ECS Health Sync Sending Traffic to Missing Unhealthy Containers
GitHub issue 2: health-sync sends excessive catalog registrations for dataplane container health

How I've tested this PR:

  • Added unit tests for all the changes done.

  • Tests on local setup with logs:

  • No redundant write test: Bug1_test
  • Eliminate redundant catalog writes (a status change still triggers a write): status_change_triggers_write
  • A container listed in healthSyncContainers but missing from ECS metadata entirely is treated as unhealthy (critical), not as absent/ignored: Bug2_test
  • Regression test (healthy proxy -> unhealthy -> healthy): Regression
  • Regression test (if the ECS task is gracefully stopped, health-sync correctly cleans up in Consul before the task exits): Regression 2

How I expect reviewers to test this PR:

Checklist:

  • Tests added
  • CHANGELOG entry added

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

    Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.

@kswap kswap requested review from a team as code owners March 12, 2026 07:11
@anandmukul93 (Contributor) commented:

I didn't understand this testing use case.
Important: "Container listed in healthSyncContainers - if missing from ECS metadata entirely, treat it as unhealthy (critical), not as absent/ignored."

Can you tell me why it is passing while initializing?

If you look at this PR, it has 3 distinct steps:
https://github.com/hashicorp/consul-ecs/pull/310/changes
  1. Gather state - fetch container health from ECS task metadata, include the dataplane, and default missing containers to unhealthy.
  2. Compute checks - transform container health into Consul check statuses.
  3. Update Consul - send updates only for checks whose status has changed.
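Those three steps could be factored roughly as below. This is only a sketch of the shape of the pipeline; none of these function names come from PR #310, and the maps stand in for the real metadata and Consul API types:

```go
package main

import "fmt"

// gatherState: fetch container health from ECS task metadata, defaulting
// containers that are missing from the metadata to "UNHEALTHY".
func gatherState(expected []string, taskMeta map[string]string) map[string]string {
	state := make(map[string]string)
	for _, name := range expected {
		if s, ok := taskMeta[name]; ok {
			state[name] = s
		} else {
			state[name] = "UNHEALTHY"
		}
	}
	return state
}

// computeChecks: translate ECS health strings into Consul check statuses.
func computeChecks(state map[string]string) map[string]string {
	checks := make(map[string]string)
	for name, s := range state {
		if s == "HEALTHY" {
			checks[name] = "passing"
		} else {
			checks[name] = "critical"
		}
	}
	return checks
}

// updateConsul: report only the checks whose status changed since the
// last write; a real implementation would call the Consul API here.
func updateConsul(checks, current map[string]string) []string {
	var updated []string
	for name, status := range checks {
		if current[name] != status {
			updated = append(updated, name)
		}
	}
	return updated
}

func main() {
	state := gatherState([]string{"app", "consul-dataplane"}, map[string]string{"app": "HEALTHY"})
	checks := computeChecks(state)
	// Only the dataplane changed (missing => critical), so only it is written.
	fmt.Println(updateConsul(checks, map[string]string{"app": "passing"}))
}
```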

Overall major changes:
We have split paths for missing and toSync containers; this could be organized better, as in the PR above.
Why are we assuming the dataplane sync is or isn't done? The dataplane container starts the proxy and runs an xDS server for configuring Envoy (gateway or sidecar), so we register the dataplane, and it registers two checks: proxy and dataplane. If we don't maintain those, Envoy can be wrongly configured for a long time, resulting in outdated behavior while still appearing healthy. So overall health should include dataplane + proxy + app container, in my view.
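The "overall health = dataplane + proxy + app" argument amounts to a simple fold: the service is passing only if every contributing check is passing. A minimal sketch (the function name is hypothetical):

```go
package main

import "fmt"

// aggregateStatus folds individual check statuses into one overall
// status: passing only if every contributing check is passing. The point
// from the review comment is that dataplane + proxy + app all feed the
// overall health, since a broken dataplane means Envoy's configuration
// can go stale even while the app container itself looks healthy.
func aggregateStatus(statuses ...string) string {
	for _, s := range statuses {
		if s != "passing" {
			return "critical"
		}
	}
	return "passing"
}

func main() {
	// dataplane, proxy, app
	fmt.Println(aggregateStatus("passing", "passing", "critical")) // critical
}
```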

Also:
subcommand/mesh-init/checks.go: constructChecks() always appends a dataplane readiness check for the given service ID (CheckID = -consul-dataplane).

A typical service has 3 synced checks when you have one app container in healthSyncContainers:
service check for app container (-)
service check for dataplane (-consul-dataplane)
proxy check for dataplane (-consul-dataplane)
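That set of check IDs could be enumerated roughly as follows. Apart from the "-consul-dataplane" CheckID suffix mentioned above, the ID patterns here (including the proxy-check suffix) are assumptions for illustration, not the actual consul-ecs naming:

```go
package main

import "fmt"

// checkIDs sketches the check IDs a service ends up with when one app
// container is listed in healthSyncContainers: one service check per app
// container, plus a service check and a proxy check for the dataplane.
// Only the "-consul-dataplane" suffix is taken from the review comment;
// the other patterns are hypothetical.
func checkIDs(serviceID string, appContainers []string) []string {
	var ids []string
	for _, c := range appContainers {
		ids = append(ids, serviceID+"-"+c) // service check for app container
	}
	ids = append(ids,
		serviceID+"-consul-dataplane",               // service check for dataplane
		serviceID+"-sidecar-proxy-consul-dataplane", // proxy check (name assumed)
	)
	return ids
}

func main() {
	fmt.Println(checkIDs("web", []string{"app"}))
}
```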

It all starts from healthSyncContainers, which contains config.HealthSyncContainers, which is basically:

    merge(
      def,
      {
        dependsOn = flatten(
          concat(
            lookup(def, "dependsOn", []),
            [
              {
                containerName = "consul-dataplane"
                condition     = "HEALTHY"
              }
            ]
        ))
      },
      {
        // Use the def.entryPoint, if defined. Else, use the app_entrypoint, which is null by default.
        entryPoint = lookup(def, "entryPoint", local.app_entrypoint)
        mountPoints = flatten(
          concat(
            lookup(def, "mountPoints", []),
            local.app_mountpoints,
          )
        )
      }
    )
  ]


  defaulted_check_containers (the health-sync containers) = [for def in local.container_defs_with_depends_on : def.name]


  # dataplane is added on top of the app container definitions in the health-sync command.
  
  
1) Missing vs. present containers still behave differently (and why that matters)
In the current subcommand/health-sync/checks.go:
  • The missing loop writes immediately for each missing non-dataplane container (the updateConsulHealthStatus call around line ~121), without checking currentStatuses[name].
  • The present loop first checks if container.Health.Status == currentStatuses[container.Name] { continue } (around line ~147), so unchanged present containers skip writes.
Impact:
  • A container that stays missing for many ticks can cause repeated Catalog.Register writes each cycle.
  • Present containers do not have that churn.
  • This asymmetry can create noisy logs and unnecessary Consul writes under startup lag or partial task metadata.
  • It is behaviorally inconsistent with the "checks only updated when changed" promise made in comments in the same file.
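The cost of that asymmetry is easy to see with a toy tick simulation (illustrative only; `simulateTicks` is hypothetical and just counts writes, it does not model the real loops):

```go
package main

import "fmt"

// simulateTicks counts Catalog writes for a container that stays missing
// from ECS metadata for n consecutive ticks, with and without a delta
// check against the status last written to Consul.
func simulateTicks(n int, deltaCheck bool) int {
	writes := 0
	current := "" // status last written to Consul; empty = never written
	for i := 0; i < n; i++ {
		newStatus := "critical" // missing container => critical each tick
		if !deltaCheck || current != newStatus {
			writes++ // a real implementation would call Catalog.Register here
			current = newStatus
		}
	}
	return writes
}

func main() {
	fmt.Println(simulateTicks(5, false)) // 5 redundant writes without the check
	fmt.Println(simulateTicks(5, true))  // 1 write with the delta check
}
```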


2) Dataplane special-case logic is still scattered (and why that matters)
Current dataplane handling is split across several locations in subcommand/health-sync/checks.go:
  • the setChecksCritical dataplane branch (~74)
  • the missing-loop skip for the dataplane (~115)
  • the found-loop skip for the dataplane (~136)
  • the final aggregate dataplane write (~167)
  • the dedicated helper for dual service/proxy check updates (~208)
Impact:
  • Future behavior changes must touch multiple places; it is easy to miss one.
  • It is harder to reason about invariants ("who owns the dataplane update in this cycle?").
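One way to restore a single owner for that invariant is to compute the dataplane's desired status in exactly one place per cycle. A sketch under that assumption (the function is hypothetical, not a proposed diff):

```go
package main

import "fmt"

// dataplaneStatus is the single place that decides the dataplane's
// status for a cycle, replacing the scattered special cases (the
// setChecksCritical branch, the missing-loop skip, the found-loop skip,
// and the final aggregate write). The caller then writes at most once.
func dataplaneStatus(presentInMeta bool, ecsHealth string) string {
	if !presentInMeta || ecsHealth != "HEALTHY" {
		return "critical"
	}
	return "passing"
}

func main() {
	fmt.Println(dataplaneStatus(false, ""))        // absent from metadata => critical
	fmt.Println(dataplaneStatus(true, "HEALTHY")) // present and healthy => passing
}
```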
 
We may also look at this function, which makes multiple calls for different health checks. If we can combine the health checks into one query, that would be great:

    func (c *Command) fetchHealthChecks(consulClient *api.Client, taskMeta awsutil.ECSTaskMeta) (map[string]*api.HealthCheck, error) {
