Fix health-sync logic #320

Open

kswap wants to merge 3 commits into main from ft_ecs_health_check_fix

Conversation


@kswap (Contributor) commented Mar 12, 2026

Changes proposed in this PR:

  • Fix the consul-dataplane delta check to eliminate redundant catalog writes.
  • Double-write bug: when consul-dataplane was absent from task metadata, both the missing-containers loop and the aggregation block called handleHealthForDataplaneContainer with an unhealthy status on the same tick.
  • Inconsistent status formats: both the ECS format (healthy/unhealthy) and the Consul format (Critical/Passing) were in use. Fixed to use one format consistently.
  • If a container listed in healthSyncContainers is missing from ECS metadata entirely, it is now treated as unhealthy (critical), not as absent/ignored.
  • Added tests for all of these cases.
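The fixed behavior described above can be sketched as a single pass that treats missing containers as critical and only writes on a status change. This is an illustrative, self-contained sketch; `syncHealth` and the maps it takes are hypothetical stand-ins, not the actual consul-ecs code:

```go
package main

import "fmt"

// Consul's status vocabulary, used consistently instead of mixing in
// ECS's healthy/unhealthy strings.
const (
	Passing  = "passing"
	Critical = "critical"
)

// syncHealth compares the desired status of every container in
// healthSyncContainers against the last status written to Consul and
// returns only the writes that are actually needed. A container missing
// from ECS task metadata is treated as critical, not skipped, and each
// container is considered exactly once per tick (no double write).
func syncHealth(healthSyncContainers []string, ecsStatuses, currentStatuses map[string]string) map[string]string {
	writes := make(map[string]string)
	for _, name := range healthSyncContainers {
		status, found := ecsStatuses[name]
		if !found {
			status = Critical // missing from task metadata => critical
		}
		// Delta check: only write when the status actually changed.
		if currentStatuses[name] != status {
			writes[name] = status
		}
	}
	return writes
}

func main() {
	writes := syncHealth(
		[]string{"app", "consul-dataplane"},
		map[string]string{"app": Passing}, // dataplane absent from metadata
		map[string]string{"app": Passing, "consul-dataplane": Critical},
	)
	// The dataplane is already critical in Consul, so no redundant write.
	fmt.Println(writes)
}
```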

GitHub issue 1: ECS Health Sync Sending Traffic to Missing Unhealthy Containers
GitHub issue 2: health-sync sends excessive catalog registrations for dataplane container health

How I've tested this PR:

  • Added unit tests for all the changes done.

  • Tests on local setup with logs:

  • No redundant write test: Bug1_test
  • Eliminate redundant catalog writes (a status change still triggers a write): status_change_triggers_write
  • A container listed in healthSyncContainers but missing from ECS metadata entirely is treated as unhealthy (critical), not as absent/ignored: Bug2_test
  • Regression test (healthy proxy -> unhealthy -> healthy): Regression
  • Regression test (if the ECS task is gracefully stopped, health-sync correctly cleans up in Consul before the task exits): Regression 2

How I expect reviewers to test this PR:

Checklist:

  • Tests added
  • CHANGELOG entry added

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

    Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.

@kswap kswap requested review from a team as code owners March 12, 2026 07:11
@anandmukul93 (Contributor) commented:

I didn't understand this testing use case.
Important: "Container listed in healthSyncContainers - if missing from ECS metadata entirely, treat it as unhealthy (critical), not as absent/ignored."

Can you tell me why it is passing while initializing?

If you look at this PR, it has 3 distinct steps:
https://github.com/hashicorp/consul-ecs/pull/310/changes
  1. Gather state - fetch container health from ECS task metadata, include the dataplane, and default missing containers to unhealthy.
  2. Compute checks - transform container health into Consul check statuses.
  3. Update Consul - send updates only for checks whose status has changed.
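Those three steps could be factored roughly as below. This is only a sketch of the shape of the pipeline; none of these function names come from PR #310, and the maps stand in for the real metadata and Consul API types:

```go
package main

import "fmt"

// gatherState: fetch container health from ECS task metadata, defaulting
// containers that are missing from the metadata to "UNHEALTHY".
func gatherState(expected []string, taskMeta map[string]string) map[string]string {
	state := make(map[string]string)
	for _, name := range expected {
		if s, ok := taskMeta[name]; ok {
			state[name] = s
		} else {
			state[name] = "UNHEALTHY"
		}
	}
	return state
}

// computeChecks: translate ECS health strings into Consul check statuses.
func computeChecks(state map[string]string) map[string]string {
	checks := make(map[string]string)
	for name, s := range state {
		if s == "HEALTHY" {
			checks[name] = "passing"
		} else {
			checks[name] = "critical"
		}
	}
	return checks
}

// updateConsul: report only the checks whose status changed since the
// last write; a real implementation would call the Consul API here.
func updateConsul(checks, current map[string]string) []string {
	var updated []string
	for name, status := range checks {
		if current[name] != status {
			updated = append(updated, name)
		}
	}
	return updated
}

func main() {
	state := gatherState([]string{"app", "consul-dataplane"}, map[string]string{"app": "HEALTHY"})
	checks := computeChecks(state)
	// Only the dataplane changed (missing => critical), so only it is written.
	fmt.Println(updateConsul(checks, map[string]string{"app": "passing"}))
}
```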

Overall major changes:
We have split paths for missing and toSync containers; this could be organized better, as in the PR above.
Why are we assuming the dataplane sync is or isn't done? The dataplane container starts the proxy and runs an xDS server for configuring Envoy (gateway or sidecar), so we register the dataplane, and it registers two checks: proxy and dataplane. If we don't maintain those, Envoy can be wrongly configured for a long time, resulting in outdated behavior while still appearing healthy. So overall health should include dataplane + proxy + app container, in my view.
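The "overall health = dataplane + proxy + app" argument amounts to a simple fold: the service is passing only if every contributing check is passing. A minimal sketch (the function name is hypothetical):

```go
package main

import "fmt"

// aggregateStatus folds individual check statuses into one overall
// status: passing only if every contributing check is passing. The point
// from the review comment is that dataplane + proxy + app all feed the
// overall health, since a broken dataplane means Envoy's configuration
// can go stale even while the app container itself looks healthy.
func aggregateStatus(statuses ...string) string {
	for _, s := range statuses {
		if s != "passing" {
			return "critical"
		}
	}
	return "passing"
}

func main() {
	// dataplane, proxy, app
	fmt.Println(aggregateStatus("passing", "passing", "critical")) // critical
}
```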

Also:
subcommand/mesh-init/checks.go: constructChecks() always appends a dataplane readiness check for the given service ID (CheckID = -consul-dataplane).

A typical service has 3 synced checks when you have one app container in healthSyncContainers:
service check for app container (-)
service check for dataplane (-consul-dataplane)
proxy check for dataplane (-consul-dataplane)
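That set of check IDs could be enumerated roughly as follows. Apart from the "-consul-dataplane" CheckID suffix mentioned above, the ID patterns here (including the proxy-check suffix) are assumptions for illustration, not the actual consul-ecs naming:

```go
package main

import "fmt"

// checkIDs sketches the check IDs a service ends up with when one app
// container is listed in healthSyncContainers: one service check per app
// container, plus a service check and a proxy check for the dataplane.
// Only the "-consul-dataplane" suffix is taken from the review comment;
// the other patterns are hypothetical.
func checkIDs(serviceID string, appContainers []string) []string {
	var ids []string
	for _, c := range appContainers {
		ids = append(ids, serviceID+"-"+c) // service check for app container
	}
	ids = append(ids,
		serviceID+"-consul-dataplane",               // service check for dataplane
		serviceID+"-sidecar-proxy-consul-dataplane", // proxy check (name assumed)
	)
	return ids
}

func main() {
	fmt.Println(checkIDs("web", []string{"app"}))
}
```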

It all starts from healthSyncContainers, which contains config.HealthSyncContainers, which is basically:

    merge(
      def,
      {
        dependsOn = flatten(
          concat(
            lookup(def, "dependsOn", []),
            [
              {
                containerName = "consul-dataplane"
                condition     = "HEALTHY"
              }
            ]
        ))
      },
      {
        // Use the def.entryPoint, if defined. Else, use the app_entrypoint, which is null by default.
        entryPoint = lookup(def, "entryPoint", local.app_entrypoint)
        mountPoints = flatten(
          concat(
            lookup(def, "mountPoints", []),
            local.app_mountpoints,
          )
        )
      }
    )
  ]


  defaulted_check_containers (the health-sync containers) = [for def in local.container_defs_with_depends_on : def.name]


  # dataplane is added on top of the app container definitions in the health-sync command.
  
  
1) Missing vs. present containers still behave differently (and why that matters)
In the current subcommand/health-sync/checks.go:
  • The missing loop writes immediately for each missing non-dataplane container (the updateConsulHealthStatus call around line ~121), without checking currentStatuses[name].
  • The present loop first checks if container.Health.Status == currentStatuses[container.Name] { continue } (around line ~147), so unchanged present containers skip writes.
Impact:
  • A container that stays missing for many ticks can cause repeated Catalog.Register writes each cycle.
  • Present containers do not have that churn.
  • This asymmetry can create noisy logs and unnecessary Consul writes under startup lag or partial task metadata.
  • It is behaviorally inconsistent with the "checks only updated when changed" promise made in comments in the same file.
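The cost of that asymmetry is easy to see with a toy tick simulation (illustrative only; `simulateTicks` is hypothetical and just counts writes, it does not model the real loops):

```go
package main

import "fmt"

// simulateTicks counts Catalog writes for a container that stays missing
// from ECS metadata for n consecutive ticks, with and without a delta
// check against the status last written to Consul.
func simulateTicks(n int, deltaCheck bool) int {
	writes := 0
	current := "" // status last written to Consul; empty = never written
	for i := 0; i < n; i++ {
		newStatus := "critical" // missing container => critical each tick
		if !deltaCheck || current != newStatus {
			writes++ // a real implementation would call Catalog.Register here
			current = newStatus
		}
	}
	return writes
}

func main() {
	fmt.Println(simulateTicks(5, false)) // 5 redundant writes without the check
	fmt.Println(simulateTicks(5, true))  // 1 write with the delta check
}
```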


2) Dataplane special-case logic is still scattered (and why that matters)
Current dataplane handling is split across several locations in subcommand/health-sync/checks.go:
  • the setChecksCritical dataplane branch (~74)
  • the missing-loop skip for the dataplane (~115)
  • the found-loop skip for the dataplane (~136)
  • the final aggregate dataplane write (~167)
  • the dedicated helper for dual service/proxy check updates (~208)
Impact:
  • Future behavior changes must touch multiple places; it is easy to miss one.
  • It is harder to reason about invariants ("who owns the dataplane update in this cycle?").
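One way to restore a single owner for that invariant is to compute the dataplane's desired status in exactly one place per cycle. A sketch under that assumption (the function is hypothetical, not a proposed diff):

```go
package main

import "fmt"

// dataplaneStatus is the single place that decides the dataplane's
// status for a cycle, replacing the scattered special cases (the
// setChecksCritical branch, the missing-loop skip, the found-loop skip,
// and the final aggregate write). The caller then writes at most once.
func dataplaneStatus(presentInMeta bool, ecsHealth string) string {
	if !presentInMeta || ecsHealth != "HEALTHY" {
		return "critical"
	}
	return "passing"
}

func main() {
	fmt.Println(dataplaneStatus(false, ""))        // absent from metadata => critical
	fmt.Println(dataplaneStatus(true, "HEALTHY")) // present and healthy => passing
}
```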
 
We may also look at this function, which makes multiple calls for different health checks. If we can combine the health checks into one query, that would be great:

    func (c *Command) fetchHealthChecks(consulClient *api.Client, taskMeta awsutil.ECSTaskMeta) (map[string]*api.HealthCheck, error) {
