refactor(cilium): drop cilium_envoy_resync — 1.19.4 ports stable#213

Merged
jvcorredor merged 1 commit into main from homelab-210-remove-cilium-envoy-resync
May 17, 2026

@jvcorredor

What

Removes the terraform_data.cilium_envoy_resync mitigation from terraform/bootstrap/cilium.tf and its references — the last open acceptance criterion of #198, tracked in #210.

Why

The resync (added in #196) force-rolled cilium-envoy on every Cilium release change to cover an L7 proxy-port desync: a restarted cilium-agent re-allocated the Gateway API proxy ports while the un-rolled envoy pods kept their old listeners, blackholing *.lab.jackhall.dev. Removing it was gated on a runtime check that could only run once hop 3 (Cilium 1.19.4, #209) was live.

Runtime test — rockingham, Cilium 1.19.4

Rolled the cilium-agent DaemonSet (kubectl -n kube-system rollout restart daemonset/cilium); cilium-envoy left untouched. Compared cilium-dbg bpf lb list L7LB ports against cilium-dbg envoy admin listeners before and after:

| Node | lab BPF / envoy (before → after) | projects BPF / envoy (before → after) |
|---|---|---|
| worker-01 | 16276/16276 → 16276/16276 ✓ | 16941/16941 → 16941/16941 ✓ |
| worker-02 | 10776/10776 → 10776/10776 ✓ | 10669/10669 → 10669/10669 ✓ |
| worker-03 | 15083/15083 → 15083/15083 ✓ | 18165/18165 → 18165/18165 ✓ |
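
The comparison in the table could be automated roughly as follows. This is a minimal sketch, not the check that was actually run: the `Snapshot` shape and the `find_port_drift` helper are hypothetical, and parsing the `cilium-dbg` output into port numbers is assumed to happen elsewhere. The port values are copied from the table.

```python
from typing import Dict, List, Tuple

# Hypothetical shape: node -> gateway -> (bpf_port, envoy_port).
Snapshot = Dict[str, Dict[str, Tuple[int, int]]]


def find_port_drift(before: Snapshot, after: Snapshot) -> List[str]:
    """Return a list of problems; an empty list means the ports are stable."""
    problems = []
    for node, gateways in after.items():
        for gateway, (bpf, envoy) in gateways.items():
            # BPF and envoy must agree on the port after the roll.
            if bpf != envoy:
                problems.append(f"{node}/{gateway}: BPF {bpf} != envoy {envoy}")
            # The pair must also match what was recorded before the roll.
            old = before.get(node, {}).get(gateway)
            if old != (bpf, envoy):
                problems.append(f"{node}/{gateway}: changed {old} -> {(bpf, envoy)}")
    return problems


# Values from the table; before and after were the same on every node.
observed: Snapshot = {
    "worker-01": {"lab": (16276, 16276), "projects": (16941, 16941)},
    "worker-02": {"lab": (10776, 10776), "projects": (10669, 10669)},
    "worker-03": {"lab": (15083, 15083), "projects": (18165, 18165)},
}

if __name__ == "__main__":
    print(find_port_drift(observed, observed) or "ports stable")
```

An empty result corresponds to the ✓ marks in the table; any re-allocated or mismatched port would show up as one or more problem strings.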

Every L7 proxy port was unchanged across a full agent roll, and BPF and envoy agreed on every node. A continuous reachability probe against the lab Gateway returned HTTP 200 on every request through the roll window. 1.19.4 fixed the underlying instability, so the mitigation is no longer needed.
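
For reference, a continuous reachability probe of this kind might look like the sketch below. The probe actually used is not shown in this PR, so the `http_status` and `probe` helpers, the interval, and the attempt count are all assumptions for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status for url, or 0 if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses raise HTTPError; the status code is still useful.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def probe(fetch: Callable[[], int], attempts: int, interval: float = 0.0) -> List[int]:
    """Call fetch `attempts` times and return every status that was not 200."""
    failures = []
    for _ in range(attempts):
        status = fetch()
        if status != 200:
            failures.append(status)
        time.sleep(interval)
    return failures
```

A run during the roll window would pass a closure over the Gateway URL, for example `probe(lambda: http_status(url), attempts=600, interval=0.5)`, where `url` is whichever hostname under `*.lab.jackhall.dev` the lab Gateway serves. An empty list means every request returned HTTP 200.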

Changes

  • cilium.tf — removed the terraform_data.cilium_envoy_resync resource block + its leading comment, and the local.cilium_values comment explaining why the values were held in a local for hash-triggering.
  • terraform/bootstrap/README.md — removed the resync paragraph from "Upgrading Cilium".

var.kube_context is kept — still used by providers.tf.

Verification

  • tofu fmt -check / tofu validate — pass.
  • tofu plan against live rockingham state: 0 to add, 0 to change, 1 to destroy — only terraform_data.cilium_envoy_resync, a no-op state removal (no destroy-time provisioner).
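
The plan summary check above could be scripted along these lines. This is a sketch under the assumption that the summary line has exactly the format shown in this PR's plan output; `parse_plan_summary` and `is_destroy_only` are hypothetical helper names, not part of OpenTofu.

```python
import re
from typing import NamedTuple, Optional


class PlanSummary(NamedTuple):
    add: int
    change: int
    destroy: int


# Assumes the summary format seen in this PR's plan output.
_SUMMARY = re.compile(r"Plan: (\d+) to add, (\d+) to change, (\d+) to destroy\.")


def parse_plan_summary(plan_text: str) -> Optional[PlanSummary]:
    """Extract the add/change/destroy counts, or None if no summary line is found."""
    match = _SUMMARY.search(plan_text)
    if match is None:
        return None
    return PlanSummary(*(int(n) for n in match.groups()))


def is_destroy_only(summary: PlanSummary, expected_destroys: int) -> bool:
    """True when the plan adds and changes nothing and destroys exactly the expected count."""
    return summary == PlanSummary(0, 0, expected_destroys)
```

For this PR the expected result is `PlanSummary(add=0, change=0, destroy=1)`, so `is_destroy_only(summary, 1)` would hold.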

Post-merge step

After merge, run tofu apply on rockingham (destroys the terraform_data resource) and confirm a second tofu plan shows No changes — the final acceptance criterion.
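
The second-plan confirmation can also be checked by exit code rather than by reading the output. The sketch below relies on the `-detailed-exitcode` flag of `tofu plan` (0 for an empty diff, 1 for an error, 2 for pending changes); the wrapper functions and the working directory argument are illustrative assumptions.

```python
import subprocess


def describe_plan_exit(code: int) -> str:
    """Map a `tofu plan -detailed-exitcode` exit status to a short description."""
    if code == 0:
        return "no changes"
    if code == 2:
        return "changes pending"
    return "error"


def plan_is_clean(workdir: str) -> bool:
    """Run a plan in workdir and report whether it shows no changes."""
    result = subprocess.run(
        ["tofu", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
    )
    return describe_plan_exit(result.returncode) == "no changes"


if __name__ == "__main__":
    # Assumed to run from the repository root after `tofu apply` has completed.
    print("clean" if plan_is_clean("terraform/bootstrap") else "not clean")
```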

Closes: #210
Refs: #196, #198

🤖 Generated with Claude Code

The `terraform_data.cilium_envoy_resync` mitigation (#196) force-rolled
`cilium-envoy` on every Cilium release change to cover L7 proxy-port
desync: a restarted `cilium-agent` re-allocated the Gateway API proxy
ports while the un-rolled envoy pods kept their old listeners, dead-
ending `*.lab.jackhall.dev` traffic.

A `cilium-agent` rollout on Cilium 1.19.4 was tested on `rockingham`
(the last open #198 acceptance criterion, #210). Across the roll the
L7LB BPF ports stayed byte-identical to the `cilium-envoy` listener
ports on all three workers (lab: 16276/10776/15083, projects:
16941/10669/18165), and the `lab` Gateway served HTTP 200 throughout.
1.19.4 fixed the underlying instability, so the mitigation is no
longer needed.

Removes the resource and its leading comment, the `local.cilium_values`
comment explaining the hash-trigger rationale, and the resync paragraph
in the bootstrap README.

`tofu plan` against live `rockingham` state: 0 to add, 0 to change,
1 to destroy — the `terraform_data` resource only, a no-op state
removal (no destroy-time provisioner).

Closes: #210
Refs: #196, #198

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Terraform plan: terraform/bootstrap/

Changes detected. Review the plan before merging.

Commit: d3d04ae35e5e53fffdd1fdcd1db4a51bf1928b82 · Job log

Plan output
data.terraform_remote_state.gcp: Reading...
data.http.gateway_api_crds: Reading...
data.http.local_path_manifest: Reading...
data.http.local_path_manifest: Read complete after 0s [id=https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.32/deploy/local-path-storage.yaml]
data.kubectl_file_documents.local_path: Reading...
data.kubectl_file_documents.local_path: Read complete after 0s [id=1389a68e17a3035b7be0fdada9a9ecc7063cc2e5fee88fcbf9bfd87c0e30a38c]
data.http.gateway_api_crds: Read complete after 0s [id=https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/experimental-install.yaml]
data.kubectl_file_documents.gateway_api_crds: Reading...
data.kubectl_file_documents.gateway_api_crds: Read complete after 0s [id=553327e0ff32a1a2be446bf93823c8413cf9253ac6a6d5407eebd1e8d269f69e]
data.terraform_remote_state.gcp: Read complete after 1s

OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  - destroy

OpenTofu will perform the following actions:

  # terraform_data.cilium_envoy_resync will be destroyed
  # (because terraform_data.cilium_envoy_resync is not in configuration)
  - resource "terraform_data" "cilium_envoy_resync" {
      - id               = "9e067c29-6520-6de5-ee01-2a29d58e005f" -> null
      - triggers_replace = [
          - "1.19.4",
          - "fb40918dcac6f3ebe05df30a8928c3c106775257",
        ] -> null
    }

Plan: 0 to add, 0 to change, 1 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: tfplan

To perform exactly these actions, run the following command to apply:
    tofu apply "tfplan"

@jvcorredor jvcorredor merged commit 66b9e65 into main May 17, 2026
4 checks passed
@jvcorredor jvcorredor deleted the homelab-210-remove-cilium-envoy-resync branch May 17, 2026 02:35

Development

Successfully merging this pull request may close these issues.

ops(cilium): re-evaluate and remove cilium_envoy_resync after hop 3 lands
