refactor(cilium): drop cilium_envoy_resync — 1.19.4 ports stable #213
Merged
The `terraform_data.cilium_envoy_resync` mitigation (#196) force-rolled `cilium-envoy` on every Cilium release change to cover L7 proxy-port desync: a restarted `cilium-agent` re-allocated the Gateway API proxy ports while the un-rolled envoy pods kept their old listeners, dead-ending `*.lab.jackhall.dev` traffic.

A `cilium-agent` rollout on Cilium 1.19.4 was tested on `rockingham` (the last open #198 acceptance criterion, #210). Across the roll the L7LB BPF ports stayed byte-identical to the `cilium-envoy` listener ports on all three workers (lab: 16276/10776/15083, projects: 16941/10669/18165), and the `lab` Gateway served HTTP 200 throughout. 1.19.4 fixed the underlying instability, so the mitigation is no longer needed.

Removes the resource and its leading comment, the `local.cilium_values` comment explaining the hash-trigger rationale, and the resync paragraph in the bootstrap README.

`tofu plan` against live `rockingham` state: 0 to add, 0 to change, 1 to destroy. The only destroyed resource is the `terraform_data` resource, a no-op state removal (no destroy-time provisioner).

Closes: #210
Refs: #196, #198
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
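The commit message notes that the `lab` Gateway served HTTP 200 throughout the agent roll. A probe of that kind might look like the sketch below. The helper names `probe` and `count_failures` are hypothetical, the one-second interval and two-second timeout are arbitrary choices, and the URL is left as a parameter because the exact hostname probed is not stated.

```sh
# probe URL COUNT: request URL once per second, COUNT times, printing one
# HTTP status per line. curl reports 000 when it gets no HTTP response.
probe() {
  url="$1"
  count="$2"
  i=0
  while [ "$i" -lt "$count" ]; do
    curl -s -o /dev/null --max-time 2 -w '%{http_code}\n' "$url" || true
    i=$((i + 1))
    sleep 1
  done
}

# count_failures: read statuses on stdin and print how many were not 200.
count_failures() {
  # grep exits 1 when it selects no lines; that is the good case here.
  grep -vc '^200$' || true
}
```

One way to use it is to start `probe "$URL" 300 > statuses.txt` in a second terminal before the rollout restart, then run `count_failures < statuses.txt` once the roll has finished. A result of `0` means every request in the window returned 200.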
What
Removes the `terraform_data.cilium_envoy_resync` mitigation from `terraform/bootstrap/cilium.tf` and its references. This is the last open acceptance criterion of #198, tracked in #210.

Why
The resync (added in #196) force-rolled `cilium-envoy` on every Cilium release change to cover an L7 proxy-port desync: a restarted `cilium-agent` re-allocated the Gateway API proxy ports while the un-rolled envoy pods kept their old listeners, blackholing `*.lab.jackhall.dev`. Removing it was gated on a runtime check that could only run once hop 3 (Cilium 1.19.4, #209) was live.

Runtime test: `rockingham`, Cilium 1.19.4
Rolled the `cilium-agent` DaemonSet (`kubectl -n kube-system rollout restart daemonset/cilium`); `cilium-envoy` left untouched. Compared `cilium-dbg bpf lb list` L7LB ports against `cilium-dbg envoy admin listeners` before and after:

[Table: `lab` BPF / envoy (before → after), `projects` BPF / envoy (before → after)]

Every L7 proxy port stayed byte-identical across a full agent roll, with BPF and envoy in agreement on every node. A continuous reachability probe against the `lab` Gateway returned HTTP `200` on every request through the roll window. 1.19.4 fixed the underlying instability, so the mitigation is no longer needed.

Changes
- `cilium.tf`: removed the `terraform_data.cilium_envoy_resync` resource block and its leading comment, and the `local.cilium_values` comment explaining why the values were held in a `local` for hash-triggering.
- `terraform/bootstrap/README.md`: removed the resync paragraph from "Upgrading Cilium".
- `var.kube_context` is kept; it is still used by `providers.tf`.

Verification
- `tofu fmt -check` / `tofu validate`: pass.
- `tofu plan` against live `rockingham` state: `0 to add, 0 to change, 1 to destroy`. Only `terraform_data.cilium_envoy_resync` is destroyed, a no-op state removal (no destroy-time provisioner).

Post-merge step
After merge, run `tofu apply` on `rockingham` (destroys the `terraform_data` resource) and confirm a second `tofu plan` shows `No changes`. This is the final acceptance criterion.

Closes: #210
Refs: #196, #198
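The runtime test described above could be reproduced with something like the following sketch. It is not the script used for this PR. The helper names `snapshot` and `compare` are hypothetical, and the `k8s-app=cilium` label selector and the `cilium-agent` container name are assumptions based on upstream chart defaults. Running `cilium-dbg` inside the agent pod is also an assumption. The sketch saves raw command output per node and diffs it, without trying to parse either command's format.

```sh
# Assumed defaults; override via the environment if the cluster differs.
NS="${NS:-kube-system}"
AGENT_SELECTOR="${AGENT_SELECTOR:-k8s-app=cilium}"

# snapshot DIR: for each cilium-agent pod, save the BPF load-balancer table
# and the envoy listener list under DIR. Files are keyed by node name,
# because pod names change when the DaemonSet is rolled.
snapshot() {
  dir="$1"
  mkdir -p "$dir"
  kubectl -n "$NS" get pods -l "$AGENT_SELECTOR" \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{" "}{.metadata.name}{"\n"}{end}' |
  while read -r node pod; do
    # </dev/null keeps kubectl from consuming the loop's input.
    kubectl -n "$NS" exec "$pod" -c cilium-agent -- \
      cilium-dbg bpf lb list > "$dir/$node.bpf-lb.txt" </dev/null
    kubectl -n "$NS" exec "$pod" -c cilium-agent -- \
      cilium-dbg envoy admin listeners > "$dir/$node.envoy-listeners.txt" </dev/null
  done
}

# compare BEFORE AFTER: succeed only if both snapshot directories match.
compare() {
  diff -r "$1" "$2"
}
```

A run would be `snapshot before`, then the `kubectl -n kube-system rollout restart daemonset/cilium` from the PR, then `snapshot after` once the roll completes, and finally `compare before after`. A raw diff is strict, so lines unrelated to the L7 proxy ports may need filtering before the comparison is meaningful.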
🤖 Generated with Claude Code
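The post-merge check (a second `tofu plan` showing `No changes`) can be scripted rather than read by eye. The sketch below relies on the `-detailed-exitcode` flag of `tofu plan`, which exits 0 when there are no changes, 2 when changes are pending, and 1 on error. The helper name `plan_status` is hypothetical, and the directory argument is an assumption about where the configuration lives.

```sh
# plan_status DIR: run `tofu plan` in DIR and print one of
# no-changes, changes-pending, or error.
plan_status() {
  rc=0
  tofu -chdir="$1" plan -detailed-exitcode -input=false >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "no-changes" ;;
    2) echo "changes-pending" ;;
    *) echo "error" ;;
  esac
}
```

After `tofu apply` on `rockingham`, calling `plan_status terraform/bootstrap` should print `no-changes` if the acceptance criterion is met.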