Fix IPv6 CIDR flapping in k8s.ovn.org/host-cidrs annotation #3167
smulje wants to merge 1 commit into openshift:release-4.21
Fixes a race condition where the k8s.ovn.org/host-cidrs annotation continuously flaps between dual-stack and IPv4-only states on nodes with IPv6 enabled (particularly with SLAAC).
Root cause: The addressManager.sync() function scans network interfaces every 30s to rebuild the address map. When IPv6 addresses are still in "tentative" state during Duplicate Address Detection (DAD), sync() misses them and updates the annotation to IPv4-only. The netlink watcher later discovers the IPv6 address via kernel events and updates back to dual-stack, creating a continuous flapping cycle.
Solution: Perform two interface scans with a 100ms delay and take the union of both results. This ensures IPv6 addresses completing DAD during the first scan are captured in the second scan.
Changes:
1. New scanInterfaceAddresses() helper function that extracts interface scanning logic for reuse and returns (sets.Set[string], error) to propagate failures
2. Modified sync() function performs double-scan with union and includes error handling with fallback logic:
   - Both scans succeed: use union (catches IPv6 during DAD)
   - One scan fails: use the successful scan result
   - Both scans fail: abort early to preserve existing annotation
3. Better error handling: continues on per-interface errors instead of aborting entire sync, and prevents transient netlink.LinkList() failures from incorrectly clearing the annotation
The 100ms delay accommodates typical IPv6 DAD completion times (most complete within 100-200ms) without significantly impacting sync performance (runs every 30s).
Tested on OpenShift 4.20.1 with dual-stack networking. Before fix: annotation flapped every 40-70s. After fix: annotation remains stable.
Summary
Fixes a race condition where the k8s.ovn.org/host-cidrs annotation continuously flaps between dual-stack and IPv4-only states on nodes with IPv6 enabled (particularly with SLAAC).
Problem
On nodes with IPv6 enabled, the k8s.ovn.org/host-cidrs annotation is unstable, alternating between:
- ["10.46.xx.62/21","2620:52:0:2ef8:7058:xxx:3bd9:e3dc/64"] (dual-stack)
- ["10.46.xx.62/21"] (IPv4-only)
This flapping occurs every 30-60 seconds and impacts workloads that depend on stable host network information.
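Root Cause
The addressManager.sync() function has a race condition with IPv6 Duplicate Address Detection (DAD):
1. sync() scans network interfaces every 30s to rebuild the address map.
2. IPv6 addresses that are still in the "tentative" state during DAD are missed, so sync() updates the annotation to IPv4-only.
3. The netlink watcher later discovers the IPv6 address via kernel events and updates the annotation back to dual-stack.
4. The cycle repeats, producing continuous flapping.
Solution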
Perform two interface scans with a 100ms delay and take the union of both results. This ensures IPv6 addresses that are completing DAD during the first scan are captured in the second scan.
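The double-scan idea can be sketched in isolation as follows. This is a minimal illustration, not the code from this PR: addrSet stands in for sets.Set[string], and doubleScan and the simulated scan function are hypothetical names invented for the example. The addresses are documentation placeholders.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// addrSet is a minimal stand-in for sets.Set[string] (assumption: only
// membership and union are needed for this illustration).
type addrSet map[string]struct{}

// doubleScan calls scan twice, delay apart, and returns the union of both
// results. scan is a hypothetical placeholder for one pass over the node's
// interfaces.
func doubleScan(scan func() addrSet, delay time.Duration) addrSet {
	first := scan()
	time.Sleep(delay)
	second := scan()

	union := make(addrSet, len(first)+len(second))
	for cidr := range first {
		union[cidr] = struct{}{}
	}
	for cidr := range second {
		union[cidr] = struct{}{}
	}
	return union
}

// sortedCIDRs returns the set's members in a stable order for printing.
func sortedCIDRs(s addrSet) []string {
	out := make([]string, 0, len(s))
	for cidr := range s {
		out = append(out, cidr)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Simulated scanner: the IPv6 address only becomes visible on the second
	// pass, as if DAD finished during the delay.
	calls := 0
	scan := func() addrSet {
		calls++
		found := addrSet{"192.0.2.10/24": {}}
		if calls > 1 {
			found["2001:db8::10/64"] = struct{}{}
		}
		return found
	}
	fmt.Println(sortedCIDRs(doubleScan(scan, 100*time.Millisecond)))
}
```

Taking the union rather than the second result alone means an address seen in either pass is kept, so a scan that briefly misses an address cannot shrink the set on its own.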
Changes
1. New helper function scanInterfaceAddresses(): Extracts interface scanning logic for reuse and returns (sets.Set[string], error) to propagate failures.
2. Modified sync() function: Performs double-scan with union and includes error handling with fallback logic:
   - Both scans succeed: use the union (catches IPv6 during DAD)
   - One scan fails: use the successful scan result
   - Both scans fail: abort early to preserve the existing annotation
3. Better error handling: Continues on per-interface errors instead of aborting the entire sync, and prevents transient netlink.LinkList() failures from incorrectly clearing the annotation.
The 100ms delay accommodates typical IPv6 DAD completion times (most complete within 100-200ms) without significantly impacting sync performance (runs every 30s).
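The fallback rules and the continue-on-error behaviour can be sketched as below. This is an illustration under stated assumptions rather than the PR's implementation: addrSet stands in for sets.Set[string], and mergeScans, collect, errScanFailed, and the interface names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// addrSet is a minimal stand-in for sets.Set[string].
type addrSet map[string]struct{}

// errScanFailed is a hypothetical sentinel used only by this example.
var errScanFailed = errors.New("scan failed")

// mergeScans applies the fallback rules to the outcome of two scans:
// both ok -> union, one ok -> that result, both failed -> error so the
// caller can leave the existing annotation untouched.
func mergeScans(first addrSet, err1 error, second addrSet, err2 error) (addrSet, error) {
	switch {
	case err1 != nil && err2 != nil:
		return nil, fmt.Errorf("both scans failed: %v; %v", err1, err2)
	case err1 != nil:
		return second, nil
	case err2 != nil:
		return first, nil
	}
	union := make(addrSet, len(first)+len(second))
	for cidr := range first {
		union[cidr] = struct{}{}
	}
	for cidr := range second {
		union[cidr] = struct{}{}
	}
	return union, nil
}

// collect gathers addresses link by link. A failure on one link is recorded
// and skipped instead of aborting the whole scan. addrsOf is a hypothetical
// placeholder for a per-interface address lookup.
func collect(links []string, addrsOf func(link string) ([]string, error)) (addrSet, []error) {
	found := addrSet{}
	var skipped []error
	for _, link := range links {
		addrs, err := addrsOf(link)
		if err != nil {
			skipped = append(skipped, fmt.Errorf("link %s: %v", link, err))
			continue
		}
		for _, cidr := range addrs {
			found[cidr] = struct{}{}
		}
	}
	return found, skipped
}

func main() {
	lookup := func(link string) ([]string, error) {
		if link == "eth1" {
			return nil, errScanFailed // simulated per-interface failure
		}
		return []string{"192.0.2.10/24"}, nil
	}
	first, skipped := collect([]string{"eth0", "eth1"}, lookup)
	fmt.Println(len(first), len(skipped))

	// Second scan failed outright: fall back to the first result.
	merged, err := mergeScans(first, nil, nil, errScanFailed)
	fmt.Println(len(merged), err)
}
```

Returning an error only when both scans fail lets the caller distinguish "no usable data" from "no addresses", which is what keeps a transient failure from clearing the annotation.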
Testing
Reproduction Environment
- OpenShift 4.20.1 with dual-stack networking
Reproduction Steps
oc get node <node> -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/host-cidrs}'
Verification
Before fix: Flapping detected every 40-70 seconds
After fix: Annotation remains stable with dual-stack
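One way to quantify flapping from repeated readings of the annotation is to count how often consecutive samples differ. The sketch below assumes the samples have already been collected as raw strings, for example by running the oc command above at a fixed interval. countFlaps is a hypothetical helper, and the sample values are placeholders rather than data from the tested cluster.

```go
package main

import "fmt"

// countFlaps returns how many times the annotation value changed between
// consecutive samples. Samples are compared as raw strings.
func countFlaps(samples []string) int {
	flaps := 0
	for i := 1; i < len(samples); i++ {
		if samples[i] != samples[i-1] {
			flaps++
		}
	}
	return flaps
}

func main() {
	dual := `["192.0.2.10/24","2001:db8::10/64"]`
	v4only := `["192.0.2.10/24"]`

	// Placeholder series resembling the flapping pattern described above.
	flapping := []string{dual, v4only, dual, dual, v4only}
	// Placeholder series resembling a stable annotation.
	stable := []string{dual, dual, dual, dual, dual}

	fmt.Println("flapping series changes:", countFlaps(flapping))
	fmt.Println("stable series changes:", countFlaps(stable))
}
```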
Impact
Related Issues