e2e: EVPN with disruptive actions#3161
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jechen0648 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Enterprise Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (16)
WalkthroughRepoints various internal OpenShift test imports from Changes
Sequence DiagramsequenceDiagram
participant Test as Test Suite
participant FRR as External FRR (Spine)
participant OVN as OVN-K / FRR-K8s (DaemonSets)
participant Node as Worker Node
participant Pod as EVPN Pods
Test->>FRR: Create VNIs/VRFs, BGP config, VTEPs
FRR-->>Test: BGP/EVPN ready
Test->>OVN: Deploy pinned EVPN pods
OVN-->>Test: Pods Running
Test->>Test: Verify connectivity & isolation
loop For each disruption scenario
alt Node reboot
Test->>Node: Reboot (provider RebootNode)
Node-->>Test: Node back
Test->>OVN: Wait daemonset rollout / recreate pods
Test->>FRR: Re-add VTEP loopback
else OVN-K restart
Test->>OVN: Restart ovnkube-node daemonset
OVN-->>Test: Rollout complete
else FRR-K8s restart
Test->>OVN: Restart FRR-K8s daemonset
OVN-->>Test: Rollout complete
else Spine restart
Test->>FRR: Destroy/recreate kernel state + restart daemons
FRR-->>Test: Ready
end
Test->>Test: Wait BGP/EVPN convergence
Test->>Test: Re-verify connectivity & isolation
end
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~40 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (4)
openshift/go.mod (2)
234-254: Maintainability:replaceoverrides manyk8s.io/*versions.The
requireblock pins e.g.k8s.io/api v0.35.1, but thereplaceblock overrides it togithub.com/openshift/kubernetes/staging/...at a different pseudo-version. This is probably intended for OpenShift compatibility, but it’s easy for future maintainers to misread.Consider adding a short comment (if not already elsewhere) clarifying that the
requireversions are largely placeholders/constraints and thereplaceblock is the real source of truth for the staging modules.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openshift/go.mod` around lines 234 - 254, Add a brief clarifying comment near the go.mod replace block explaining that the top-level require pins (e.g. k8s.io/api v0.35.1) are nominal and that the replace directives for k8s.io/* modules (e.g. replace k8s.io/api => github.com/openshift/kubernetes/staging/src/k8s.io/api ...) are the authoritative source used for OpenShift compatibility; place the comment adjacent to the existing replace block and mention the specific modules (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, etc.) so future maintainers understand the intent.
14-14: Check the placeholdertest/e2epseudo-version is intentional.Requiring
github.com/ovn-kubernetes/ovn-kubernetes/test/e2e v0.0.0-00010101000000-000000000000is often used alongside a localreplace(here:=> ../test/e2e). If this pattern is intentional, it’s fine; otherwise, consider tightening it to avoid confusion and to reduce surprises when runninggo mod tidy/go list.Also applies to: 224-230
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openshift/go.mod` at line 14, The go.mod entry using the pseudo-version for github.com/ovn-kubernetes/ovn-kubernetes/test/e2e looks like an intentional local override pattern; verify whether you meant to depend on a local module via a replace or on a published tag. If you intend a local module, add or keep a matching replace directive pointing to ../test/e2e (or the correct relative path) so the pseudo-version is explicit and tidy won't replace it; if you intended to track a released version, replace the pseudo-version with the correct module tag/commit SHA. Also apply the same verification/fix for the other occurrences mentioned (lines 224-230).test/e2e/util.go (1)
2118-2118: Docstring overstates behavior (“force-deletes”).Line 2118 says force-deletes, but this path performs delete + wait, not force deletion semantics. Consider tightening wording to avoid confusion.
Suggested wording fix
-// DeletePodWithWaitByName force-deletes the named pod in the given namespace and waits for it to disappear. +// DeletePodWithWaitByName deletes the named pod in the given namespace and waits for it to disappear.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/util.go` at line 2118, The docstring for DeletePodWithWaitByName incorrectly says it "force-deletes" the pod; update the comment to accurately reflect the behavior (it performs a normal delete and waits for the pod to disappear) — e.g., change "force-deletes the named pod" to "deletes the named pod and waits for it to disappear" or similar wording to remove the implication of force-delete semantics.test/e2e/infraprovider/providers/kind/kind.go (1)
158-163: Wrap start failure with reboot context.Line 162 returns the raw
StartContainererror; adding node/action context will improve failure diagnosis.Suggested diff
func (k *kind) RebootNode(nodeName string) error { if err := k.engine.StopContainer(nodeName); err != nil { return fmt.Errorf("failed to stop KinD node %q during reboot: %w", nodeName, err) } - return k.engine.StartContainer(nodeName) + if err := k.engine.StartContainer(nodeName); err != nil { + return fmt.Errorf("failed to start KinD node %q during reboot: %w", nodeName, err) + } + return nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/infraprovider/providers/kind/kind.go` around lines 158 - 163, The RebootNode function currently returns the raw error from k.engine.StartContainer(nodeName); instead, wrap that error with reboot context to improve diagnostics: after calling k.engine.StartContainer(nodeName) capture its error and, if non-nil, return fmt.Errorf("failed to start KinD node %q during reboot: %w", nodeName, err). Keep the existing StopContainer error handling as-is (it already wraps), and update the return in RebootNode to wrap StartContainer failures using the same pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openshift/pkg/infraprovider/openshift.go`:
- Around line 174-179: RebootNode currently discards any error from the reboot
initiation; change it to surface initiation failures by invoking
ExecK8NodeCommand in a goroutine that sends its error on a channel, then select
on that channel (with a short timeout) to return any non-nil error to the
caller; keep the actual reboot fire-and-forget semantics if the command succeeds
or if you purposely ignore a late failure, but do not always return nil.
Reference the RebootNode method and the ExecK8NodeCommand call when making this
change.
In `@openshift/test/evpn.go`:
- Around line 171-174: The test currently falls back by assigning
otherWorkerNode = targetWorkerNode which breaks topology; instead ensure two
distinct worker nodes are selected and fail the test if a second distinct worker
isn’t available: when discovering nodes, replace the fallback assignment with a
check that otherWorkerNode is non-empty and not equal to targetWorkerNode (use
gomega.Expect(otherWorkerNode).NotTo(gomega.BeEmpty(), ...) and
gomega.Expect(otherWorkerNode).NotTo(gomega.Equal(targetWorkerNode), ...));
ensure TestPod and otherPod are left targeted to those distinct nodes so the
node-restart path recreates both pods correctly.
In `@test/e2e/evpn.go`:
- Around line 909-918: The code is hardcoding "/32" when adding VTEP loopback
addresses via infraprovider.Get().ExecK8NodeCommand which breaks IPv6; replace
the fixed "/32" with a family-specific host prefix (e.g., use "/32" for IPv4 and
"/128" for IPv6) by checking the IP family (ip.To4() or equivalent) and append
the correct suffix when building the addr string passed to ExecK8NodeCommand;
also apply the same hostAddr/prefix selection logic to the host-cidrs polling
and cleanup paths so adding, polling and removal use the same family-specific
prefix consistently (update the helper that constructs the loopback addr and any
code paths that currently assume "/32").
---
Nitpick comments:
In `@openshift/go.mod`:
- Around line 234-254: Add a brief clarifying comment near the go.mod replace
block explaining that the top-level require pins (e.g. k8s.io/api v0.35.1) are
nominal and that the replace directives for k8s.io/* modules (e.g. replace
k8s.io/api => github.com/openshift/kubernetes/staging/src/k8s.io/api ...) are
the authoritative source used for OpenShift compatibility; place the comment
adjacent to the existing replace block and mention the specific modules
(k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, etc.) so future maintainers
understand the intent.
- Line 14: The go.mod entry using the pseudo-version for
github.com/ovn-kubernetes/ovn-kubernetes/test/e2e looks like an intentional
local override pattern; verify whether you meant to depend on a local module via
a replace or on a published tag. If you intend a local module, add or keep a
matching replace directive pointing to ../test/e2e (or the correct relative
path) so the pseudo-version is explicit and tidy won't replace it; if you
intended to track a released version, replace the pseudo-version with the
correct module tag/commit SHA. Also apply the same verification/fix for the
other occurrences mentioned (lines 224-230).
In `@test/e2e/infraprovider/providers/kind/kind.go`:
- Around line 158-163: The RebootNode function currently returns the raw error
from k.engine.StartContainer(nodeName); instead, wrap that error with reboot
context to improve diagnostics: after calling k.engine.StartContainer(nodeName)
capture its error and, if non-nil, return fmt.Errorf("failed to start KinD node
%q during reboot: %w", nodeName, err). Keep the existing StopContainer error
handling as-is (it already wraps), and update the return in RebootNode to wrap
StartContainer failures using the same pattern.
In `@test/e2e/util.go`:
- Line 2118: The docstring for DeletePodWithWaitByName incorrectly says it
"force-deletes" the pod; update the comment to accurately reflect the behavior
(it performs a normal delete and waits for the pod to disappear) — e.g., change
"force-deletes the named pod" to "deletes the named pod and waits for it to
disappear" or similar wording to remove the implication of force-delete
semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 5d2e258b-4a45-4861-b49b-d72213144640
⛔ Files ignored due to path filters (1)
openshift/pkg/generated/zz_generated.annotations.gois excluded by!**/generated/**
📒 Files selected for processing (24)
openshift/cmd/annotate/main.goopenshift/cmd/ovn-kubernetes-tests-ext/main.goopenshift/cmd/ovn-kubernetes-tests-ext/provider.goopenshift/go.modopenshift/hack/update-tests-annotation.shopenshift/pkg/annotate/annotate.goopenshift/pkg/annotate/rules.goopenshift/pkg/deploymentconfig/openshift.goopenshift/pkg/e2e_test.goopenshift/pkg/infraprovider/baremetal.goopenshift/pkg/infraprovider/openshift.goopenshift/pkg/namespace.goopenshift/pkg/tests.goopenshift/test/evpn.gotest/e2e/deploymentconfig/api/api.gotest/e2e/deploymentconfig/configs/kind/kind.gotest/e2e/e2e.gotest/e2e/evpn.gotest/e2e/infraprovider/api/api.gotest/e2e/infraprovider/providers/kind/kind.gotest/e2e/network_segmentation.gotest/e2e/network_segmentation_utils.gotest/e2e/route_advertisements.gotest/e2e/util.go
Clarify in docs that ShutdownNode keeps the node off until StartNode, and add RebootNode for a full stop+start cycle without a separate StartNode. KIND implements RebootNode by stopping and starting the node container. Signed-off-by: Jean Chen <jechen@redhat.com>
Expose the frr-k8s DaemonSet name (e.g. frr-k8s-daemon on KIND) so tests can wait for rollout after FRR-K8s pod restarts. Signed-off-by: Jean Chen <jechen@redhat.com>
Export small wrappers used by the OpenShift EVPN work: IP/annotation helpers, subnet filters, ovnkube-node restart functions, and node readiness/ GW checks, without duplicating e2e internals in another module. Signed-off-by: Jean Chen <jechen@redhat.com>
476eacd to
cb362e6
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
openshift/test/evpn.go (1)
336-340: Avoid unconditional 10-minute sleep on BGP failure.This can heavily inflate serial-lane runtime on failures; consider gating it behind a debug flag.
💡 Suggested refactor
import ( "context" "fmt" "net" + "os" "time" @@ - if bgpErr != nil { + if bgpErr != nil { ginkgo.GinkgoLogr.Info("BGP EVPN sessions not established — pausing 10 minutes for debugging before cleanup") ginkgo.GinkgoLogr.Info("Inspect with: oc get frrconfiguration -n openshift-frr-k8s -o yaml") ginkgo.GinkgoLogr.Info("Inspect with: podman exec frr vtysh -c 'show bgp l2vpn evpn summary'") - time.Sleep(10 * time.Minute) + if os.Getenv("EVPN_DEBUG_PAUSE_ON_FAILURE") == "true" { + time.Sleep(10 * time.Minute) + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openshift/test/evpn.go` around lines 336 - 340, The unconditional 10-minute time.Sleep after logging BGP failure should be gated behind a debug flag; replace the direct time.Sleep(10 * time.Minute) in the evpn failure path with a conditional that checks a boolean (e.g., debugPauseOnBGP or an env var like DEBUG_BGP_PAUSE) before sleeping, log that the pause is enabled and its duration, and ensure the flag defaults to false so CI/serial lanes are not delayed; update references around ginkgo.GinkgoLogr.Info and the time.Sleep call accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openshift/test/evpn.go`:
- Around line 191-193: The three calls to e2e.RandomVTEPSubnets() assign the
error return to blank and thus discard failures; capture the error for each call
(e.g., use l3VtepV4, err := e2e.RandomVTEPSubnets()) and handle it immediately —
in this test function return or fail the test (e.g., t.Fatalf or t.Fatal) with a
clear message including the error and which VTEP (l3VtepV4, l2VtepV4,
l2l3VtepV4) failed, so invalid/empty CIDRs are not propagated later from
RandomVTEPSubnets.
In `@test/e2e/route_advertisements.go`:
- Around line 3024-3027: The shared EVPN mode case (cudnAdvertisedEVPNShared)
sets targetVRF and networkLabels but selects FRRConfiguration via
frrConfigurationLabels = map[string]string{"network": testName}, which won't
match the shared FRR config; update the EVPN infra setup code that creates
FRRConfigurations to handle the cudnAdvertisedEVPNShared branch and pass/label
the shared FRR config name, and change the selector here
(frrConfigurationLabels) to reference the shared FRR configuration key (e.g.,
use the shared FRR config name or label like {"network": networkName} or
{"shared-frr": testName}) so RouteAdvertisements can match the shared
FRRConfiguration; ensure the symbol cudnAdvertisedEVPNShared and
frrConfigurationLabels (and any FRR config creation function in the EVPN infra
path) are updated consistently.
---
Nitpick comments:
In `@openshift/test/evpn.go`:
- Around line 336-340: The unconditional 10-minute time.Sleep after logging BGP
failure should be gated behind a debug flag; replace the direct time.Sleep(10 *
time.Minute) in the evpn failure path with a conditional that checks a boolean
(e.g., debugPauseOnBGP or an env var like DEBUG_BGP_PAUSE) before sleeping, log
that the pause is enabled and its duration, and ensure the flag defaults to
false so CI/serial lanes are not delayed; update references around
ginkgo.GinkgoLogr.Info and the time.Sleep call accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 3b17c4e7-0a22-4423-a019-be7b1028a2e7
⛔ Files ignored due to path filters (1)
openshift/pkg/generated/zz_generated.annotations.gois excluded by!**/generated/**
📒 Files selected for processing (24)
openshift/cmd/annotate/main.goopenshift/cmd/ovn-kubernetes-tests-ext/main.goopenshift/cmd/ovn-kubernetes-tests-ext/provider.goopenshift/go.modopenshift/hack/update-tests-annotation.shopenshift/pkg/annotate/annotate.goopenshift/pkg/annotate/rules.goopenshift/pkg/deploymentconfig/openshift.goopenshift/pkg/e2e_test.goopenshift/pkg/infraprovider/baremetal.goopenshift/pkg/infraprovider/openshift.goopenshift/pkg/namespace.goopenshift/pkg/tests.goopenshift/test/evpn.gotest/e2e/deploymentconfig/api/api.gotest/e2e/deploymentconfig/configs/kind/kind.gotest/e2e/e2e.gotest/e2e/evpn.gotest/e2e/infraprovider/api/api.gotest/e2e/infraprovider/providers/kind/kind.gotest/e2e/network_segmentation.gotest/e2e/network_segmentation_utils.gotest/e2e/route_advertisements.gotest/e2e/util.go
✅ Files skipped from review due to trivial changes (3)
- openshift/hack/update-tests-annotation.sh
- openshift/pkg/annotate/rules.go
- test/e2e/network_segmentation_utils.go
🚧 Files skipped from review as they are similar to previous changes (11)
- test/e2e/deploymentconfig/api/api.go
- test/e2e/deploymentconfig/configs/kind/kind.go
- openshift/pkg/deploymentconfig/openshift.go
- openshift/cmd/annotate/main.go
- openshift/cmd/ovn-kubernetes-tests-ext/provider.go
- openshift/pkg/infraprovider/openshift.go
- test/e2e/e2e.go
- test/e2e/infraprovider/api/api.go
- test/e2e/network_segmentation.go
- test/e2e/util.go
- test/e2e/evpn.go
Support multiple EVPN namespaces sharing one FRRConfiguration: RouteAdvertisements
selectors use a per-network advertise label while FRRConfiguration uses {network: testName}.
Signed-off-by: Jean Chen <jechen@redhat.com>
- Make FRR setup idempotent: apply FRRConfiguration, write memory, tolerate cleanup errors, idempotent VTEP loopback addresses - Use shared FRRConfiguration for multiple EVPN sessions - Disruptive flow: EVPNDisruptiveState, destroy/reapply kernel state on external FRR, FRR daemon restart, FRR-k8s and ovnkube-node/DS checks, BGP and VNI verification (parse show evpn vni json; L2 numRemoteVteps, L3 numArpNd) - Export SetupEVPNNetworkWithServers, network spec factories, and pod reachability helpers; rename to RestartExternalFRRDaemons Signed-off-by: Jean Chen <jechen@redhat.com>
Relocate shared OpenShift test registration (lists, generated annotations, annotate rules, deploymentconfig and infraprovider shims) from openshift/test/ to openshift/pkg/ and update annotate, go generate, and ovn-kubernetes-tests-ext imports. Add openshift/test/evpn.go: multi-VPN EVPN coverage with node, OVN-K, FRR-K8s, and external FRR (spine) disruption using the shared helpers in test/e2e. Signed-off-by: Jean Chen <jechen@redhat.com>
cb362e6 to
ff5558e
Compare
|
@jechen0648: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
📑 Description
Summary
Adds an OpenShift-only EVPN disruptive e2e test suite (openshift/test/evpn.go) that runs three EVPN topologies (L3 IP-VRF, L2 MAC-VRF, L2 MAC-VRF+IP-VRF) with random VTEP subnets, checks connectivity and isolation, then validates recovery after node reboot, OVN-K restart, FRR-K8s restart, and external FRR (“spine”) failure/restart. Core logic lives in reusable helpers under test/e2e/evpn.go; the OpenShift test harness is moved from openshift/test/ to openshift/pkg/ so openshift/test/ can stay thin test entrypoints.
What changed
test/e2e
ClusterProvider.RebootNode — implements node reboot
DeploymentConfig.FRRK8sDaemonSetName — Lets tests wait on the correct FRR-K8s DaemonSet after restarts when it is KIND vs OpenShift cluster
Exports for OpenShift tests — Small wrappers (IPs, annotations, subnet filtering, ovnkube-node restart, node/GW checks) so OpenShift tests do not duplicate e2e internals.
CUDN_ADVERTISED_EVPNShared — RouteAdvertisement / network type support so multiple EVPN namespaces can share one FRRConfiguration (per-network advertise labels, shared network: testName selector on FRR).
EVPN helper expansion — Idempotent FRR setup, shared FRRConfiguration for multiple sessions, EVPNDisruptiveState, destroy/reapply kernel state on external FRR, RestartExternalFRRDaemons (daemon restart inside the spine container, not Docker-level restart), BGP + VNI checks using show evpn vni JSON (L2: numRemoteVteps; L3: numArpNd), exported SetupEVPNNetworkWithServers / EVPNExternalSetupIDs (real spine VNIs/subnets for anti-flake checks), network factories, reachability helpers.
updates on route_advertisements.go, network_segmentation*.go, util.go, e2e.go adjusted for new APIs and shared RA network type.
openshift/
Harness move — Shared registration, annotations, deploymentconfig/infraprovider shims, and generated files move openshift/test/ → openshift/pkg/; annotate, go generate, and ovn-kubernetes-tests-ext imports updated.
openshift/test/evpn.go — setup EVPN sessions, then do baseline connectivity check(pod↔external, isolation between VPNs, node↔pod negative checks, KAPI healthz), then four It blocks do disruptive actions:1)node restart (including VTEP loopback restore and pod recreate), 2)OVN-K restart, 3)FRR-K8s restart, 4) spine restart (destroy kernel state, restart FRR daemons, wait for process, re-apply kernel state), then verify connectivity again after disruptive action.
Why
Exercise EVPN data and control paths under realistic failures on OpenShift while keeping upstream test/e2e helpers reusable for both KIND and OpenShift extended tests.
Testing
OpenShift extended EVPN suite
KIND EVPN coverage remains compatible via updated runEVPNNetworkAndServers / RA wiring where touched.
Fixes #
Additional Information for reviewers
✅ Checks
How to verify it
Summary by CodeRabbit
New Features
Tests
Chores