Fix egress firewall ACL deletion security vulnerability#3144
Fix egress firewall ACL deletion security vulnerability#3144smulje wants to merge 1 commit intoopenshift:release-4.20from
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note
|
| Cohort / File(s) | Summary |
|---|---|
Event Routing go-controller/pkg/ovn/base_event_handler.go |
Added factory.EgressFirewallType to types routed to the dedicated update-function path (prevents fallback to delete-old + add-new). |
Update Handler Implementation go-controller/pkg/ovn/default_network_controller.go |
Added EgressFirewall-specific UpdateResource logic: deep-copy new object, validate/build constructs first, update status on validation failure, remove old constructs and metrics then add validated new constructs on success. |
Validation Refactor go-controller/pkg/ovn/egressfirewall.go |
Refactored addEgressFirewall into a pure build/validate step buildEgressFirewallConstruct, plus addEgressFirewallWithConstruct that performs locking, cache update, and rule programming only after validation succeeds. |
Build & Images Dockerfile.acl-fix-golang, Dockerfile.customer-build, Dockerfile.customer-build-fedora, Dockerfile.customer-build-working |
Added several new multi-stage Dockerfiles to build go-controller binaries (various builder bases) and produce runtime images with labels and correct permissions. |
CI / Tooling Config .coderabbit.yml |
Updated CodeRabbit configuration: added schema and reviews settings, moved vendor ignore into path filters, enabled auto-reply and review behaviors. |
Sequence Diagram(s)
sequenceDiagram
participant Event as Event System
participant Router as Base Handler (Router)
participant Controller as Default Network Controller
participant Firewall as Egress Firewall Logic
participant Status as Status Updater
Event->>Router: Update EgressFirewall event
activate Router
Router->>Router: hasResourceAnUpdateFunc? (EgressFirewall)
Router->>Controller: UpdateResource(new, old)
deactivate Router
activate Controller
Controller->>Firewall: buildEgressFirewallConstruct(new)
activate Firewall
Firewall->>Firewall: validate/build rules
alt Validation fails
Firewall-->>Controller: error
Controller->>Status: update status (validation failed)
Controller-->>Event: return (no mutation)
else Validation succeeds
Firewall-->>Controller: constructed ef object
Controller->>Firewall: deleteEgressFirewall(old)
Controller->>Firewall: addEgressFirewallWithConstruct(new, constructed)
Controller->>Status: update status (success)
Controller->>Controller: update metrics
end
deactivate Firewall
deactivate Controller
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
🐰 A hop, a build, then rules I test,
Validate first — no premature quest.
Old fences fall, new paths take flight,
Metrics tally softly through the night.
The rabbit guards changes — tidy and right.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly and specifically describes the main change: fixing a security vulnerability in egress firewall ACL deletion. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
- Create PR with unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Comment @coderabbitai help to get the list of available commands and usage tips.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: smulje The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @smulje. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
f5581e1 to
8a05215
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go-controller/pkg/ovn/default_network_controller.go`:
- Around line 1104-1116: The code currently calls
buildEgressFirewallConstruct(...) and later calls addEgressFirewall(...) which
rebuilds the construct again, allowing a TOCTOU window when
deleteEgressFirewall(...) removes old ACLs; to fix, change the update path to
validate once and pass the validated construct into an add helper that skips
rebuild: after successful buildEgressFirewallConstruct(newEgressFirewall) use a
new private method addEgressFirewallLocked(newEgressFirewall,
validatedConstruct) (or an overload of addEgressFirewall that accepts the
constructed result) and call deleteEgressFirewall(oldEgressFirewall) only after
you have the validatedConstruct and then call addEgressFirewallLocked to install
rules, and ensure setEgressFirewallStatus(...) still receives errors from the
single build; update addEgressFirewall (or add a new addEgressFirewallLocked) to
accept and use the provided construct instead of calling
buildEgressFirewallConstruct again to avoid redundant node lookups and eliminate
the race.
- Around line 1097-1125: The update path is double-applying metrics because it
bypasses the centralized Add/Delete metric paths and then also calls
metrics.UpdateEgressFirewallRuleCount again; fix by replacing the direct calls
to deleteEgressFirewall(oldEgressFirewall) and
addEgressFirewall(newEgressFirewall) with the existing metric-aware helpers
(call DeleteResource(oldEgressFirewall) and AddResource(newEgressFirewall) or
the controller methods that wrap metrics), and remove the final
metrics.UpdateEgressFirewallRuleCount(len(new.Spec.Egress)-len(old.Spec.Egress))
call so metrics are only updated by the standard Add/Delete flows and by
setEgressFirewallStatus(new, err) as designed. Ensure references to
deleteEgressFirewall, addEgressFirewall, setEgressFirewallStatus, and
metrics.UpdateEgressFirewallRuleCount are updated accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7ef88023-04b9-4003-8b6e-7535fb2abd87
📒 Files selected for processing (3)
go-controller/pkg/ovn/base_event_handler.gogo-controller/pkg/ovn/default_network_controller.gogo-controller/pkg/ovn/egressfirewall.go
2efd854 to
1e3db44
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
Dockerfile.acl-fix-golang (1)
23-29: Avoid hardcoding the fix commit in image metadata.Line 29 will go stale if this draft PR is amended or rebased. Inject the source revision at build time so audits reflect the actual image contents.
Suggested build-arg label
+ARG FIX_COMMIT=unknown + # Document the fix LABEL io.k8s.display-name="ovn kubernetes with ACL security fix" \ io.k8s.description="EgressFirewall ACL Security Fix (OCPBUGS-78765)" \ io.openshift.tags="openshift" \ maintainer="smulje@redhat.com" \ fix="Validates BEFORE deleting ACLs, fixes TOCTOU, fixes metrics double-counting" \ - commit="1e3db44fa0bbbcc11abd8c32c3edf724a10f7bca" + commit="${FIX_COMMIT}"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile.acl-fix-golang` around lines 23 - 29, The Dockerfile hardcodes a commit value in the LABEL (commit="1e3db44...") which will go stale; change the LABEL to consume a build ARG instead: add an ARG like SOURCE_COMMIT (default empty or "unknown"), replace the literal commit label with commit="${SOURCE_COMMIT}", and document using docker build --build-arg SOURCE_COMMIT=$(git rev-parse --short HEAD) (or your CI variable) so the image metadata reflects the actual revision; update the LABEL statement that references commit and ensure ARG SOURCE_COMMIT is declared before the LABEL.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Dockerfile.acl-fix-golang`:
- Around line 3-13: The Dockerfile currently builds a CGO-enabled binary with
the Debian-based golang:1.24 builder and then runs it in an OCP/RHEL runtime
(the RUN cd go-controller && CGO_ENABLED=1 make step produces a binary
incompatible with the FROM quay.io/openshift... base); replace the builder stage
with an OCP/RHEL-compatible Go builder image (use the standard OCP builder that
matches RHEL 9) so the compiled binary links against the same glibc, or if you
must keep golang:1.24, make the apt-get install line use --no-install-recommends
and ensure the toolchain and libc versions are aligned to RHEL 9 to avoid GLIBC
version mismatches when running the binary produced by RUN cd go-controller &&
CGO_ENABLED=1 make.
In `@Dockerfile.customer-build-fedora`:
- Around line 3-12: The builder image must be changed from Fedora to a RHEL-9 /
UBI9 builder so CGO-built binaries (CGO_ENABLED=1 in the go-controller make
step) link against the same glibc version used by the OpenShift 4.20 runtime;
replace the first stage FROM line (currently "FROM fedora:42 AS builder") with a
RHEL9/UBI9 builder image (for example a UBI9 base like
registry.access.redhat.com/ubi9/ubi or another RHEL-9 compatible builder), keep
the WORKDIR/COPY and the CGO_ENABLED=1 make invocation in go-controller, and
ensure the rest of the Dockerfile still uses the runtime "FROM
quay.io/openshift/origin-ovn-kubernetes:4.20".
In `@Dockerfile.customer-build-working`:
- Around line 36-37: The RUN chmod -R g=u /usr/bin /usr/libexec/cni line is too
broad and makes all /usr/bin group-writable; instead scope permissions to only
the copied/hotfix binaries or enforce minimal executable/read bits for those
files: identify the specific artifacts you add to /usr/bin and /usr/libexec/cni
and apply permissions only to those paths (instead of the global /usr/bin), or
change the operation to set group read/execute (not write) on the installed
binaries; update the Dockerfile by replacing the global RUN chmod -R g=u
/usr/bin /usr/libexec/cni with a scoped permission step targeting just your
added files.
- Around line 2-21: The final runtime FROM in the Dockerfile points to a
non-existent/incorrect image (quay.io/openshift/origin-ovn-kubernetes:4.18);
update that final FROM line to use the correct OpenShift base image for
release-4.20 (use the repository referenced in Dockerfile.base), e.g. replace
the final FROM with registry.ci.openshift.org/ocp/4.21:base-rhel9 so the runtime
stage matches the expected OCP base; leave the builder stage (FROM fedora:42 AS
builder) and the CGO_ENABLED=1 make step unchanged.
---
Nitpick comments:
In `@Dockerfile.acl-fix-golang`:
- Around line 23-29: The Dockerfile hardcodes a commit value in the LABEL
(commit="1e3db44...") which will go stale; change the LABEL to consume a build
ARG instead: add an ARG like SOURCE_COMMIT (default empty or "unknown"), replace
the literal commit label with commit="${SOURCE_COMMIT}", and document using
docker build --build-arg SOURCE_COMMIT=$(git rev-parse --short HEAD) (or your CI
variable) so the image metadata reflects the actual revision; update the LABEL
statement that references commit and ensure ARG SOURCE_COMMIT is declared before
the LABEL.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro Plus
Run ID: fd5bcf4b-8ea4-46dc-84bd-bc565602c8bb
📒 Files selected for processing (7)
.coderabbit.ymlDockerfile.acl-fix-golangDockerfile.customer-buildDockerfile.customer-build-fedoraDockerfile.customer-build-workinggo-controller/pkg/ovn/default_network_controller.gogo-controller/pkg/ovn/egressfirewall.go
🚧 Files skipped from review as they are similar to previous changes (2)
- go-controller/pkg/ovn/default_network_controller.go
- go-controller/pkg/ovn/egressfirewall.go
Egress firewall updates were deleting ACLs before validating new rules, creating a security gap where namespaces would be left without egress protection if validation failed. Changes in this commit: 1. Added buildEgressFirewallConstruct() to validate rules before making any database changes 2. Added UpdateResource handler for EgressFirewall that: - Validates new rules BEFORE deleting old ACLs - Only proceeds with update if validation succeeds - Preserves existing ACLs if validation fails - Fixes TOCTOU by validating once and reusing the construct - Fixes metrics double-counting 3. Registered EgressFirewall in hasResourceAnUpdateFunc() to use the UpdateResource path instead of delete-then-add Result: Invalid EgressFirewall updates now preserve existing ACLs instead of deleting them, keeping the namespace protected. Fixes: OCPBUGS-78765 Signed-off-by: Swati Mulje <smulje@redhat.com>
35a91a5 to
4b15ddc
Compare
|
@coderabbitai review |
|
Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit. |
4b15ddc to
d14551f
Compare
|
Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit. |
d14551f to
35a91a5
Compare
Summary
This PR fixes a critical security vulnerability where EgressFirewall updates could leave namespaces without egress protection.
Problem
When an EgressFirewall is updated with invalid rules, the existing code deleted ACLs before validation. If validation failed, the namespace was left without egress firewall protection, allowing unrestricted egress traffic (fail-open instead of fail-closed).
Solution
This PR adds an UpdateResource handler that validates new rules BEFORE deleting existing ACLs:
Result
Invalid updates now preserve existing ACLs instead of deleting them, ensuring the namespace remains protected (fail-closed).
Changes
go-controller/pkg/ovn/base_event_handler.go- Register EgressFirewall update handlergo-controller/pkg/ovn/default_network_controller.go- Add UpdateResource casego-controller/pkg/ovn/egressfirewall.go- Add validation functionTechnical Details
Vulnerable Code Flow (4.19/4.20):
If validation fails after deletion, no ACLs are present, resulting in unrestricted egress traffic.
Fixed Code Flow (4.21):
If validation fails, the operation terminates before modifying existing ACLs, preserving the current security posture.
Associated Bug
https://redhat.atlassian.net/browse/OCPBUGS-78765
Upstream Fix
Upstream PR: ovn-kubernetes/ovn-kubernetes#5688
Commit: ecabc89
The fix has been confirmed in the OpenShift downstream repository:
Commit history in release-4.21:
https://github.com/openshift/ovn-kubernetes/commits/release-4.21/go-controller/pkg/ovn/controller/egressfirewall/egressfirewall.go
Direct commit link:
ecabc89
Testing
Backport Conflict
OCP 4.21: Uses a dedicated EFController with new features like caching (cacheEntry), DNS resolvers, network manager interfaces (for UDN support), and different function signatures for ACL building and status updates.
Summary by CodeRabbit