Skip to content

Fix egress firewall ACL deletion security vulnerability#3144

Open
smulje wants to merge 1 commit intoopenshift:release-4.20from
smulje:egress-firewall-acl-security-fix
Open

Fix egress firewall ACL deletion security vulnerability#3144
smulje wants to merge 1 commit intoopenshift:release-4.20from
smulje:egress-firewall-acl-security-fix

Conversation

@smulje
Copy link
Copy Markdown

@smulje smulje commented Apr 16, 2026

Summary

This PR fixes a critical security vulnerability where EgressFirewall updates could leave namespaces without egress protection.

Problem

When an EgressFirewall is updated with invalid rules, the existing code deleted ACLs before validation. If validation failed, the namespace was left without egress firewall protection, allowing unrestricted egress traffic (fail-open instead of fail-closed).

Solution

This PR adds an UpdateResource handler that validates new rules BEFORE deleting existing ACLs:

  1. buildEgressFirewallConstruct() - validates rules before any DB changes
  2. UpdateResource handler - validates BEFORE deleting ACLs on updates
  3. Register EgressFirewall in hasResourceAnUpdateFunc()

Result

Invalid updates now preserve existing ACLs instead of deleting them, ensuring the namespace remains protected (fail-closed).

Changes

  • go-controller/pkg/ovn/base_event_handler.go - Register EgressFirewall update handler
  • go-controller/pkg/ovn/default_network_controller.go - Add UpdateResource case
  • go-controller/pkg/ovn/egressfirewall.go - Add validation function

Technical Details

Vulnerable Code Flow (4.19/4.20):

Delete existing ACLs → Validate new rules → Create new ACLs (if validation passes)

If validation fails after deletion, no ACLs are present, resulting in unrestricted egress traffic.

Fixed Code Flow (4.21):

Validate new rules → Create new ACLs → Delete stale ACLs (if creation succeeds)

If validation fails, the operation terminates before modifying existing ACLs, preserving the current security posture.

Associated Bug

https://redhat.atlassian.net/browse/OCPBUGS-78765

Upstream Fix

Upstream PR: ovn-kubernetes/ovn-kubernetes#5688
Commit: ecabc89

The fix has been confirmed in the OpenShift downstream repository:

  1. Commit history in release-4.21:
    https://github.com/openshift/ovn-kubernetes/commits/release-4.21/go-controller/pkg/ovn/controller/egressfirewall/egressfirewall.go

  2. Direct commit link:
    ecabc89

Testing

  1. Create a valid EgressFirewall with deny-all rule
  2. Update it with an invalid CIDR (e.g., "random-text")
  3. Verify: Old ACLs remain in place (namespace still protected)
  4. Verify: Status shows "EgressFirewall Rules not correctly applied"
  5. Verify: Traffic that was blocked remains blocked

Backport Conflict

  1. Different controller architectures
  2. OCP 4.20: Uses a monolithic DefaultNetworkController.
    OCP 4.21: Uses a dedicated EFController with new features like caching (cacheEntry), DNS resolvers, network manager interfaces (for UDN support), and different function signatures for ACL building and status updates.

Summary by CodeRabbit

  • Bug Fixes
    • Egress firewall updates now validate new rules before applying changes to avoid partial or inconsistent configurations; invalid rules are rejected and prior state preserved.
  • Refactor
    • Egress firewall processing separated into a build/validation phase and a subsequent apply phase to improve reliability and error handling.
  • Chores
    • Added several Docker build files for different builder/runtime workflows and updated CI/configuration settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 16, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: b2b32d56-ba0f-4f5c-89ea-bbbabb2877a5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Note

.coderabbit.yml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key(s) in object: 'paths_ignore'
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
📝 Walkthrough

Walkthrough

Event routing and update handling for EgressFirewall resources were added and refactored: the base event router now routes EgressFirewall updates to a dedicated update path; the default network controller performs validation-first update logic; egress firewall construction/validation was extracted into a separate build function to avoid side effects before validation.

Changes

Cohort / File(s) Summary
Event Routing
go-controller/pkg/ovn/base_event_handler.go
Added factory.EgressFirewallType to types routed to the dedicated update-function path (prevents fallback to delete-old + add-new).
Update Handler Implementation
go-controller/pkg/ovn/default_network_controller.go
Added EgressFirewall-specific UpdateResource logic: deep-copy new object, validate/build constructs first, update status on validation failure, remove old constructs and metrics then add validated new constructs on success.
Validation Refactor
go-controller/pkg/ovn/egressfirewall.go
Refactored addEgressFirewall into a pure build/validate step buildEgressFirewallConstruct, plus addEgressFirewallWithConstruct that performs locking, cache update, and rule programming only after validation succeeds.
Build & Images
Dockerfile.acl-fix-golang, Dockerfile.customer-build, Dockerfile.customer-build-fedora, Dockerfile.customer-build-working
Added several new multi-stage Dockerfiles to build go-controller binaries (various builder bases) and produce runtime images with labels and correct permissions.
CI / Tooling Config
.coderabbit.yml
Updated CodeRabbit configuration: added schema and reviews settings, moved vendor ignore into path filters, enabled auto-reply and review behaviors.

Sequence Diagram(s)

sequenceDiagram
    participant Event as Event System
    participant Router as Base Handler (Router)
    participant Controller as Default Network Controller
    participant Firewall as Egress Firewall Logic
    participant Status as Status Updater

    Event->>Router: Update EgressFirewall event
    activate Router
    Router->>Router: hasResourceAnUpdateFunc? (EgressFirewall)
    Router->>Controller: UpdateResource(new, old)
    deactivate Router

    activate Controller
    Controller->>Firewall: buildEgressFirewallConstruct(new)
    activate Firewall
    Firewall->>Firewall: validate/build rules
    alt Validation fails
        Firewall-->>Controller: error
        Controller->>Status: update status (validation failed)
        Controller-->>Event: return (no mutation)
    else Validation succeeds
        Firewall-->>Controller: constructed ef object
        Controller->>Firewall: deleteEgressFirewall(old)
        Controller->>Firewall: addEgressFirewallWithConstruct(new, constructed)
        Controller->>Status: update status (success)
        Controller->>Controller: update metrics
    end
    deactivate Firewall
    deactivate Controller
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A hop, a build, then rules I test,
Validate first — no premature quest.
Old fences fall, new paths take flight,
Metrics tally softly through the night.
The rabbit guards changes — tidy and right.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and specifically describes the main change: fixing a security vulnerability in egress firewall ACL deletion.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: smulje
Once this PR has been reviewed and has the lgtm label, please assign kyrtapz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 16, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 16, 2026

Hi @smulje. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@smulje smulje force-pushed the egress-firewall-acl-security-fix branch from f5581e1 to 8a05215 Compare April 16, 2026 10:53
@smulje
Copy link
Copy Markdown
Author

smulje commented Apr 23, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@go-controller/pkg/ovn/default_network_controller.go`:
- Around line 1104-1116: The code currently calls
buildEgressFirewallConstruct(...) and later calls addEgressFirewall(...) which
rebuilds the construct again, allowing a TOCTOU window when
deleteEgressFirewall(...) removes old ACLs; to fix, change the update path to
validate once and pass the validated construct into an add helper that skips
rebuild: after successful buildEgressFirewallConstruct(newEgressFirewall) use a
new private method addEgressFirewallLocked(newEgressFirewall,
validatedConstruct) (or an overload of addEgressFirewall that accepts the
constructed result) and call deleteEgressFirewall(oldEgressFirewall) only after
you have the validatedConstruct and then call addEgressFirewallLocked to install
rules, and ensure setEgressFirewallStatus(...) still receives errors from the
single build; update addEgressFirewall (or add a new addEgressFirewallLocked) to
accept and use the provided construct instead of calling
buildEgressFirewallConstruct again to avoid redundant node lookups and eliminate
the race.
- Around line 1097-1125: The update path is double-applying metrics because it
bypasses the centralized Add/Delete metric paths and then also calls
metrics.UpdateEgressFirewallRuleCount again; fix by replacing the direct calls
to deleteEgressFirewall(oldEgressFirewall) and
addEgressFirewall(newEgressFirewall) with the existing metric-aware helpers
(call DeleteResource(oldEgressFirewall) and AddResource(newEgressFirewall) or
the controller methods that wrap metrics), and remove the final
metrics.UpdateEgressFirewallRuleCount(len(new.Spec.Egress)-len(old.Spec.Egress))
call so metrics are only updated by the standard Add/Delete flows and by
setEgressFirewallStatus(new, err) as designed. Ensure references to
deleteEgressFirewall, addEgressFirewall, setEgressFirewallStatus, and
metrics.UpdateEgressFirewallRuleCount are updated accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7ef88023-04b9-4003-8b6e-7535fb2abd87

📥 Commits

Reviewing files that changed from the base of the PR and between b8b321e and 8a05215.

📒 Files selected for processing (3)
  • go-controller/pkg/ovn/base_event_handler.go
  • go-controller/pkg/ovn/default_network_controller.go
  • go-controller/pkg/ovn/egressfirewall.go

Comment thread go-controller/pkg/ovn/default_network_controller.go
Comment thread go-controller/pkg/ovn/default_network_controller.go Outdated
@smulje smulje marked this pull request as draft April 23, 2026 07:40
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 23, 2026
@smulje smulje force-pushed the egress-firewall-acl-security-fix branch from 2efd854 to 1e3db44 Compare April 23, 2026 07:41
@smulje
Copy link
Copy Markdown
Author

smulje commented Apr 23, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
Dockerfile.acl-fix-golang (1)

23-29: Avoid hardcoding the fix commit in image metadata.

Line 29 will go stale if this draft PR is amended or rebased. Inject the source revision at build time so audits reflect the actual image contents.

Suggested build-arg label
+ARG FIX_COMMIT=unknown
+
 # Document the fix
 LABEL io.k8s.display-name="ovn kubernetes with ACL security fix" \
       io.k8s.description="EgressFirewall ACL Security Fix (OCPBUGS-78765)" \
       io.openshift.tags="openshift" \
       maintainer="smulje@redhat.com" \
       fix="Validates BEFORE deleting ACLs, fixes TOCTOU, fixes metrics double-counting" \
-      commit="1e3db44fa0bbbcc11abd8c32c3edf724a10f7bca"
+      commit="${FIX_COMMIT}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile.acl-fix-golang` around lines 23 - 29, The Dockerfile hardcodes a
commit value in the LABEL (commit="1e3db44...") which will go stale; change the
LABEL to consume a build ARG instead: add an ARG like SOURCE_COMMIT (default
empty or "unknown"), replace the literal commit label with
commit="${SOURCE_COMMIT}", and document using docker build --build-arg
SOURCE_COMMIT=$(git rev-parse --short HEAD) (or your CI variable) so the image
metadata reflects the actual revision; update the LABEL statement that
references commit and ensure ARG SOURCE_COMMIT is declared before the LABEL.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Dockerfile.acl-fix-golang`:
- Around line 3-13: The Dockerfile currently builds a CGO-enabled binary with
the Debian-based golang:1.24 builder and then runs it in an OCP/RHEL runtime
(the RUN cd go-controller && CGO_ENABLED=1 make step produces a binary
incompatible with the FROM quay.io/openshift... base); replace the builder stage
with an OCP/RHEL-compatible Go builder image (use the standard OCP builder that
matches RHEL 9) so the compiled binary links against the same glibc, or if you
must keep golang:1.24, make the apt-get install line use --no-install-recommends
and ensure the toolchain and libc versions are aligned to RHEL 9 to avoid GLIBC
version mismatches when running the binary produced by RUN cd go-controller &&
CGO_ENABLED=1 make.

In `@Dockerfile.customer-build-fedora`:
- Around line 3-12: The builder image must be changed from Fedora to a RHEL-9 /
UBI9 builder so CGO-built binaries (CGO_ENABLED=1 in the go-controller make
step) link against the same glibc version used by the OpenShift 4.20 runtime;
replace the first stage FROM line (currently "FROM fedora:42 AS builder") with a
RHEL9/UBI9 builder image (for example a UBI9 base like
registry.access.redhat.com/ubi9/ubi or another RHEL-9 compatible builder), keep
the WORKDIR/COPY and the CGO_ENABLED=1 make invocation in go-controller, and
ensure the rest of the Dockerfile still uses the runtime "FROM
quay.io/openshift/origin-ovn-kubernetes:4.20".

In `@Dockerfile.customer-build-working`:
- Around line 36-37: The RUN chmod -R g=u /usr/bin /usr/libexec/cni line is too
broad and makes all /usr/bin group-writable; instead scope permissions to only
the copied/hotfix binaries or enforce minimal executable/read bits for those
files: identify the specific artifacts you add to /usr/bin and /usr/libexec/cni
and apply permissions only to those paths (instead of the global /usr/bin), or
change the operation to set group read/execute (not write) on the installed
binaries; update the Dockerfile by replacing the global RUN chmod -R g=u
/usr/bin /usr/libexec/cni with a scoped permission step targeting just your
added files.
- Around line 2-21: The final runtime FROM in the Dockerfile points to a
non-existent/incorrect image (quay.io/openshift/origin-ovn-kubernetes:4.18);
update that final FROM line to use the correct OpenShift base image for
release-4.20 (use the repository referenced in Dockerfile.base), e.g. replace
the final FROM with registry.ci.openshift.org/ocp/4.21:base-rhel9 so the runtime
stage matches the expected OCP base; leave the builder stage (FROM fedora:42 AS
builder) and the CGO_ENABLED=1 make step unchanged.

---

Nitpick comments:
In `@Dockerfile.acl-fix-golang`:
- Around line 23-29: The Dockerfile hardcodes a commit value in the LABEL
(commit="1e3db44...") which will go stale; change the LABEL to consume a build
ARG instead: add an ARG like SOURCE_COMMIT (default empty or "unknown"), replace
the literal commit label with commit="${SOURCE_COMMIT}", and document using
docker build --build-arg SOURCE_COMMIT=$(git rev-parse --short HEAD) (or your CI
variable) so the image metadata reflects the actual revision; update the LABEL
statement that references commit and ensure ARG SOURCE_COMMIT is declared before
the LABEL.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro Plus

Run ID: fd5bcf4b-8ea4-46dc-84bd-bc565602c8bb

📥 Commits

Reviewing files that changed from the base of the PR and between 8a05215 and 743c065.

📒 Files selected for processing (7)
  • .coderabbit.yml
  • Dockerfile.acl-fix-golang
  • Dockerfile.customer-build
  • Dockerfile.customer-build-fedora
  • Dockerfile.customer-build-working
  • go-controller/pkg/ovn/default_network_controller.go
  • go-controller/pkg/ovn/egressfirewall.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • go-controller/pkg/ovn/default_network_controller.go
  • go-controller/pkg/ovn/egressfirewall.go

Comment thread Dockerfile.acl-fix-golang Outdated
Comment thread Dockerfile.customer-build-fedora Outdated
Comment thread Dockerfile.customer-build-working Outdated
Comment thread Dockerfile.customer-build-working Outdated
Egress firewall updates were deleting ACLs before validating new rules,
creating a security gap where namespaces would be left without egress
protection if validation failed.

Changes in this commit:

1. Added buildEgressFirewallConstruct() to validate rules before making
   any database changes

2. Added UpdateResource handler for EgressFirewall that:
   - Validates new rules BEFORE deleting old ACLs
   - Only proceeds with update if validation succeeds
   - Preserves existing ACLs if validation fails
   - Fixes TOCTOU by validating once and reusing the construct
   - Fixes metrics double-counting

3. Registered EgressFirewall in hasResourceAnUpdateFunc() to use the
   UpdateResource path instead of delete-then-add

Result: Invalid EgressFirewall updates now preserve existing ACLs instead
of deleting them, keeping the namespace protected.

Fixes: OCPBUGS-78765
Signed-off-by: Swati Mulje <smulje@redhat.com>
@smulje smulje force-pushed the egress-firewall-acl-security-fix branch 2 times, most recently from 35a91a5 to 4b15ddc Compare April 23, 2026 18:04
@smulje
Copy link
Copy Markdown
Author

smulje commented Apr 24, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit.

@smulje smulje marked this pull request as ready for review April 24, 2026 04:54
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2026
@openshift-ci openshift-ci Bot requested a review from tssurya April 24, 2026 04:54
@smulje smulje force-pushed the egress-firewall-acl-security-fix branch from 4b15ddc to d14551f Compare April 24, 2026 05:01
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit.

@smulje smulje force-pushed the egress-firewall-acl-security-fix branch from d14551f to 35a91a5 Compare April 24, 2026 05:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant