Add Kubernetes mitigation manifest #29
Open. ClemDNL wants to merge 1 commit into V4bel:master from ClemDNL:add-kubernetes-mitigation
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| # Kubernetes mitigation | ||
|
|
||
| A self-contained Kubernetes manifest that applies the [Dirty Frag mitigation](../README.md#mitigation) to every Linux node in a cluster. | ||
|
|
||
| ## What it does | ||
|
|
||
| Deploys a DaemonSet (`dirtyfrag-mitigation` in `kube-system`) whose init container — running on every Linux node, including system pools — performs the steps from the disclosure README inside the host's namespaces via `nsenter`: | ||
|
|
||
| 1. Writes `/etc/modprobe.d/disable-dirtyfrag.conf` with an `install <module> /bin/false` line for each of `esp4`, `esp6` and `rxrpc`, so they cannot be loaded on demand. | ||
| 2. For each of these modules currently loaded with `refcnt=0`, runs `modprobe -r` to unload it from the live kernel. | ||
| 3. Runs `sync; echo 3 > /proc/sys/vm/drop_caches` to clear any contaminated cached pages. | ||
| 4. If any of these modules is loaded with `refcnt > 0` (in active use), emits a single aggregated Warning [Kubernetes Event](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/) (`reason=DirtyFragModulesInUse`) on the affected `Node` listing the in-use modules, so operators can drain and reboot/replace the node. **No auto-cordon.** | ||
|
|
||
| A long-running [`pause`](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#understanding-init-containers) container keeps the pod in `Running` state so the init container is only re-executed on pod recreation — i.e. on each new node that joins the cluster (autoscaling, node-image upgrade, scale-set rolling update). | ||
|
|
||
| ## Apply | ||
|
|
||
| ```bash | ||
| kubectl apply -f https://raw.githubusercontent.com/V4bel/dirtyfrag/master/k8s/dirtyfrag-mitigation.yaml | ||
| kubectl -n kube-system rollout status ds/dirtyfrag-mitigation | ||
| ``` | ||
|
|
||
| Check for nodes that need a drain+reboot to complete the mitigation (modules that were already in use): | ||
|
|
||
| ```bash | ||
| kubectl -n default get events --field-selector reason=DirtyFragModulesInUse | ||
| ``` | ||
|
|
||
| ## Compatibility | ||
|
|
||
| `esp4` and `esp6` provide IPsec ESP transforms; `rxrpc` provides the RxRPC socket family used by AFS. **None of these are required by a typical workload-only Kubernetes cluster.** | ||
|
|
||
| If your cluster does require one of these modules (e.g. a node-level IPsec tunnel, an AFS client running on the host or in a privileged pod), edit the `MODULES` env var in the manifest and remove the affected module(s) before applying — or label-exclude the affected node pool. | ||
|
|
||
| ## Revert (once upstream kernel patches roll out) | ||
|
|
||
| The modprobe drop-in persists for the lifetime of each node. To clean it up from live nodes before deleting the DaemonSet: | ||
|
|
||
| ```bash | ||
| # 1. Flip the init container into cleanup mode and roll the fleet | ||
| kubectl -n kube-system set env ds/dirtyfrag-mitigation CLEANUP_MODE=true | ||
| kubectl -n kube-system rollout restart ds/dirtyfrag-mitigation | ||
| kubectl -n kube-system rollout status ds/dirtyfrag-mitigation | ||
|
|
||
| # 2. Delete the DaemonSet, ServiceAccount and ClusterRole/Binding | ||
| kubectl delete -f https://raw.githubusercontent.com/V4bel/dirtyfrag/master/k8s/dirtyfrag-mitigation.yaml | ||
| ``` | ||
|
|
||
| If you skip step 1, the `/etc/modprobe.d/disable-dirtyfrag.conf` drop-in remains on existing nodes until each is recycled (node-image upgrade, scale-down, or manual `kubectl drain && kubectl delete node`). | ||
|
|
||
| ## Tested with | ||
|
|
||
| - Kubernetes 1.30 on AKS (Azure), across staging and production clusters | ||
| - Linux nodes only (the DaemonSet has `nodeSelector: kubernetes.io/os: linux` so Windows nodes are skipped automatically) |
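To confirm on a node that the drop-in from step 1 covers every module, a small check along these lines may help. This is a sketch and not part of the PR: the function name `dropin_missing` and the convention of passing the file path as the first argument are assumptions made for illustration. The expected line format (`install <module> /bin/false`) is taken from the manifest's init script.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): print every module from the
# list that has no matching "install <module> /bin/false" line in the
# given drop-in file. Prints nothing when all modules are covered.
dropin_missing() {
    file=$1
    shift
    for m in "$@"; do
        # -x matches the whole line, -F treats the pattern as a fixed string.
        if ! grep -qxF "install $m /bin/false" "$file" 2>/dev/null; then
            echo "$m"
        fi
    done
}

# Example, using the path and default MODULES value from the manifest:
# dropin_missing /etc/modprobe.d/disable-dirtyfrag.conf esp4 esp6 rxrpc
```

An empty result means every listed module has its `install` override. A missing or unreadable file is reported as all modules missing.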
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,300 @@ | ||
| # Dirty Frag Kubernetes mitigation | ||
| # | ||
| # Disclosure: https://github.com/V4bel/dirtyfrag | ||
| # | ||
| # This manifest applies the Dirty Frag mitigation recommended in the disclosure | ||
| # README to every Linux node in a Kubernetes cluster: | ||
| # | ||
| # printf 'install esp4 /bin/false\ninstall esp6 /bin/false\ninstall rxrpc /bin/false\n' \ | ||
| # > /etc/modprobe.d/dirtyfrag.conf | ||
| # rmmod esp4 esp6 rxrpc 2>/dev/null | ||
| # echo 3 > /proc/sys/vm/drop_caches | ||
| # | ||
| # It runs as a DaemonSet so that: | ||
| # - The mitigation is applied on every existing node, and | ||
| # - It is automatically re-applied to any new node that joins the cluster | ||
| # (autoscaling, node-image upgrade, scale-set rolling update, etc.) before | ||
| # workloads schedule onto it. | ||
| # | ||
| # How it works: | ||
| # - An init container enters the host's PID, mount, network, IPC and UTS | ||
| # namespaces with `nsenter -t 1 -m -u -i -n -p` and: | ||
| # 1. Writes /etc/modprobe.d/disable-dirtyfrag.conf so esp4, esp6 and | ||
| # rxrpc cannot be loaded on demand. | ||
| # 2. For each module currently loaded with refcnt=0, runs `modprobe -r` | ||
| # to unload it from the live kernel. | ||
| # 3. Runs `sync; echo 3 > /proc/sys/vm/drop_caches` to clear any | ||
| # contaminated cached pages (gated on DROP_CACHES, default true). | ||
| # 4. If any module remains loaded with refcnt > 0, emits a single | ||
| # aggregated Warning Kubernetes Event (reason=DirtyFragModulesInUse) | ||
| # on the Node listing the in-use modules so operators can drain and | ||
| # reboot/replace the node. This DaemonSet does NOT auto-cordon. | ||
| # - A long-running `pause` container keeps the pod in Running state so the | ||
| # init container is only re-executed on pod recreation (i.e. on each new | ||
| # node). | ||
| # | ||
| # Compatibility note: | ||
| # esp4 and esp6 provide IPsec ESP transforms; rxrpc provides the RxRPC | ||
| # socket family used by AFS. If any of your workloads (or the host network) | ||
| # require these modules, do NOT apply this manifest as-is — either remove | ||
| # the affected module(s) from the MODULES env var below, or label-exclude | ||
| # the affected node pool. On a typical workload-only Kubernetes cluster | ||
| # none of these modules are in use. | ||
| # | ||
| # Reverting once upstream kernel patches roll out: | ||
| # 1. Run a cleanup pass first to remove the modprobe drop-in from live | ||
| # nodes (the init container's CLEANUP_MODE branch removes the file, | ||
| # runs depmod -a, and attempts modprobe -r on each listed module): | ||
| # | ||
| # kubectl -n kube-system set env ds/dirtyfrag-mitigation CLEANUP_MODE=true | ||
| # kubectl -n kube-system rollout restart ds/dirtyfrag-mitigation | ||
| # kubectl -n kube-system rollout status ds/dirtyfrag-mitigation | ||
| # | ||
| # 2. Then delete the resources: | ||
| # | ||
| # kubectl delete -f dirtyfrag-mitigation.yaml | ||
| # | ||
| # If you skip step 1, the modprobe drop-in remains on existing nodes until | ||
| # each is recycled (node-image upgrade, scale-down, or manual drain+delete). | ||
| # | ||
| # Tested with Kubernetes 1.27+ on AKS, EKS, and GKE (Linux nodes only). | ||
| --- | ||
| apiVersion: v1 | ||
| kind: ServiceAccount | ||
| metadata: | ||
| name: dirtyfrag-mitigation | ||
| namespace: kube-system | ||
| labels: | ||
| app.kubernetes.io/name: dirtyfrag-mitigation | ||
| app.kubernetes.io/component: cve-mitigation | ||
| --- | ||
| apiVersion: rbac.authorization.k8s.io/v1 | ||
| kind: ClusterRole | ||
| metadata: | ||
| name: dirtyfrag-mitigation | ||
| labels: | ||
| app.kubernetes.io/name: dirtyfrag-mitigation | ||
| app.kubernetes.io/component: cve-mitigation | ||
| rules: | ||
| # Read node metadata so we can address Events to the running node. | ||
| - apiGroups: [""] | ||
| resources: ["nodes"] | ||
| verbs: ["get"] | ||
| # Emit Warning Events when any module is in use (refcount > 0). | ||
| - apiGroups: [""] | ||
| resources: ["events"] | ||
| verbs: ["create", "patch"] | ||
| - apiGroups: ["events.k8s.io"] | ||
| resources: ["events"] | ||
| verbs: ["create", "patch"] | ||
| --- | ||
| apiVersion: rbac.authorization.k8s.io/v1 | ||
| kind: ClusterRoleBinding | ||
| metadata: | ||
| name: dirtyfrag-mitigation | ||
| labels: | ||
| app.kubernetes.io/name: dirtyfrag-mitigation | ||
| app.kubernetes.io/component: cve-mitigation | ||
| roleRef: | ||
| apiGroup: rbac.authorization.k8s.io | ||
| kind: ClusterRole | ||
| name: dirtyfrag-mitigation | ||
| subjects: | ||
| - kind: ServiceAccount | ||
| name: dirtyfrag-mitigation | ||
| namespace: kube-system | ||
| --- | ||
| apiVersion: apps/v1 | ||
| kind: DaemonSet | ||
| metadata: | ||
| name: dirtyfrag-mitigation | ||
| namespace: kube-system | ||
| labels: | ||
| app.kubernetes.io/name: dirtyfrag-mitigation | ||
| app.kubernetes.io/component: cve-mitigation | ||
| spec: | ||
| selector: | ||
| matchLabels: | ||
| app.kubernetes.io/name: dirtyfrag-mitigation | ||
| updateStrategy: | ||
| type: RollingUpdate | ||
| rollingUpdate: | ||
| maxUnavailable: 100% # init container is fast; roll the whole fleet at once | ||
| template: | ||
| metadata: | ||
| labels: | ||
| app.kubernetes.io/name: dirtyfrag-mitigation | ||
| app.kubernetes.io/component: cve-mitigation | ||
| spec: | ||
| hostPID: true | ||
| priorityClassName: system-node-critical | ||
| serviceAccountName: dirtyfrag-mitigation | ||
| automountServiceAccountToken: true | ||
| # Run on every Linux node, including system/critical pools. | ||
| nodeSelector: | ||
| kubernetes.io/os: linux | ||
| tolerations: | ||
| - operator: Exists | ||
| terminationGracePeriodSeconds: 5 | ||
| initContainers: | ||
| - name: apply-mitigation | ||
| image: busybox:1.36.1 | ||
| imagePullPolicy: IfNotPresent | ||
| securityContext: | ||
| privileged: true | ||
| runAsUser: 0 | ||
| env: | ||
| - name: NODE_NAME | ||
| valueFrom: | ||
| fieldRef: | ||
| fieldPath: spec.nodeName | ||
| # Node Events follow the kubelet convention of being created in | ||
| # the `default` namespace; cluster-scoped objects like Nodes | ||
| # cannot have a namespaced involvedObject reference. | ||
| - name: EVENT_NAMESPACE | ||
| value: "default" | ||
| # Set CLEANUP_MODE=true (e.g. via `kubectl set env`) to flip the | ||
| # init container into removing the modprobe drop-in instead of | ||
| # writing it. Use this for a full rollout pass before deleting | ||
| # the DaemonSet, to clean up live nodes. | ||
| - name: CLEANUP_MODE | ||
| value: "false" | ||
| # Set DROP_CACHES=false to skip `echo 3 > /proc/sys/vm/drop_caches` | ||
| # (the page-cache flush after unloading modules). Default true, | ||
| # matching the disclosure's recommended mitigation. | ||
| - name: DROP_CACHES | ||
| value: "true" | ||
| # Space-separated list of modules to blacklist + unload. Edit this | ||
| # if you need to keep one of these modules available (e.g. IPsec | ||
| # via esp4/esp6, AFS via rxrpc). | ||
| - name: MODULES | ||
| value: "esp4 esp6 rxrpc" | ||
| command: ["/bin/sh", "-c"] | ||
| args: | ||
| - | | ||
| set -eu | ||
|
|
||
| MODPROBE_FILE=/etc/modprobe.d/disable-dirtyfrag.conf | ||
|
|
||
| if [ "${CLEANUP_MODE}" = "true" ]; then | ||
| echo "[dirtyfrag] CLEANUP mode on node ${NODE_NAME}: removing mitigation" | ||
| nsenter -t 1 -m -u -i -n -p -- sh -c "rm -f ${MODPROBE_FILE}; depmod -a 2>/dev/null || true; for m in ${MODULES}; do modprobe -r \$m 2>/dev/null || true; done; true" | ||
| echo "[dirtyfrag] cleanup complete on ${NODE_NAME}" | ||
| exit 0 | ||
| fi | ||
|
|
||
| echo "[dirtyfrag] applying mitigation on node ${NODE_NAME} for modules: ${MODULES}" | ||
|
|
||
| # 1. Persist modprobe blacklist so the modules cannot be loaded on demand. | ||
| # Rewrite the file from scratch (idempotent) to keep ordering stable | ||
| # and match the disclosure's recommended single-file form. | ||
| nsenter -t 1 -m -u -i -n -p -- sh -c " | ||
| set -eu | ||
| TMP=\$(mktemp ${MODPROBE_FILE}.XXXXXX) | ||
| for m in ${MODULES}; do | ||
| printf 'install %s /bin/false\n' \"\$m\" >> \"\$TMP\" | ||
| done | ||
| if [ -f ${MODPROBE_FILE} ] && cmp -s \"\$TMP\" ${MODPROBE_FILE}; then | ||
| rm -f \"\$TMP\" | ||
| echo '[dirtyfrag] ${MODPROBE_FILE} already up to date' | ||
| else | ||
| mv \"\$TMP\" ${MODPROBE_FILE} | ||
| chmod 0644 ${MODPROBE_FILE} | ||
| echo '[dirtyfrag] wrote ${MODPROBE_FILE}' | ||
| fi | ||
| depmod -a 2>/dev/null || true | ||
| " | ||
|
|
||
| # 2. For each module: if currently loaded, try to unload. Track in-use | ||
| # modules so we can emit a single aggregated Warning Event. | ||
| IN_USE="" | ||
| for m in ${MODULES}; do | ||
| REFCNT_PATH=/sys/module/${m}/refcnt | ||
| if nsenter -t 1 -m -u -i -n -p -- test -f "${REFCNT_PATH}"; then | ||
| REFCNT=$(nsenter -t 1 -m -u -i -n -p -- cat "${REFCNT_PATH}") | ||
| echo "[dirtyfrag] ${m} is loaded with refcnt=${REFCNT}" | ||
|
|
||
| if [ "${REFCNT}" = "0" ]; then | ||
| if nsenter -t 1 -m -u -i -n -p -- modprobe -r ${m} 2>&1; then | ||
| echo "[dirtyfrag] successfully unloaded ${m}" | ||
| else | ||
| echo "[dirtyfrag] WARNING: modprobe -r ${m} failed despite refcnt=0" | ||
| IN_USE="${IN_USE}${IN_USE:+,}${m}(rmmod-failed)" | ||
| fi | ||
| else | ||
| echo "[dirtyfrag] WARNING: ${m} in use (refcnt=${REFCNT}); node ${NODE_NAME} requires drain+reboot for full mitigation" | ||
| IN_USE="${IN_USE}${IN_USE:+,}${m}(refcnt=${REFCNT})" | ||
| fi | ||
| else | ||
| echo "[dirtyfrag] ${m} is not loaded; modprobe blacklist will prevent future loads" | ||
| fi | ||
| done | ||
|
|
||
| # 3. Drop page caches to clear any contaminated cached pages, per the | ||
| # disclosure's mitigation guidance. Best-effort. | ||
| if [ "${DROP_CACHES}" = "true" ]; then | ||
| if nsenter -t 1 -m -u -i -n -p -- sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' 2>/dev/null; then | ||
| echo "[dirtyfrag] dropped page caches" | ||
| else | ||
| echo "[dirtyfrag] WARNING: failed to drop page caches" | ||
| fi | ||
| fi | ||
|
|
||
| # 4. If any module was in-use, emit a single aggregated Warning Event | ||
| # on the Node so operators get an actionable signal. | ||
| # Best-effort: do not fail the init container if the API call fails. | ||
| # BusyBox `wget --no-check-certificate` is used because BusyBox wget | ||
| # does not support `--ca-certificate`; the bearer token still | ||
| # authenticates us to the API server, and the endpoint is the | ||
| # in-cluster `kubernetes.default.svc` ClusterIP, so skipping TLS | ||
| # chain validation is an accepted trade-off for a best-effort emitter. | ||
| if [ -n "${IN_USE}" ]; then | ||
| TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) | ||
| APISERVER=https://kubernetes.default.svc | ||
| NODE_UID=$(wget -qO- --no-check-certificate \ | ||
| --header="Authorization: Bearer ${TOKEN}" \ | ||
| "${APISERVER}/api/v1/nodes/${NODE_NAME}" 2>/dev/null | \ | ||
| sed -n 's/.*"uid":[[:space:]]*"\([^"]*\)".*/\1/p' | head -1 || true) | ||
| TS=$(date -u +%Y-%m-%dT%H:%M:%SZ) | ||
| EVENT_NAME="dirtyfrag-mitigation.${NODE_NAME}.$(date +%s)" | ||
| EVENT_BODY=$(cat <<EOF | ||
| {"apiVersion":"v1","kind":"Event","metadata":{"name":"${EVENT_NAME}","namespace":"${EVENT_NAMESPACE}"},"involvedObject":{"apiVersion":"v1","kind":"Node","name":"${NODE_NAME}","uid":"${NODE_UID}"},"reason":"DirtyFragModulesInUse","message":"Dirty Frag: the following kernel modules are in use and could not be unloaded: ${IN_USE}. Drain and reboot/replace this node to fully mitigate.","type":"Warning","firstTimestamp":"${TS}","lastTimestamp":"${TS}","count":1,"source":{"component":"dirtyfrag-mitigation"}} | ||
| EOF | ||
| ) | ||
| if wget -qO- --no-check-certificate \ | ||
| --header="Authorization: Bearer ${TOKEN}" \ | ||
| --header="Content-Type: application/json" \ | ||
| --post-data="${EVENT_BODY}" \ | ||
| "${APISERVER}/api/v1/namespaces/${EVENT_NAMESPACE}/events" >/dev/null 2>&1; then | ||
| echo "[dirtyfrag] emitted Warning Event ${EVENT_NAME} (in-use: ${IN_USE})" | ||
| else | ||
| echo "[dirtyfrag] WARNING: failed to emit Kubernetes Event" | ||
| fi | ||
| fi | ||
|
|
||
| echo "[dirtyfrag] mitigation complete on ${NODE_NAME}" | ||
| resources: | ||
| requests: | ||
| cpu: 10m | ||
| memory: 16Mi | ||
| limits: | ||
| cpu: 100m | ||
| memory: 64Mi | ||
| containers: | ||
| # Long-running placeholder so the pod stays Running and the init | ||
| # container is re-executed only on pod recreate (i.e. on each new node). | ||
| - name: pause | ||
| image: registry.k8s.io/pause:3.10.1 | ||
| imagePullPolicy: IfNotPresent | ||
| resources: | ||
| requests: | ||
| cpu: 1m | ||
| memory: 8Mi | ||
| limits: | ||
| cpu: 10m | ||
| memory: 16Mi | ||
| securityContext: | ||
| allowPrivilegeEscalation: false | ||
| readOnlyRootFilesystem: true | ||
| capabilities: | ||
| drop: ["ALL"] | ||
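Before rolling the DaemonSet out, it can be useful to preview which nodes would end up with a `DirtyFragModulesInUse` Event. The sketch below mirrors the refcnt check from step 2 of the init script and builds the same comma-separated `IN_USE` string, but never unloads anything. The function name `classify_in_use` and the `SYSFS_ROOT` override are hypothetical and exist only so the logic can be exercised outside a real node. It does not reproduce the `(rmmod-failed)` entries, since those can only arise from an actual unload attempt.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of the manifest): report which
# modules are loaded with a non-zero refcount, in the same format the
# init container uses for IN_USE. SYSFS_ROOT defaults to /sys and is
# overridable for testing.
classify_in_use() {
    sysfs_root=${SYSFS_ROOT:-/sys}
    in_use=""
    for m in "$@"; do
        refcnt_path="$sysfs_root/module/$m/refcnt"
        # A missing refcnt file means the module is not loaded as a module.
        [ -f "$refcnt_path" ] || continue
        refcnt=$(cat "$refcnt_path")
        if [ "$refcnt" != "0" ]; then
            in_use="${in_use}${in_use:+,}${m}(refcnt=${refcnt})"
        fi
    done
    echo "$in_use"
}

# Example, using the manifest's default MODULES value:
# classify_in_use esp4 esp6 rxrpc
```

An empty line of output means no listed module is in use, so the init container would be expected to finish without emitting an Event.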
What is the reason for entering all host namespaces here?
Writing the modprobe configuration could be achieved via a hostPath mount of `/etc/modprobe.d`, and module management could be achieved by granting the container the SYS_MODULE capability.
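A rough sketch of how the init script might look under that suggestion follows. It assumes the pod spec mounts the node's `/etc/modprobe.d` through a hostPath volume (here at `/host/modprobe.d`) and grants `SYS_MODULE` instead of `privileged: true`; neither change is in this PR. The function name, the `HOST_MODPROBE_DIR` mount path, and the `SYSFS_ROOT` and `RMMOD` overrides are hypothetical and included only so the logic can be tried outside a node.

```shell
#!/bin/sh
# Sketch of the reviewer's suggestion; nothing here is in the PR.
# Assumes HOST_MODPROBE_DIR is a hostPath mount of the node's
# /etc/modprobe.d and that the container holds CAP_SYS_MODULE.
apply_without_nsenter() {
    dir=${HOST_MODPROBE_DIR:-/host/modprobe.d}
    conf="$dir/disable-dirtyfrag.conf"
    tmp="$conf.tmp.$$"

    # Write the drop-in to a temp file, then rename it into place.
    : > "$tmp"
    for m in "$@"; do
        printf 'install %s /bin/false\n' "$m" >> "$tmp"
    done
    chmod 0644 "$tmp"
    mv "$tmp" "$conf"

    # Module state is kernel-wide, so /sys/module should be readable from
    # inside the container. rmmod asks the kernel to unload directly.
    for m in "$@"; do
        refcnt_path="${SYSFS_ROOT:-/sys}/module/$m/refcnt"
        if [ -f "$refcnt_path" ] && [ "$(cat "$refcnt_path")" = "0" ]; then
            ${RMMOD:-rmmod} "$m" || echo "WARNING: could not unload $m"
        fi
    done
}

# Example, using the manifest's default MODULES value:
# apply_without_nsenter esp4 esp6 rxrpc
```

This covers steps 1 and 2 only. Step 3 (`drop_caches`) writes under `/proc/sys`, which container runtimes typically expose read-only to non-privileged containers, so it would likely need separate handling.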