Skip to content

[FEA]: Allow NodeWright to add a persistent node cordon #284

Description

@natherz97

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

High

Component

Operator (controller-manager)

Problem description

We have a use-case that requires new nodes to be cordoned until a validation test succeeds. We would like NodeWright to cordon new nodes before it removes the runtime-required taint. After a node is cordoned and NodeWright removes the runtime-required taint, an external component would be responsible for uncordoning the node after the validation test succeeds.

Desired workflow:

  1. A new node launches with the runtime-required taint
  2. NodeWright runs all runtime-required SCRs and cordons the given node
  3. NodeWright removes the runtime-required taint from the node
  4. GPU Operator operands which do not tolerate the runtime-required taint deploy
  5. A test of validation tests succeed against the node and the node is uncordoned

Why can't an external component handle cordoning the node?

NodeWright currently cordons and uncordons nodes which have SRCs with package interrupts. NodeWright will uncordon nodes after the package interrupt completes even if the nodes were previously cordoned. If NodeWright was updated to not uncordon nodes which were externally cordoned, there would be a race with the external component cordoning the node and any package interrupts completing. Note that nodes cannot be configured to launch as cordoned and either need to be cordoned after the node object is created or as part of a mutating webhook on create node requests. Having NodeWright remove the taint and apply the cordon ensures that the node will always be gated from workloads prior to the validation test succeeding.

Why can't we use a different taint rather than a cordon?

NodeWright could be updated to tolerate an additional taint in addition to the runtime-required taint without issue. This taint could be applied to new nodes on start-up. However, all components which run after NodeWright would need to be updated to explicitly tolerate this new validation taint (and continue not tolerating the runtime-required taint). NodeWright has the benefit of running first which avoids the need to have external components (other than core networking addons) tolerate its taints. The validation test requires ~10 components to be running including GPU Operator and Network Operator daemonsets so toleration management would not be trivial. A cordon avoids this issue because daemonset pods automatically tolerate the taint corresponding to cordoned nodes.

Why can't we use the same taint?

A final alternative to having NodeWright cordon the node or use a different taint would be to keep the node tainted with the existing runtime-required taint until after the validation test succeeds (suppose if we added a runtime-required SCR which checked for the validation test completing). This isn't possible because the existing runtime-required taint is used as an ordering mechanism to deploy GPU Operator after NodeWright completes. If this taint persisted, we would need to implement a new mechanism to deploy GPU Operator after its prerequisite SCRs complete.

Feature description

This feature to cordon nodes prior to removing the runtime-required taint could be implemented with either:

  1. A global operator setting where the runtime-required taint is swapped with a cordon
  2. An SRC which cordons the node

For option 1, a user would need configure the operator to add the cordon any time it removes the runtime-required taint. This has the benefit that only new nodes with the runtime-required taint existing (from either a node pool setting or from the autoTaintNewNodes) will be cordoned. The the alternative approach with a dedicated SRC would result in both new and existing nodes targeted by new SRC being cordoned.

For option 2, we allow an SRC to apply a per-node setting with a K8s API call (setting spec.unschedulable on the given node object) rather than the existing mode where an on-host setting is modified (by creating a pod which targets the given node to modify the node's state). Since a K8s API call does not need to originate from the target node (for example by running a bash script on the target node which executes a kubectl command), we will assume that the K8s API call originates from the NodeWright operator. This allows the NodeWright operator to leverage its existing service account credentials which include cordon permission from the existing package interrupt behavior. Switching to a cordon node request rather than a create pod request which targets the given node also prevents a K8s client + credentials from being needed in NodeWright pods. Long-term, this pattern could be generalized to include other K8s API requests for modifying other node objects.

Describe your ideal solution

Option 1: add operator functionality to swap the runtime-required taint with a cordon

  1. Add a new operator setting which results in the runtime-required taint being replaced with a cordon.
  2. After all runtime-required SRCs complete for new nodes with the runtime-required taint present, the node will be cordoned prior to the taint being removed.
  3. Any existing nodes without the runtime-required taint will remain uncordoned because there's no taint to remove after any SRC completions.

Example operator property:

        containers:
        - command:
          - /manager
          env:
...
          - name: RUNTIME_REQUIRED_TAINT
            value: skyhook.nvidia.com=runtime-required:NoSchedule
          - name: CORDON_POST_RUNTIME
            value: true

Option 2: add SRC functionality to cordon nodes

  1. Add a new SRC package type which results in a node being cordoned.
  2. If we update NodeWright to not uncordon nodes which were cordoned before package interrupts are processed, the order in which this package is processed relative to other runtime-required SRCs does not matter since in all cases, the node will be cordoned prior to the runtime-required taint being removed.
  3. If we do not update NodeWright to not uncordon nodes which were cordoning before package interrupts, this cordon SRC package has to be processed last. We could either require a user to correctly configure the priority or special case this SRC package type to always be last regardless of the priority.
  4. In either case, the node will be cordoned prior to the last runtime-required SRC being processed which will result in the node being cordoned when the runtime-required taint is removed.

Example CRD with a new cordon field:

apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  labels:
    app.kubernetes.io/part-of: skyhook-operator
    app.kubernetes.io/created-by: skyhook-operator
  name: node-cordon
spec:
  runtimeRequired: true
  cordon: true
  packages: {}

The cordon field being true results in the node being cordoned. If the value is false, the cordon state is not modified. The node will remain cordoned even if the Skyhook is deleted or no longer targets a given node.

Alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow Skyhook's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request

Metadata

Metadata

Assignees

Labels

component/operatorSkyhook operator (controller-manager)needs-triageAwaiting initial triage by a maintainer
No fields configured for Enhancement.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions