Skip to content

feat: persist IntentFailed for node pool version validation errors#5378

Open
ahitacat wants to merge 1 commit into
Azure:mainfrom
ahitacat:mark-upgrade-operation
Open

feat: persist IntentFailed for node pool version validation errors#5378
ahitacat wants to merge 1 commit into
Azure:mainfrom
ahitacat:mark-upgrade-operation

Conversation

@ahitacat
Copy link
Copy Markdown
Collaborator

@ahitacat ahitacat commented May 25, 2026

ARO-26304

What

Persist node pool version validation errors as IntentFailed conditions on the controller document instead of returning them as transient errors, and teach the operation controller to read those conditions to determine operation outcomes.

Similar work was done for control plane upgrade controller #5042

Why

Previously, version validation errors (downgrades, Cincinnati VersionNotFound, missing upgrade paths, exceeding control plane version, skipping minor versions) were returned as errors from SyncOnce, causing the controller to retry them indefinitely. These are user errors that won't resolve on retry. By persisting them as IntentFailed conditions, the operation controller can detect them and fail the operation with a clear error message, while still retrying transient Cincinnati errors (e.g. 503).

Testing

  • Added new TestNodePoolVersionSyncer_SyncOnce_IntentFailed table with 6 test cases covering: successful resolution (IntentFailed=False), downgrade, Cincinnati VersionNotFound, Cincinnati upstream error (transient), exceeds control plane, and skipping minor versions
  • Added operation controller tests covering: exact version match succeeds, IntentFailed marks operation failed, stale IntentFailed from previous version is ignored, 29s timeout for version resolution, and version mismatch without IntentFailed stays Accepted
  • Updated TestNodePoolVersionSyncer_SyncOnce_DesiredVersionUnchangedOnFailure_ChangedOnSuccess to reflect that VersionNotFound now returns nil (persists IntentFailed) instead of returning an error

Special notes for your reviewer

  • service_provider_cluster_conditions.go was renamed to service_provider_conditions.go since conditions now apply to both clusters and node pools
  • The operation controller uses a [version X.Y.Z] prefix in the IntentFailed condition message to scope conditions to a specific customer version request, preventing stale errors from blocking new requests. There could be another approach to consider a dedicated field for cleaner version scoping. feat: persist IntentFailed for node pool version validation errors #5378 (comment)
  • The operation controller waits twice (59s timout) the informers relist period used by the nodepool version controller. This would cover most of the calculation edge cases in which the new desiredVersion has not yet calculated.
  • subscriptionLister was added to the version controller for feature flag resolution (experimental release features)

PR Checklist

  • PR is scoped to a single task (no mixed concerns)
  • Title follows Conventional Commits format
  • Summary explains the "Why" behind the change
  • Linked to relevant ticket/issue
  • Screenshots included (if graph/UI/metrics changes)
  • Self-reviewed the diff
  • CI/CD checks are passing (ignore Tide)
  • Draft PR used for WIP (if applicable)
  • Commit history is clean (rebased/squashed)
  • Tricky code blocks are commented
  • Specific reviewers tagged
  • All comment threads resolved before merge

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ahitacat
Once this PR has been reviewed and has the lgtm label, please assign mbarnes for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

// The condition message is formatted as: "[version X.Y.Z] <error details>"
// This allows version-specific condition scoping without modifying the API type.
// TODO(PR review): Consider adding a new field to ControllerStatus (LastObservedCustomerDesiredVersion) for cleaner version tracking.
isRelevantIntentFailed := intentFailedCondition != nil &&
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding this prefix and calculate we could have add it in the controllerStatus. However, as this controllerStatus is shared among cluster and nodepool I didn't want to add conditions that weren't shared.

To the reviewers: Is this approach correct? or should I opt for adding a new field?

Copy link
Copy Markdown
Collaborator

@JameelB JameelB May 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I opt for adding a new field?

do you mean adding a LastObservedCustomerDesiredVersion to Controller.Status that you mention in the comment above? Is the Controller.Status shared with anything else apart from cluster/node pool?

My personal preference would be using LastObservedCustomerDesiredVersion since it could be useful for the cluster too? If we can't change the api though, i'm happy with what's proposed here since the prefix format is predictable.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean adding a LastObservedCustomerDesiredVersion to Controller.Status that you mention in the comment above?

Yes, I mean that. The controller.Status is not shared yet, but there are no many controllers and I thought it can be potentially shared.
I opted here to no add that into an object I don't know how it will be used. Although I didn't like to parse a message to get this information.

@miguelsorianod do you have any opinion about this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing this we opted to not based this in the message prefix wait at least 2 runs of the informers used by the controller in case of the condition is nil or intentFailed. This should be enough for the case I was trying to solve.
I'm testing this manually.

Copy link
Copy Markdown
Collaborator

@miguelsorianod miguelsorianod May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed yesterday, let's do the following approach:

When IntentFailed is set to false, we ignore it until at least there's been double the nodepool/serviceprovidernodepool informer relist period time.

This would cover most of the cases. If it's not enough we could consider increasing the time.
The time increase should only impact when there's an upgrade, and not during all update operations, as well as the specific case where the intent failed within the upgrade.
The increased experience time occurs when there's a failure so experiencing an increased time there is less impactful.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manual tested this. Created a node pool with version 4.20.8, upgrade it to 4.20.9 that it doesn't exists in Cincinnati. The controller failed, the intentFailed condition was set and the operation was marked as failed.

{
    "id": "/providers/Microsoft.RedHat
OpenShift/locations/westus3/hcpOperationStatuses/dff9145a-4559-4306-a64c-cf3727cf254b",
    "name": "dff9145a-4559-4306-a64c-cf3727cf254b",
    "status": "Failed",
    "startTime": "2026-05-27T11:03:16.203534595Z",
    "endTime": "2026-05-27T11:03:29.915339813Z",
    "error": {
        "code": "InvalidRequestContent",
        "message": "version 4.20.9 not found in Cincinnati channel stable-4.20"
    }
}

Then upgrade it to 4.20.15 and the operation succeeded.

@ahitacat ahitacat changed the title backend: persist IntentFailed for node pool version validation errors feat: persist IntentFailed for node pool version validation errors May 25, 2026
@ahitacat ahitacat marked this pull request as ready for review May 25, 2026 10:58
Copilot AI review requested due to automatic review settings May 25, 2026 10:58
@openshift-ci openshift-ci Bot requested review from geoberle and miguelsorianod May 25, 2026 10:58
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates node pool version control flow to persist user-facing version resolution failures via a controller condition, and adjusts operation processing to interpret that condition during node pool update operations.

Changes:

  • Persist/clear IntentFailed condition from NodePoolVersion syncer based on desired version resolution outcomes.
  • Enhance node pool update operation controller to gate failure/timeout behavior based on resolved desired version vs customer requested version.
  • Expand tests to cover IntentFailed persistence, stale condition handling, and mismatch timeout behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
internal/api/service_provider_cluster_conditions.go Clarifies condition/Reason docs to apply to both cluster and node pool resources.
backend/pkg/controllers/upgradecontrollers/nodepool_version_controller.go Persists IntentFailed on user errors; clears on success; uses subscription lister for feature gating.
backend/pkg/controllers/upgradecontrollers/nodepool_version_controller_test.go Adds/updates tests for IntentFailed persistence and adjusts validation call signature.
backend/pkg/controllers/operationcontrollers/operation_node_pool_update.go Introduces desired-version resolution state + timeout logic and merges with CS state to pick worst outcome.
backend/pkg/controllers/operationcontrollers/operation_node_pool_update_test.go Adds coverage for exact match, version mismatch, stale IntentFailed, and timeout paths.
backend/pkg/app/backend.go Wires subscriptionLister into the node pool version controller.

Comment thread backend/pkg/controllers/operationcontrollers/operation_node_pool_update.go Outdated
Comment thread backend/pkg/controllers/upgradecontrollers/nodepool_version_controller.go Outdated
Comment thread backend/pkg/controllers/upgradecontrollers/nodepool_version_controller_test.go Outdated
@ahitacat ahitacat force-pushed the mark-upgrade-operation branch 2 times, most recently from 826e622 to b2aebc6 Compare May 27, 2026 08:19
Copilot AI review requested due to automatic review settings May 27, 2026 08:19
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Comment thread backend/pkg/controllers/operationcontrollers/operation_node_pool_update.go Outdated
Comment thread backend/pkg/controllers/operationcontrollers/operation_node_pool_update.go Outdated
Comment thread backend/pkg/controllers/upgradecontrollers/nodepool_version_controller.go Outdated
Comment thread backend/pkg/controllers/operationcontrollers/operation_node_pool_update.go Outdated
@ahitacat ahitacat force-pushed the mark-upgrade-operation branch from b2aebc6 to 08d11a8 Compare May 27, 2026 08:50
Copilot AI review requested due to automatic review settings May 27, 2026 09:04
@ahitacat ahitacat force-pushed the mark-upgrade-operation branch from 08d11a8 to 6d363e1 Compare May 27, 2026 09:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comment thread backend/pkg/controllers/operationcontrollers/operation_node_pool_update.go Outdated
// Stay Accepted while resolution runs; fail once elapsed exceeds 59s from the first
// time this process observed the mismatch for this operation, so a
// controller restart does not immediately fail long-running operations.
if intentFailedCondition == nil || intentFailedCondition.Status != metav1.ConditionTrue || intentFailedCondition.Reason != api.VersionUpgradeNotAcceptedReason {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed, when there's no condition (nil) this would mark the operation as failed which shouldn't be the case.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I forgot about this. I have added a check when it is nil to keep the operation Accepted until it is resolved.

@ahitacat ahitacat force-pushed the mark-upgrade-operation branch from 6d363e1 to 4a62267 Compare May 27, 2026 09:34
Copilot AI review requested due to automatic review settings May 27, 2026 11:03
@ahitacat ahitacat force-pushed the mark-upgrade-operation branch from 4a62267 to 3f9d317 Compare May 27, 2026 11:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Copilot AI review requested due to automatic review settings May 28, 2026 10:51
@ahitacat ahitacat force-pushed the mark-upgrade-operation branch from 9a7079e to 8999840 Compare May 28, 2026 10:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Comment thread backend/pkg/controllers/upgradecontrollers/nodepool_version_controller.go Outdated
Comment thread backend/pkg/controllers/operationcontrollers/operation_node_pool_update.go Outdated
Comment thread backend/pkg/controllers/upgradecontrollers/nodepool_version_controller_test.go Outdated
Refactor the nodepool version controller to persist validation errors
(downgrades, Cincinnati VersionNotFound, missing upgrade paths) as
IntentFailed conditions on the controller document instead of returning
them as transient errors. Transient Cincinnati errors (e.g. 503) are
still returned for retry. Clear IntentFailed on successful validation.

Add the operation controller logic to read IntentFailed conditions and
determine the operation state, with a 59s timeout for version resolution
before marking the operation as failed. Making the nodepool version
controller being able to run twice the informers.

Rename service_provider_cluster_conditions.go to
service_provider_conditions.go to reflect that conditions now apply to
both clusters and node pools.

Pass subscriptionLister to the version controller for feature flag
resolution and fix all standalone SyncOnce tests that were missing it.

Add new table for test intentfailed in the nodepool version controller.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Alba Hita Catala <ahitacat@redhat.com>
@ahitacat ahitacat force-pushed the mark-upgrade-operation branch from 8999840 to df9b731 Compare May 28, 2026 11:09
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 30, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants