feat: persist IntentFailed for node pool version validation errors#5378
feat: persist IntentFailed for node pool version validation errors#5378ahitacat wants to merge 1 commit into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ahitacat The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Skipping CI for Draft Pull Request. |
| // The condition message is formatted as: "[version X.Y.Z] <error details>" | ||
| // This allows version-specific condition scoping without modifying the API type. | ||
| // TODO(PR review): Consider adding a new field to ControllerStatus (LastObservedCustomerDesiredVersion) for cleaner version tracking. | ||
| isRelevantIntentFailed := intentFailedCondition != nil && |
There was a problem hiding this comment.
Instead of adding this prefix and calculate we could have add it in the controllerStatus. However, as this controllerStatus is shared among cluster and nodepool I didn't want to add conditions that weren't shared.
To the reviewers: Is this approach correct? or should I opt for adding a new field?
There was a problem hiding this comment.
should I opt for adding a new field?
do you mean adding a LastObservedCustomerDesiredVersion to Controller.Status that you mention in the comment above? Is the Controller.Status shared with anything else apart from cluster/node pool?
My personal preference would be using LastObservedCustomerDesiredVersion since it could be useful for the cluster too? If we can't change the api though, i'm happy with what's proposed here since the prefix format is predictable.
There was a problem hiding this comment.
do you mean adding a LastObservedCustomerDesiredVersion to Controller.Status that you mention in the comment above?
Yes, I mean that. The controller.Status is not shared yet, but there are no many controllers and I thought it can be potentially shared.
I opted here to no add that into an object I don't know how it will be used. Although I didn't like to parse a message to get this information.
@miguelsorianod do you have any opinion about this?
There was a problem hiding this comment.
After reviewing this we opted to not based this in the message prefix wait at least 2 runs of the informers used by the controller in case of the condition is nil or intentFailed. This should be enough for the case I was trying to solve.
I'm testing this manually.
There was a problem hiding this comment.
As discussed yesterday, let's do the following approach:
When IntentFailed is set to false, we ignore it until at least there's been double the nodepool/serviceprovidernodepool informer relist period time.
This would cover most of the cases. If it's not enough we could consider increasing the time.
The time increase should only impact when there's an upgrade, and not during all update operations, as well as the specific case where the intent failed within the upgrade.
The increased experience time occurs when there's a failure so experiencing an increased time there is less impactful.
There was a problem hiding this comment.
I manual tested this. Created a node pool with version 4.20.8, upgrade it to 4.20.9 that it doesn't exists in Cincinnati. The controller failed, the intentFailed condition was set and the operation was marked as failed.
{
"id": "/providers/Microsoft.RedHat
OpenShift/locations/westus3/hcpOperationStatuses/dff9145a-4559-4306-a64c-cf3727cf254b",
"name": "dff9145a-4559-4306-a64c-cf3727cf254b",
"status": "Failed",
"startTime": "2026-05-27T11:03:16.203534595Z",
"endTime": "2026-05-27T11:03:29.915339813Z",
"error": {
"code": "InvalidRequestContent",
"message": "version 4.20.9 not found in Cincinnati channel stable-4.20"
}
}
Then upgrade it to 4.20.15 and the operation succeeded.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates node pool version control flow to persist user-facing version resolution failures via a controller condition, and adjusts operation processing to interpret that condition during node pool update operations.
Changes:
- Persist/clear
IntentFailedcondition fromNodePoolVersionsyncer based on desired version resolution outcomes. - Enhance node pool update operation controller to gate failure/timeout behavior based on resolved desired version vs customer requested version.
- Expand tests to cover IntentFailed persistence, stale condition handling, and mismatch timeout behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/api/service_provider_cluster_conditions.go | Clarifies condition/Reason docs to apply to both cluster and node pool resources. |
| backend/pkg/controllers/upgradecontrollers/nodepool_version_controller.go | Persists IntentFailed on user errors; clears on success; uses subscription lister for feature gating. |
| backend/pkg/controllers/upgradecontrollers/nodepool_version_controller_test.go | Adds/updates tests for IntentFailed persistence and adjusts validation call signature. |
| backend/pkg/controllers/operationcontrollers/operation_node_pool_update.go | Introduces desired-version resolution state + timeout logic and merges with CS state to pick worst outcome. |
| backend/pkg/controllers/operationcontrollers/operation_node_pool_update_test.go | Adds coverage for exact match, version mismatch, stale IntentFailed, and timeout paths. |
| backend/pkg/app/backend.go | Wires subscriptionLister into the node pool version controller. |
826e622 to
b2aebc6
Compare
b2aebc6 to
08d11a8
Compare
08d11a8 to
6d363e1
Compare
| // Stay Accepted while resolution runs; fail once elapsed exceeds 59s from the first | ||
| // time this process observed the mismatch for this operation, so a | ||
| // controller restart does not immediately fail long-running operations. | ||
| if intentFailedCondition == nil || intentFailedCondition.Status != metav1.ConditionTrue || intentFailedCondition.Reason != api.VersionUpgradeNotAcceptedReason { |
There was a problem hiding this comment.
As we discussed, when there's no condition (nil) this would mark the operation as failed which shouldn't be the case.
There was a problem hiding this comment.
Indeed, I forgot about this. I have added a check when it is nil to keep the operation Accepted until it is resolved.
6d363e1 to
4a62267
Compare
4a62267 to
3f9d317
Compare
3f9d317 to
9a7079e
Compare
9a7079e to
8999840
Compare
Refactor the nodepool version controller to persist validation errors (downgrades, Cincinnati VersionNotFound, missing upgrade paths) as IntentFailed conditions on the controller document instead of returning them as transient errors. Transient Cincinnati errors (e.g. 503) are still returned for retry. Clear IntentFailed on successful validation. Add the operation controller logic to read IntentFailed conditions and determine the operation state, with a 59s timeout for version resolution before marking the operation as failed. Making the nodepool version controller being able to run twice the informers. Rename service_provider_cluster_conditions.go to service_provider_conditions.go to reflect that conditions now apply to both clusters and node pools. Pass subscriptionLister to the version controller for feature flag resolution and fix all standalone SyncOnce tests that were missing it. Add new table for test intentfailed in the nodepool version controller. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Alba Hita Catala <ahitacat@redhat.com>
8999840 to
df9b731
Compare
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
ARO-26304
What
Persist node pool version validation errors as
IntentFailedconditions on the controller document instead of returning them as transient errors, and teach the operation controller to read those conditions to determine operation outcomes.Similar work was done for control plane upgrade controller #5042
Why
Previously, version validation errors (downgrades, Cincinnati
VersionNotFound, missing upgrade paths, exceeding control plane version, skipping minor versions) were returned as errors fromSyncOnce, causing the controller to retry them indefinitely. These are user errors that won't resolve on retry. By persisting them asIntentFailedconditions, the operation controller can detect them and fail the operation with a clear error message, while still retrying transient Cincinnati errors (e.g. 503).Testing
TestNodePoolVersionSyncer_SyncOnce_IntentFailedtable with 6 test cases covering: successful resolution (IntentFailed=False), downgrade, Cincinnati VersionNotFound, Cincinnati upstream error (transient), exceeds control plane, and skipping minor versionsTestNodePoolVersionSyncer_SyncOnce_DesiredVersionUnchangedOnFailure_ChangedOnSuccessto reflect that VersionNotFound now returns nil (persists IntentFailed) instead of returning an errorSpecial notes for your reviewer
service_provider_cluster_conditions.gowas renamed toservice_provider_conditions.gosince conditions now apply to both clusters and node pools[version X.Y.Z]prefix in the IntentFailed condition message to scope conditions to a specific customer version request, preventing stale errors from blocking new requests. There could be another approach to consider a dedicated field for cleaner version scoping. feat: persist IntentFailed for node pool version validation errors #5378 (comment)subscriptionListerwas added to the version controller for feature flag resolution (experimental release features)PR Checklist