OCPBUGS-82191: Surface async GCP instance creation errors#160
OCPBUGS-82191: Surface async GCP instance creation errors#160RadekManak wants to merge 1 commit into
Conversation
|
@RadekManak: This pull request references Jira Issue OCPBUGS-82191, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughCreate flow now checks recent zone insert operations before inserting an instance, requeues for in-progress/conflict cases, records provider-side operation errors to MachineCreated, and treats InstancesGet not-found as transient. Compute service API and mocks gained zone-operations list support; tests expanded for async insert scenarios. Changes
Sequence DiagramsequenceDiagram
participant Reconciler as Reconciler
participant ZoneOps as ZoneOperations
participant ComputeAPI as GCP Compute API
Reconciler->>ZoneOps: ZoneOperationsList(filter by target instance/selfLink)
ZoneOps-->>Reconciler: Operation (found / not found)
alt Operation exists and status != "DONE"
Reconciler->>Reconciler: RequeueAfter
else Operation exists and status == "DONE" and has errors
Reconciler->>Reconciler: recordFailedInstanceCreate()
Reconciler-->>Reconciler: Return error
else No relevant operation or DONE without errors
Reconciler->>ComputeAPI: InstancesInsert(instance)
alt InstancesInsert returns 409 Conflict
Reconciler->>Reconciler: RequeueAfter
else InstancesInsert returns Operation
ComputeAPI-->>Reconciler: Operation
Reconciler->>Reconciler: operationError() check
alt operationError present
Reconciler->>Reconciler: recordFailedInstanceCreate()
Reconciler-->>Reconciler: Return error
else
Reconciler->>Reconciler: Continue (creation in progress)
end
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/jira refresh |
|
@RadekManak: This pull request references Jira Issue OCPBUGS-82191, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/cloud/gcp/actuators/machine/reconciler.go (1)
462-470: Record failed-create metrics for async operation failures too.This new return path updates conditions, but it bypasses
metrics.RegisterFailedInstanceCreate, so async provider-side failures won't show up in the existing failed-create metrics/alerts.📈 Proposed fix
if err := operationError(insertOp); err != nil { + metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{ + Name: r.machine.Name, + Namespace: r.machine.Namespace, + Reason: "failed to create instance via compute service", + }) r.providerStatus.Conditions = reconcileConditions(r.providerStatus.Conditions, metav1.Condition{ Type: string(machinev1.MachineCreated), Reason: machineCreationFailedReason, Message: err.Error(), Status: metav1.ConditionFalse, })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cloud/gcp/actuators/machine/reconciler.go` around lines 462 - 470, The return path when operationError(insertOp) is true updates r.providerStatus.Conditions but skips recording failed-create metrics; call metrics.RegisterFailedInstanceCreate (with the same labels/context used elsewhere for machine failures) immediately before returning from the operationError(insertOp) branch so async/provider-side failures are counted, keeping the existing condition update; locate the operationError(insertOp) check in reconciler.go and add the metrics.RegisterFailedInstanceCreate invocation prior to the fmt.Errorf return.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cloud/gcp/actuators/machine/reconciler.go`:
- Around line 423-430: The current logic in reconciler.go's
latestVisibleInsertOperation check only skips re-create when existingOp.Status
!= "DONE" but allows a DONE operation with provider-side errors to fall through
to InstancesInsert; update the check in the reconciler (function
latestVisibleInsertOperation usage) to detect when existingOp.Status == "DONE"
and existingOp.Error != nil && len(existingOp.Error.Errors) > 0 and treat that
as a terminal failure: log the error (including existingOp.Error.Errors and
r.machine.Name), surface or return an appropriate non-retry error instead of
proceeding to InstancesInsert, and ensure any Requeue/Failure state prevents
issuing a new InstancesInsert for that same visible insert operation.
---
Nitpick comments:
In `@pkg/cloud/gcp/actuators/machine/reconciler.go`:
- Around line 462-470: The return path when operationError(insertOp) is true
updates r.providerStatus.Conditions but skips recording failed-create metrics;
call metrics.RegisterFailedInstanceCreate (with the same labels/context used
elsewhere for machine failures) immediately before returning from the
operationError(insertOp) branch so async/provider-side failures are counted,
keeping the existing condition update; locate the operationError(insertOp) check
in reconciler.go and add the metrics.RegisterFailedInstanceCreate invocation
prior to the fmt.Errorf return.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: d4c33b05-57d3-4057-b16a-74abdfd9d1eb
📒 Files selected for processing (4)
pkg/cloud/gcp/actuators/machine/reconciler.gopkg/cloud/gcp/actuators/machine/reconciler_test.gopkg/cloud/gcp/actuators/services/compute/computeservice.gopkg/cloud/gcp/actuators/services/compute/computeservice_mock.go
3441b64 to
8e39814
Compare
|
@RadekManak: This pull request references Jira Issue OCPBUGS-82191, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cloud/gcp/actuators/machine/reconciler.go`:
- Around line 504-506: The current check using isNotFoundError(...) only matches
a bare *googleapi.Error and misses wrapped 404 errors; update the logic so the
code uses errors.As to detect a *googleapi.Error (or modify isNotFoundError to
internally use errors.As) and then checks err.Code == 404 before returning
&machinecontroller.RequeueAfterError{RequeueAfter: requeueAfterSeconds *
time.Second}; apply this change for all spots that currently call
isNotFoundError for requeue handling (the 404 requeue branches in reconciler.go,
e.g., the places around the RequeueAfterError returns) so wrapped errors are
handled consistently like the 409 handling that already uses errors.As.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 0f81abaf-5607-41b7-b590-375f1cedc8de
📒 Files selected for processing (4)
pkg/cloud/gcp/actuators/machine/reconciler.gopkg/cloud/gcp/actuators/machine/reconciler_test.gopkg/cloud/gcp/actuators/services/compute/computeservice.gopkg/cloud/gcp/actuators/services/compute/computeservice_mock.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/cloud/gcp/actuators/services/compute/computeservice.go
- pkg/cloud/gcp/actuators/services/compute/computeservice_mock.go
- pkg/cloud/gcp/actuators/machine/reconciler_test.go
8e39814 to
f5b6f13
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cloud/gcp/actuators/machine/reconciler.go`:
- Around line 504-506: reconcileMachineWithCloudState currently treats an
InstancesGet 404 as a transient condition and always returns a
RequeueAfterError, which causes endless retries when invoked from update();
change reconcileMachineWithCloudState signature to accept a boolean (e.g.,
treatNotFoundAsTransient bool) and only return
&machinecontroller.RequeueAfterError{...} when isNotFoundError(err) &&
treatNotFoundAsTransient is true; update all call sites so create() invokes
reconcileMachineWithCloudState(..., true) and update() invokes
reconcileMachineWithCloudState(..., false), leaving isNotFoundError and
RequeueAfterError logic intact otherwise.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: d9048dd5-fbe5-4632-917f-04e678716bbd
📒 Files selected for processing (4)
pkg/cloud/gcp/actuators/machine/reconciler.gopkg/cloud/gcp/actuators/machine/reconciler_test.gopkg/cloud/gcp/actuators/services/compute/computeservice.gopkg/cloud/gcp/actuators/services/compute/computeservice_mock.go
|
@RadekManak can we get help getting this reviewed please - its a release blocker |
Make provider-side instance creation failures visible during reconciliation by checking the returned insert operation and visible in-flight operations instead of relying only on instance lookup.
f5b6f13 to
73fee98
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/cloud/gcp/actuators/machine/reconciler_test.go (1)
197-203: AssertZoneOperationsListquery arguments to lock behavior.These mocks drive return shape, but they don’t validate the
filterandorderByinputs. Adding assertions for expected query content would prevent silent regressions inlatestVisibleInsertOperation()construction.Also applies to: 220-224, 1178-1187
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cloud/gcp/actuators/machine/reconciler_test.go` around lines 197 - 203, The mock implementation for mockZoneOperationsList used in tests should assert the incoming query args (filter and orderBy, and optionally project/zone) to ensure latestVisibleInsertOperation() constructs the correct query; update the mock callbacks (mockZoneOperationsList at the instances around the given diffs and the ones at lines ~220 and ~1178) to check that the received filter string contains the expected operation type and instance metadata and that orderBy equals the expected sort (e.g., "insertTime desc") and fail the test (return an error or call t.Fatalf) when they don’t match so regressions in query construction are detected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cloud/gcp/actuators/machine/reconciler.go`:
- Around line 742-749: The loop over op.Error.Errors panics if any slice element
is nil — modify the block in reconciler.go that iterates "for _, e := range
op.Error.Errors" to skip nil entries (if e == nil continue) and also defensively
ensure op.Error and op.Error.Errors are non-nil before iterating; then build
msgs using e.Code and e.Message as before and return the formatted error string.
This prevents a nil dereference while preserving the existing error-message
aggregation.
---
Nitpick comments:
In `@pkg/cloud/gcp/actuators/machine/reconciler_test.go`:
- Around line 197-203: The mock implementation for mockZoneOperationsList used
in tests should assert the incoming query args (filter and orderBy, and
optionally project/zone) to ensure latestVisibleInsertOperation() constructs the
correct query; update the mock callbacks (mockZoneOperationsList at the
instances around the given diffs and the ones at lines ~220 and ~1178) to check
that the received filter string contains the expected operation type and
instance metadata and that orderBy equals the expected sort (e.g., "insertTime
desc") and fail the test (return an error or call t.Fatalf) when they don’t
match so regressions in query construction are detected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 35e9b6bd-5ef3-4114-a95c-fbccf95ed61d
📒 Files selected for processing (4)
pkg/cloud/gcp/actuators/machine/reconciler.gopkg/cloud/gcp/actuators/machine/reconciler_test.gopkg/cloud/gcp/actuators/services/compute/computeservice.gopkg/cloud/gcp/actuators/services/compute/computeservice_mock.go
| for _, e := range op.Error.Errors { | ||
| if e.Message != "" { | ||
| msgs = append(msgs, e.Code+": "+e.Message) | ||
| } else { | ||
| msgs = append(msgs, e.Code) | ||
| } | ||
| } | ||
| return fmt.Errorf("GCP operation failed: %s", strings.Join(msgs, "; ")) |
There was a problem hiding this comment.
Guard against nil entries in operation error details.
Line 743 dereferences e unconditionally. If op.Error.Errors contains a nil element, this panics and masks the provider failure path.
💡 Suggested fix
func operationError(op *compute.Operation) error {
if op == nil || op.Error == nil || len(op.Error.Errors) == 0 {
return nil
}
msgs := make([]string, 0, len(op.Error.Errors))
for _, e := range op.Error.Errors {
+ if e == nil {
+ continue
+ }
if e.Message != "" {
msgs = append(msgs, e.Code+": "+e.Message)
} else {
msgs = append(msgs, e.Code)
}
}
+ if len(msgs) == 0 {
+ return fmt.Errorf("GCP operation failed: unknown provider-side error")
+ }
return fmt.Errorf("GCP operation failed: %s", strings.Join(msgs, "; "))
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/cloud/gcp/actuators/machine/reconciler.go` around lines 742 - 749, The
loop over op.Error.Errors panics if any slice element is nil — modify the block
in reconciler.go that iterates "for _, e := range op.Error.Errors" to skip nil
entries (if e == nil continue) and also defensively ensure op.Error and
op.Error.Errors are non-nil before iterating; then build msgs using e.Code and
e.Message as before and return the formatted error string. This prevents a nil
dereference while preserving the existing error-message aggregation.
|
@RadekManak: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Make provider-side instance creation failures visible during reconciliation by checking the returned insert operation and visible in-flight operations instead of relying only on instance lookup.
Summary by CodeRabbit
New Features
Bug Fixes
Tests