🌱 Remove usage of FailureReason and FailureMessage (baremetal) #1716
base: v1.1.x
Conversation
> Deprecated: This field is deprecated and is going to be removed when support for v1beta1 will be dropped. Please see https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md for more details.
…n controller would need to be changed.
Before, the test did not really test anything, because the state is none at the beginning (before provisioning), too.
api/v1beta1/conditions_const.go
Outdated
```go
const (
	// RemediationSucceededCondition is:
	// - False when the corresponding CAPI Machine has the "cluster.x-k8s.io/remediate-machine" annotation set and will be remediated by CAPI soon.
	// - True otherwise.
```
It shouldn't be True otherwise. IMO it should only be True after remediation actually succeeded; otherwise it should not be set at all.
OK, what about deleting the condition 6 hours after it got set to True?
I think setting it to True once the machine actually got remediated is fine. Then we can think about what you propose, but that would be an optimization we could do later IMO. However, what I really would not like (and I think we currently do it) is setting the condition to True even though we never remediated the machine at all.
I have thought about that again. With yesterday's change the condition either does not exist or is False. Eventually, the capi machine and the infra machine get deleted, which means the condition is gone.
I renamed it to DeleteMachineSucceededCondition.
Using "remediation" in this context is just confusing. This is just a step to get to the goal: Delete capi and infra machine. If possible, I would delete these machines directly. But this is not supported.
I updated names and the docs to make this more clear.
Is it easier to understand now?
```go
ctx = ctrl.LoggerInto(ctx, log)

remediateConditionOfHbmm := conditions.Get(hetznerBareMetalMachine, infrav1.NoRemediateMachineAnnotationCondition)
if remediateConditionOfHbmm != nil && remediateConditionOfHbmm.Status == corev1.ConditionFalse {
```
One question: does it work properly when the hbmh is mirroring from the hbmm? Because in the other direction the hbmm is already mirroring from the hbmh.
```go
remediateConditionOfHbmm := conditions.Get(hbmMachine, infrav1.RemediationSucceededCondition)
if remediateConditionOfHbmm != nil && remediateConditionOfHbmm.Status == corev1.ConditionFalse {
	// The hbmm will be deleted. Do not reconcile it.
	log.Info("hbmm has RemediationSucceededCondition=False. Waiting for deletion")
```
The condition is also set to False while remediation is still in progress, no? I don't understand this code snippet.
@janiskemper maybe we should rename it to DeleteInfraMachineSucceeded. Because that is the intention. We do not want to reboot the machine.
This will be done via remediation, but that is just the path which needs to be followed, because the controller can't delete the capi machine itself. Remediation is not really the intention; the machine should get deleted.
Does this make more sense now?
I didn't understand it yet. We have the RemediationSucceeded condition, which is set to False with the reason "remediation ongoing". At least this is how we do it on hbmh. Do we have some different logic here on hbmm?
Please have a look at the new docstring of DeleteMachineSucceededCondition. Is it more clear now?
```go
	return reconcile.Result{}, s.setOwnerRemediatedConditionToFailed(ctx,
		"exit remediation because hbmm has no host annotation")
}
// if SetErrorAndRemediate() was used to stop provisioning, do not try to reboot server
```
I don't understand the name of the function. SetErrorAndRemediate sounds to me as if you explicitly want this remediation to trigger. However, you instead do the opposite. Shouldn't we rename this function so that readers are not led in the wrong direction?
If I understand correctly, you want to set an error so that remediation via reboot is skipped, right?
Yes, SetErrorAndDeleteMachine() might be a better name. Do you agree?
yes. Just to confirm again: You wanna use this function in the cases when a remediation via reboot should not be done?
yes
This condition is only mirrored from hbmm to hbmh, so I think this is fine.
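For reference, a minimal sketch of how such mirroring can be done with the CAPI conditions util (the actual code in the PR may differ; the helper name here is hypothetical):

```go
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// mirrorCondition copies one condition from the hbmm onto the hbmh, if it is set.
// Sketch only; the condition type and the two objects are passed in by the caller.
func mirrorCondition(hbmMachine conditions.Getter, host conditions.Setter, t clusterv1.ConditionType) {
	if c := conditions.Get(hbmMachine, t); c != nil {
		conditions.Set(host, c)
	}
}
```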
```go
}

deleteConditionOfHbmm := conditions.Get(hbmMachine, infrav1.DeleteMachineSucceededCondition)
if deleteConditionOfHbmm != nil && deleteConditionOfHbmm.Status == corev1.ConditionFalse {
```
for deprovisioning we have to reconcile it no?
this is Reconcile of hbmm:
```go
if !hbmMachine.ObjectMeta.DeletionTimestamp.IsZero() {
	return r.reconcileDelete(ctx, machineScope)
}

deleteConditionOfHbmm := conditions.Get(hbmMachine, infrav1.DeleteMachineSucceededCondition)
if deleteConditionOfHbmm != nil && deleteConditionOfHbmm.Status == corev1.ConditionFalse {
	// The hbmm will be deleted. Do not reconcile it.
	log.Info("hbmm has DeleteMachineSucceededCondition=False. Waiting for deletion")
	return reconcile.Result{}, nil
}
```

For the machine I think the current way (in the PR) is fine. Delete will be handled before the line you mentioned.
I added some docs to hbmh, so the process of how hbmm triggers deprovisioning of hbmh is more clear:
```go
// InstallImage is the configuration that is used for the autosetup configuration for installing
// an OS via InstallImage. The field has an additional usage: When a hbmm gets deleted, then the
// hbmm controller sets this field of the hbmh to nil. This indicates to the hbmh controller that
// deprovisioning should be started.
//
// +optional
InstallImage *InstallImage `json:"installImage,omitempty"`
```

ok now?
```go
)

const (
	// DeleteMachineSucceededCondition is False when the capi (and infra) machine should be deleted.
```
I understood now in the updated code how you do it. You set the remediate annotation on the CAPI machine, and set the condition here so that the remediation controller doesn't try to reboot.
Propagating the condition value from the hbmm to the hbmh doesn't seem to have any use, because all reasons for setting this condition come from the hbmh itself; the source of the information is already in the status of the hbmh.
All we need is to set an annotation on the CAPI machine, and make the remediation controller know that it shouldn't reboot.
A condition doesn't make sense to me here, because it is supposed to give a user information on the state of the object. The user has that information anyway, without this new condition.
Instead, we can just add a second annotation that is set on hbmm and which gives the information to the remediation controller to not reboot.
IMO that is all we need and enough to remove the new condition.
This is the general way two CRDs/controllers interact. Imagine we have resource 1 and resource 2.
Reconcile pseudocode for CRD1, done by controller1:
- read the spec of (own) resource1
- read the status of (other) resource2
- try to implement the desired state
- update the status of (own) resource1
Annotations are usually not used for that.
An example of an exception: the LB Service on hcloud needs additional information, but the resource is a generic Kubernetes resource (like Service), so the implementation expects some extra data in annotations so that the particular controller can be used via the generic resource.
Look at how the remediation controller signals capi that the work is finished: Via a condition.
We should follow the best practices and use a condition, not an annotation.
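To make the pattern concrete, here is a minimal sketch using the CAPI conditions util (the condition and reason names are hypothetical, for illustration only):

```go
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// Hypothetical condition and reason names, for illustration only.
const (
	stepSucceededCondition clusterv1.ConditionType = "StepSucceeded"
	waitingForOtherReason                          = "WaitingForOtherResource"
)

// reconcileStep shows controller1's side of the pattern: it reads the status of the
// other resource (via a condition) and records its own progress as a condition on
// its own resource, instead of passing signals around in annotations.
func reconcileStep(ownResource conditions.Setter, otherResource conditions.Getter, otherReady clusterv1.ConditionType) {
	if conditions.IsTrue(otherResource, otherReady) {
		conditions.MarkTrue(ownResource, stepSucceededCondition)
		return
	}
	conditions.MarkFalse(ownResource, stepSucceededCondition,
		waitingForOtherReason, clusterv1.ConditionSeverityInfo, "waiting for the other resource")
}
```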
You don't want to have anything in the status though; you want to give a direct order to the remediation controller. This is done in the same way via the annotation on the Machine object. The problem with this condition is that it is very opaque: it is not understandable for users or readers of the code what it is supposed to do. I didn't understand it, even while reviewing your PR the first few times. That's why I would like to have something more explicit and easy to understand. If you don't like the annotation, then I propose to not do anything new but to re-use the existing information.
For example, we have the permanent errors. With them, we can easily integrate the logic to not restart any server whose hbmh has a permanent error. Same with the maintenance mode: we can also skip the reboot in that case.
The only cases that I see in your PR are for maintenance mode and for permanent errors. Both cases are already available on the host.
What we do right now is first propagating the information to the hbmm, where we don't really use it, then bringing it back to the hbmh where it came from, and then using it there in the remediation controller.
However, it would be very easy to use that information directly in the remediation controller (which is already the case for the maintenance mode).
I am fine with not mirroring the condition to the host. This is not needed.
The remediation controller reads the condition from the hbmm, not the host.
Remediation must not use hbmh, because hbmm can exist without having a hbmh. Same for hcloud.
I currently see no alternative to using a new condition on hbmm.
If you want to, I can remove the mirroring to hbmh. What do you think?
We currently fetch the hbmh in the remediation because that is necessary in the process. Therefore we can use the information in the hbmh. I don't understand why you say that you don't see an alternative to a new condition on hbmm. Can you explain how you want to get rid of fetching the hbmh in the remediation controller? Right now that is done, and as soon as we fetch the hbmh, we have all the information available.
Look at this:
// Fetch the Machine.
We have the Machine object already in the hcloud remediation. We can therefore easily read the annotation. This would be a few lines of code, and we have what we wanna achieve - without adding new conditions. Same as with the hbmh code.
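A rough sketch of those few lines (the annotation key and helper name are hypothetical, only to illustrate the idea):

```go
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// Hypothetical annotation that would tell the remediation controller to skip the reboot.
const skipRebootAnnotation = "infrastructure.cluster.x-k8s.io/skip-reboot-remediation"

// shouldSkipReboot checks the already fetched CAPI Machine for the annotation.
func shouldSkipReboot(machine *clusterv1.Machine) bool {
	_, ok := machine.GetAnnotations()[skipRebootAnnotation]
	return ok
}
```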
@janiskemper, yes the capi machine is in the local cache of the controller, getting it is easy.
But the annotation on the capi machine is about remediation. This is a general purpose annotation which we use to trigger capi. But we need to differentiate between normal remediation and "please delete that machine".
We could "invent" an own annotation and add it to the capi-machine. But I would like to prefer to use an annotation.
I don't understand why you do not want to add a Condition. Can you please elaborate why you think it is a bad idea?
Yes, sorry, I got confused with the case of baremetal, where we have that information ready in the host. You agree that we don't need a condition there right?
For hcloud I propose to use the existing condition "ServerAvailableCondition" which is used for very similar cases, and introduce a specific reason explaining that the server has a failure and will get deleted. This will explain better to users what is happening and has the same effect as the new condition.
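Roughly, that proposal could look like this (a sketch only; the reason constant, helper name, and import paths are assumptions, while ServerAvailableCondition is the existing condition mentioned above):

```go
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"

	infrav1 "github.com/syself/cluster-api-provider-hetzner/api/v1beta1"
)

// Hypothetical reason constant for illustration.
const serverHasFailureAndWillBeDeletedReason = "ServerHasFailureAndWillBeDeleted"

// markServerFailed records on the existing ServerAvailableCondition that the server
// has a failure and will get deleted, instead of introducing a new condition.
func markServerFailed(hcloudMachine conditions.Setter) {
	conditions.MarkFalse(hcloudMachine, infrav1.ServerAvailableCondition,
		serverHasFailureAndWillBeDeletedReason, clusterv1.ConditionSeverityError,
		"server has a failure and will be deleted")
}
```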
> You agree that we don't need a condition there right?
no. I would like to handle hcloud and bm in the same way. A new condition would solve that. I do not understand why you do not want that.
Also: please update the PR description. Write down your approach and the idea behind it, not only the individual small code changes. The PR description also seems to be outdated and shows changes that have been done in other PRs.
@janiskemper the PR description was updated. Please check again.
Baremetal PR
Related HCloud PR 1717
Summary
This PR removes all usage of the deprecated `FailureReason` and `FailureMessage` fields from the bare-metal (BM) provider. Status handling is now fully aligned with the newer CAPI conditions-based model.

The PR introduces the `DeleteMachineSucceededCondition` condition. It gets set instead of using `FailureReason` and `FailureMessage`. Remediation does not try to reboot the machine if this condition is set. Remediation is stopped immediately, and capi will delete the capi machine. This triggers the deletion of the hbmm, which triggers the deprovisioning of the hbmh.
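As a rough sketch of that flow, based on the discussion in this PR (an assumption, not the actual implementation; reason and message arguments are placeholders):

```go
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"

	infrav1 "github.com/syself/cluster-api-provider-hetzner/api/v1beta1"
)

// setErrorAndDeleteMachine marks the hbmm so that remediation skips the reboot and
// asks CAPI to remediate (and thereby delete) the corresponding Machine.
func setErrorAndDeleteMachine(capiMachine *clusterv1.Machine, hbmMachine conditions.Setter, reason, message string) {
	// Condition checked by Reconcile and by the remediation controller: do not reboot, wait for deletion.
	conditions.MarkFalse(hbmMachine, infrav1.DeleteMachineSucceededCondition,
		reason, clusterv1.ConditionSeverityError, "%s", message)

	// The remediate-machine annotation triggers CAPI remediation, which ends with the Machine being deleted.
	annotations := capiMachine.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[clusterv1.RemediateMachineAnnotation] = ""
	capiMachine.SetAnnotations(annotations)
}
```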
Why
Upstream CAPI is deprecating these fields. Removing them keeps our provider consistent with the updated status API and prevents reliance on legacy status patterns.
Changes
Removed `FailureReason` and `FailureMessage` in BM controllers and API types.