
Conversation

guettli (Collaborator) commented Nov 17, 2025:

Baremetal PR

Deprecated: This field is deprecated and is
going to be removed when support for v1beta1
will be dropped. Please see
https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md
for more details.

Related HCloud PR 1717

Summary

This PR removes all usage of the deprecated FailureReason and FailureMessage fields from the bare-metal (BM) provider. Status handling is now fully aligned with the newer CAPI conditions-based model.

The PR introduces the DeleteMachineSucceededCondition condition, which is set instead of FailureReason and FailureMessage.

If this condition is set, remediation does not try to reboot the machine. Remediation stops immediately, and CAPI deletes the CAPI machine. This triggers the deletion of the hbmm, which in turn triggers the deprovisioning of the hbmh.
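A minimal sketch of that check (variable names assumed; the actual snippets are quoted in the review below): the remediation controller bails out before the reboot path when the condition is False.

	// Sketch: skip the reboot path when the hbmm is marked for deletion.
	deleteCondition := conditions.Get(hbmMachine, infrav1.DeleteMachineSucceededCondition)
	if deleteCondition != nil && deleteCondition.Status == corev1.ConditionFalse {
		// Do not reboot. CAPI will delete the Machine, which deletes the hbmm
		// and in turn deprovisions the hbmh.
		return reconcile.Result{}, nil
	}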

Why

Upstream CAPI is deprecating these fields. Removing them keeps our provider consistent with the updated status API and prevents reliance on legacy status patterns.

Changes

  • BareMetalMachine.SetFailure() --> s.scope.SetErrorAndRemediate(). This sets the clusterv1.RemediateMachineAnnotation on the capiMachine (see the sketch after this list).
  • During remediation, the controller checks whether the remediation was started by SetErrorAndRemediate(). If so, no reboot is done; remediation stops so that the machine gets deleted.
  • Removed all references to FailureReason and FailureMessage in BM controllers and API types.
  • Adjusted tests to match the new status behavior.
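A rough sketch of the shape of SetErrorAndRemediate() as described above (the receiver, field names, and message handling are assumptions, not a verbatim copy of the PR):

	// SetErrorAndRemediate records why the machine must go away and asks CAPI
	// to remediate (and thereby delete) the owning Machine.
	func (s *BareMetalMachineScope) SetErrorAndRemediate(reason, message string) {
		// Mark the new condition so the remediation controller skips the reboot.
		conditions.MarkFalse(s.BareMetalMachine, infrav1.DeleteMachineSucceededCondition,
			reason, clusterv1.ConditionSeverityError, "%s", message)

		// Setting this annotation triggers CAPI remediation of the Machine.
		if s.Machine.Annotations == nil {
			s.Machine.Annotations = map[string]string{}
		}
		s.Machine.Annotations[clusterv1.RemediateMachineAnnotation] = ""
	}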

Base automatically changed from tg/pre--remove-failure-reason--baremetal--on-top-main to main December 3, 2025 12:04
const (
// RemediationSucceededCondition is:
// - False when the corresponding CAPI Machine has the "cluster.x-k8s.io/remediate-machine" annotation set and will be remediated by CAPI soon.
// - True otherwise.
Contributor (janiskemper):

It shouldn't be True otherwise. IMO it should only be True after remediation succeeded; otherwise it should not be set at all.

Collaborator Author (guettli):

OK, what about deleting the condition 6 hours after it got set to True?

Contributor (janiskemper):

I think setting it to True is fine once it actually got remediated. Then we can think about what you propose, but that would be an optimization we could do later IMO. However, what I really would not like, and I think we currently do it, is to set the condition to True even though we never even remediated the machine.

Collaborator Author (guettli):

I have thought about that again. With yesterday's change, the condition either does not exist or it is False. Eventually the CAPI machine and the infra machine get deleted, which means the condition is gone.

I renamed it to DeleteMachineSucceededCondition.

Using "remediation" in this context is just confusing. It is only a step toward the goal: delete the CAPI and infra machine. If possible, I would delete these machines directly, but that is not supported.

I updated the names and the docs to make this clearer.

Is it easier to understand now?

ctx = ctrl.LoggerInto(ctx, log)

remediateConditionOfHbmm := conditions.Get(hetznerBareMetalMachine, infrav1.NoRemediateMachineAnnotationCondition)
if remediateConditionOfHbmm != nil && remediateConditionOfHbmm.Status == corev1.ConditionFalse {
Contributor (janiskemper):

One question: does it work properly in the case where the hbmh is mirroring from the hbmm? Because in the other direction the hbmm is already mirroring from the hbmh.

remediateConditionOfHbmm := conditions.Get(hbmMachine, infrav1.RemediationSucceededCondition)
if remediateConditionOfHbmm != nil && remediateConditionOfHbmm.Status == corev1.ConditionFalse {
// The hbmm will be deleted. Do not reconcile it.
log.Info("hbmm has RemediationSucceededCondition=False. Waiting for deletion")
Contributor (janiskemper):

The condition is also set to False if it is still in progress, no? I don't understand this code snippet.

Collaborator Author (guettli):

@janiskemper maybe we should rename it to DeleteInfraMachineSucceeded, because that is the intention: we do not want to reboot the machine.

This will be done via remediation, but that is just the path which needs to be followed, because the controller can't delete the CAPI Machine itself. Remediation is not really the intention; the machine should get deleted.

Does this make more sense now?

Contributor (janiskemper):

I didn't understand it yet. We have the RemediationSucceeded condition, which is set to False with the reason "remediation ongoing". At least this is how we do it on the hbmh. Do we have some different logic here on the hbmm?

Collaborator Author (guettli):

Please have a look at the new docstring of DeleteMachineSucceededCondition. Is it clearer now?

return reconcile.Result{}, s.setOwnerRemediatedConditionToFailed(ctx,
"exit remediation because hbmm has no host annotation")
}
// if SetErrorAndRemediate() was used to stop provisioning, do not try to reboot server
Contributor (janiskemper):

I don't understand the name of the function. SetErrorAndRemediate sounds to me as if you explicitly want the remediation flow to trigger. However, you instead do the opposite. Shouldn't we rename this function so that readers are not led in the wrong direction?

If I understand correctly, you want to set an error so that remediation via reboot is skipped, right?

Collaborator Author (guettli):

Yes, SetErrorAndDeleteMachine() might be a better name. Do you agree?

Contributor (janiskemper):

Yes. Just to confirm again: you want to use this function in the cases where a remediation via reboot should not be done?

Collaborator Author (guettli):

yes

guettli (Collaborator Author) commented Dec 3, 2025:

One question: does it work properly in the case where the hbmh is mirroring from the hbmm? Because in the other direction the hbmm is already mirroring from the hbmh.

@janiskemper

This condition is only mirrored from hbmm to hbmh, so I think this is fine.

@guettli guettli requested a review from janiskemper December 3, 2025 13:43
@guettli guettli changed the base branch from main to v1.1.x December 5, 2025 09:14
}

deleteConditionOfHbmm := conditions.Get(hbmMachine, infrav1.DeleteMachineSucceededCondition)
if deleteConditionOfHbmm != nil && deleteConditionOfHbmm.Status == corev1.ConditionFalse {
Contributor (janiskemper):

For deprovisioning we have to reconcile it, no?

Collaborator Author (guettli):

@janiskemper

This is the Reconcile of the hbmm:

	if !hbmMachine.ObjectMeta.DeletionTimestamp.IsZero() {
		return r.reconcileDelete(ctx, machineScope)
	}

	deleteConditionOfHbmm := conditions.Get(hbmMachine, infrav1.DeleteMachineSucceededCondition)
	if deleteConditionOfHbmm != nil && deleteConditionOfHbmm.Status == corev1.ConditionFalse {
		// The hbmm will be deleted. Do not reconcile it.
		log.Info("hbmm has DeleteMachineSucceededCondition=False. Waiting for deletion")
		return reconcile.Result{}, nil
	}

For the machine I think the current way (in the PR) is fine. Deletion is handled before the line you mentioned.

I added some docs to the hbmh, so the process of how the hbmm triggers deprovisioning of the hbmh is clearer:

	// InstallImage is the configuration that is used for the autosetup configuration for installing
	// an OS via InstallImage. The field has an additional use: when a hbmm gets deleted, the
	// hbmm controller sets this field of the hbmh to nil. This indicates to the hbmh controller
	// that deprovisioning should be started.
	//
	//  +optional
	InstallImage *InstallImage `json:"installImage,omitempty"`

ok now?
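For illustration only, a hypothetical sketch of the delete path just described; the field path on the host object and the client/receiver names are assumptions, not taken from the PR:

	// On hbmm deletion, the hbmm controller clears InstallImage on the hbmh so
	// that the hbmh controller starts deprovisioning (see the doc comment above).
	// The field path host.Spec.Status.InstallImage is an assumption here.
	host.Spec.Status.InstallImage = nil
	if err := r.Client.Update(ctx, host); err != nil {
		return reconcile.Result{}, fmt.Errorf("failed to update hbmh: %w", err)
	}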

)

const (
// DeleteMachineSucceededCondition is False when the capi (and infra) machine should be deleted.
Contributor (janiskemper):

I now understood from the updated code how you do it. You set the remediate annotation on the CAPI machine, and set the condition here so that the remediation controller doesn't try to reboot.

That you propagate the condition value from the hbmm to the hbmh doesn't seem to have any use, because all reasons for setting this condition come from the hbmh itself. Therefore, propagating the hbmm condition to the hbmh seems redundant, because the source of the information is already in the status of the hbmh.

All we need is to set an annotation on the CAPI machine and make the remediation controller know that it shouldn't reboot.

A condition doesn't make sense to me here, because it is supposed to give a user information on the state of the object. The user has that information anyway, without this new condition.

Instead, we can just add a second annotation that is set on the hbmm and which tells the remediation controller not to reboot.

IMO that is all we need and enough to remove the new condition.

Collaborator Author (guettli):

This is the general way two CRDs/controllers interact. Imagine we have resources 1 and 2.

Reconcile pseudocode for CRD1, done by controller1:

  • reads the spec of (own) resource1
  • reads the status of (other) resource2
  • tries to implement the desired state
  • updates the status of (own) resource1

Annotations are usually not used for that.

An example of an exception: the LB Service on HCloud needs additional information, but the resource is a generic Kubernetes resource. So the implementation expects some extra data in annotations, so that the particular controller can be used via the generic resource (like Service).

Look at how the remediation controller signals CAPI that the work is finished: via a condition.

We should follow best practices and use a condition, not an annotation.
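A minimal sketch of that pattern in Go (all type and condition names here are illustrative, not from this codebase):

	// Controller1 reconciles resource1 and consumes resource2's state only
	// through its status conditions; the outcome is published on resource1.
	func (r *Controller1Reconciler) reconcileNormal(res1 *v1.Resource1, res2 *v1.Resource2) {
		if conditions.IsFalse(res2, v1.SomethingReadyCondition) {
			conditions.MarkFalse(res1, v1.ReadyCondition, "WaitingForResource2",
				clusterv1.ConditionSeverityInfo, "resource2 is not ready yet")
			return
		}
		conditions.MarkTrue(res1, v1.ReadyCondition)
	}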

Contributor (janiskemper):

You don't want to have anything in the status though; you want to give a direct order to the remediation controller. This is done in the same way via the annotation on the Machine object. The problem with this condition is that it is very opaque: it is not understandable for users or readers of the code what it is supposed to do. I didn't understand it even while reviewing your PR the first few times. That's why I would like to have something more explicit and easier to understand. If you don't like the annotation, then I propose to not do anything new but to re-use the existing information.

For example, we have the permanent errors. With them, we can easily integrate the logic to not restart any server where the hbmh has a permanent error. Same with the maintenance mode: we can also skip the reboot in that case.

The only cases that I see in your PR are maintenance mode and permanent errors. Both pieces of information are already available on the host.

What we do right now is first propagating the information to the hbmm, where we don't really use it, and then bringing it back to the hbmh, where it comes from, and using it there in the remediation controller.

However, it would be very easy to use that information directly in the remediation controller (which is already the case for the maintenance mode).

Collaborator Author (guettli):

I am fine with not mirroring the condition to the host. This is not needed.

The remediation controller reads the condition from the hbmm, not the host.

Remediation must not use hbmh, because hbmm can exist without having a hbmh. Same for hcloud.

I currently see no alternative to using a new condition on hbmm.

If you want to, I can remove the mirroring to hbmh. What do you think?

Contributor (janiskemper):

We currently fetch the hbmh in the remediation because that is necessary in the process. Therefore we can use the information in the hbmh. I don't understand why you say that you don't see an alternative to a new condition on the hbmm. Can you explain how you want to get rid of fetching the hbmh in the remediation controller? Right now that is done. And as soon as we fetch the hbmh, we have all the information available.

Contributor (janiskemper):

Look at this:

We have the Machine object already in the hcloud remediation. We can therefore easily read the annotation. This would be a few lines of code, and we would have what we want to achieve, without adding new conditions. Same as with the hbmh code.
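Roughly what that would look like (the annotation key is hypothetical; it is not an existing CAPI or provider annotation):

	// Hypothetical annotation that tells the remediation controller to skip the
	// reboot and hand the machine straight back to CAPI for deletion.
	const skipRebootAnnotation = "infrastructure.cluster.x-k8s.io/skip-reboot-remediation"

	// The CAPI Machine is already available in the remediation controller.
	if _, ok := capiMachine.Annotations[skipRebootAnnotation]; ok {
		// No reboot; let CAPI delete the Machine.
		return reconcile.Result{}, nil
	}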

Collaborator Author (guettli), Dec 10, 2025:

@janiskemper, yes, the CAPI machine is in the local cache of the controller, so getting it is easy.

But the annotation on the CAPI machine is about remediation. It is a general-purpose annotation which we use to trigger CAPI. We need to differentiate between normal remediation and "please delete that machine".

We could "invent" our own annotation and add it to the CAPI Machine. But I would prefer to use a condition.

I don't understand why you do not want to add a Condition. Can you please elaborate on why you think it is a bad idea?

Contributor (janiskemper):

Yes, sorry, I got confused with the bare-metal case, where we have that information ready in the host. You agree that we don't need a condition there, right?

Contributor (janiskemper):

For hcloud I propose to use the existing ServerAvailableCondition, which is used for very similar cases, and to introduce a specific reason explaining that the server has a failure and will get deleted. This explains better to users what is happening and has the same effect as the new condition.
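Roughly what that proposal would look like (the reason constant is an assumption, not an existing one in the codebase):

	// Mark the existing ServerAvailableCondition with a dedicated reason so
	// users see why the server will not come back and the machine gets deleted.
	conditions.MarkFalse(hcloudMachine, infrav1.ServerAvailableCondition,
		"ServerHasFatalError", // hypothetical reason constant
		clusterv1.ConditionSeverityError,
		"server has a permanent failure; the machine will be deleted")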

Collaborator Author (guettli):

You agree that we don't need a condition there, right?

No. I would like to handle hcloud and bare metal in the same way. A new condition would solve that. I do not understand why you do not want that.

janiskemper (Contributor) commented:

Also: please update the PR description. Write down your approach and the idea behind it, not only the individual small code changes. The PR description also seems to be outdated and shows changes that have been done in other PRs.

@guettli guettli requested a review from janiskemper December 8, 2025 15:08
guettli (Collaborator Author) commented Dec 8, 2025:

Also: please update the PR description. Write down your approach and the idea behind it, not only the individual small code changes. The PR description also seems to be outdated and shows changes that have been done in other PRs.

@janiskemper the PR description was updated. Please check again.

