
Conversation

guettli (Collaborator) commented Nov 17, 2025:

Baremetal PR

Deprecated: This field is deprecated and is
going to be removed when support for v1beta1
will be dropped. Please see
https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md
for more details.

Related HCloud PR 1717

Summary

This PR removes all usage of the deprecated FailureReason and FailureMessage fields from the bare-metal (BM) provider. Status handling is now fully aligned with the newer CAPI conditions-based model.

The PR introduces the DeleteMachineSucceededCondition condition, which is set instead of FailureReason and FailureMessage.

If this condition is set, remediation does not try to reboot the machine. Remediation stops immediately, and CAPI deletes the CAPI machine. This triggers the deletion of the hbmm, which in turn triggers the deprovisioning of the hbmh.
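A minimal sketch of that check (variable names assumed; the actual snippets are quoted in the review below): the remediation controller bails out before the reboot path when the condition is False.

	// Sketch: skip the reboot path when the hbmm is marked for deletion.
	deleteCondition := conditions.Get(hbmMachine, infrav1.DeleteMachineSucceededCondition)
	if deleteCondition != nil && deleteCondition.Status == corev1.ConditionFalse {
		// Do not reboot. CAPI will delete the Machine, which deletes the hbmm
		// and in turn deprovisions the hbmh.
		return reconcile.Result{}, nil
	}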

Why

Upstream CAPI is deprecating these fields. Removing them keeps our provider consistent with the updated status API and prevents reliance on legacy status patterns.

Changes

  • BareMetalMachine.SetFailure() --> s.scope.SetErrorAndRemediate(). This sets the clusterv1.RemediateMachineAnnotation on the capiMachine (see the sketch after this list).
  • During remediation, the controller checks whether the remediation was started by SetErrorAndRemediate(). If so, no reboot is done; remediation stops so that the machine gets deleted.
  • Removed all references to FailureReason and FailureMessage in BM controllers and API types.
  • Adjusted tests to match the new status behavior.
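A rough sketch of the shape of SetErrorAndRemediate() as described above (the receiver, field names, and message handling are assumptions, not a verbatim copy of the PR):

	// SetErrorAndRemediate records why the machine must go away and asks CAPI
	// to remediate (and thereby delete) the owning Machine.
	func (s *BareMetalMachineScope) SetErrorAndRemediate(reason, message string) {
		// Mark the new condition so the remediation controller skips the reboot.
		conditions.MarkFalse(s.BareMetalMachine, infrav1.DeleteMachineSucceededCondition,
			reason, clusterv1.ConditionSeverityError, "%s", message)

		// Setting this annotation triggers CAPI remediation of the Machine.
		if s.Machine.Annotations == nil {
			s.Machine.Annotations = map[string]string{}
		}
		s.Machine.Annotations[clusterv1.RemediateMachineAnnotation] = ""
	}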

Base automatically changed from tg/pre--remove-failure-reason--baremetal--on-top-main to main December 3, 2025 12:04
const (
// RemediationSucceededCondition is:
// - False when the corresponding CAPI Machine has the "cluster.x-k8s.io/remediate-machine" annotation set and will be remediated by CAPI soon.
// - True otherwise.
Contributor (janiskemper):

It shouldn't be True otherwise. IMO it should only be True after remediation succeeded; otherwise it should not be set at all.

Collaborator Author (guettli):

OK, what about deleting the condition 6 hours after it got set to True?

Contributor (janiskemper):

I think setting it to True is fine once it actually got remediated. Then we can think about what you propose, but that would be an optimization we could do later IMO. However, what I really would not like, and I think we currently do it, is to set the condition to True even though we never even remediated the machine.

Collaborator Author (guettli):

I have thought about that again. With yesterday's change, the condition either does not exist or it is False. Eventually the CAPI machine and the infra machine get deleted, which means the condition is gone.

I renamed it to DeleteMachineSucceededCondition.

Using "remediation" in this context is just confusing. It is only a step toward the goal: delete the CAPI and infra machine. If possible, I would delete these machines directly, but that is not supported.

I updated the names and the docs to make this clearer.

Is it easier to understand now?

ctx = ctrl.LoggerInto(ctx, log)

remediateConditionOfHbmm := conditions.Get(hetznerBareMetalMachine, infrav1.NoRemediateMachineAnnotationCondition)
if remediateConditionOfHbmm != nil && remediateConditionOfHbmm.Status == corev1.ConditionFalse {
Contributor (janiskemper):

One question: does it work properly in the case where the hbmh is mirroring from the hbmm? Because in the other direction the hbmm is already mirroring from the hbmh.

remediateConditionOfHbmm := conditions.Get(hbmMachine, infrav1.RemediationSucceededCondition)
if remediateConditionOfHbmm != nil && remediateConditionOfHbmm.Status == corev1.ConditionFalse {
// The hbmm will be deleted. Do not reconcile it.
log.Info("hbmm has RemediationSucceededCondition=False. Waiting for deletion")
Contributor (janiskemper):

The condition is also set to False if it is still in progress, no? I don't understand this code snippet.

Collaborator Author (guettli):

@janiskemper maybe we should rename it to DeleteInfraMachineSucceeded, because that is the intention: we do not want to reboot the machine.

This will be done via remediation, but that is just the path which needs to be followed, because the controller can't delete the CAPI Machine itself. Remediation is not really the intention; the machine should get deleted.

Does this make more sense now?

Contributor (janiskemper):

I didn't understand it yet. We have the RemediationSucceeded condition, which is set to False with the reason "remediation ongoing". At least this is how we do it on the hbmh. Do we have some different logic here on the hbmm?

Collaborator Author (guettli):

Please have a look at the new docstring of DeleteMachineSucceededCondition. Is it clearer now?

return reconcile.Result{}, s.setOwnerRemediatedConditionToFailed(ctx,
"exit remediation because hbmm has no host annotation")
}
// if SetErrorAndRemediate() was used to stop provisioning, do not try to reboot server
Contributor (janiskemper):

I don't understand the name of the function. SetErrorAndRemediate sounds to me as if you explicitly want the remediation flow to trigger. However, you instead do the opposite. Shouldn't we rename this function so that readers are not led in the wrong direction?

If I understand correctly, you want to set an error so that remediation via reboot is skipped, right?

Collaborator Author (guettli):

Yes, SetErrorAndDeleteMachine() might be a better name. Do you agree?

Contributor (janiskemper):

Yes. Just to confirm again: you want to use this function in the cases where a remediation via reboot should not be done?

Collaborator Author (guettli):

yes

guettli (Collaborator Author) commented Dec 3, 2025:

One question: does it work properly in the case where the hbmh is mirroring from the hbmm? Because in the other direction the hbmm is already mirroring from the hbmh.

@janiskemper

This condition is only mirrored from hbmm to hbmh, so I think this is fine.

@guettli guettli requested a review from janiskemper December 3, 2025 13:43
@guettli guettli changed the base branch from main to v1.1.x December 5, 2025 09:14
}

deleteConditionOfHbmm := conditions.Get(hbmMachine, infrav1.DeleteMachineSucceededCondition)
if deleteConditionOfHbmm != nil && deleteConditionOfHbmm.Status == corev1.ConditionFalse {
Contributor (janiskemper):

For deprovisioning we have to reconcile it, no?

Collaborator Author (guettli):

@janiskemper

This is the Reconcile of the hbmm:

	if !hbmMachine.ObjectMeta.DeletionTimestamp.IsZero() {
		return r.reconcileDelete(ctx, machineScope)
	}

	deleteConditionOfHbmm := conditions.Get(hbmMachine, infrav1.DeleteMachineSucceededCondition)
	if deleteConditionOfHbmm != nil && deleteConditionOfHbmm.Status == corev1.ConditionFalse {
		// The hbmm will be deleted. Do not reconcile it.
		log.Info("hbmm has DeleteMachineSucceededCondition=False. Waiting for deletion")
		return reconcile.Result{}, nil
	}

For the machine I think the current way (in the PR) is fine. Deletion is handled before the line you mentioned.

I added some docs to the hbmh, so the process of how the hbmm triggers deprovisioning of the hbmh is clearer:

	// InstallImage is the configuration that is used for the autosetup configuration for installing
	// an OS via InstallImage. The field has an additional use: when a hbmm gets deleted, the
	// hbmm controller sets this field of the hbmh to nil. This indicates to the hbmh controller
	// that deprovisioning should be started.
	//
	//  +optional
	InstallImage *InstallImage `json:"installImage,omitempty"`

ok now?
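For illustration only, a hypothetical sketch of the delete path just described; the field path on the host object and the client/receiver names are assumptions, not taken from the PR:

	// On hbmm deletion, the hbmm controller clears InstallImage on the hbmh so
	// that the hbmh controller starts deprovisioning (see the doc comment above).
	// The field path host.Spec.Status.InstallImage is an assumption here.
	host.Spec.Status.InstallImage = nil
	if err := r.Client.Update(ctx, host); err != nil {
		return reconcile.Result{}, fmt.Errorf("failed to update hbmh: %w", err)
	}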

)

const (
// DeleteMachineSucceededCondition is False when the capi (and infra) machine should be deleted.
Contributor (janiskemper):

I now understood from the updated code how you do it. You set the remediate annotation on the CAPI machine, and set the condition here so that the remediation controller doesn't try to reboot.

That you propagate the condition value from the hbmm to the hbmh doesn't seem to have any use, because all reasons for setting this condition come from the hbmh itself. Therefore, propagating the hbmm condition to the hbmh seems redundant, because the source of the information is already in the status of the hbmh.

All we need is to set an annotation on the CAPI machine and make the remediation controller know that it shouldn't reboot.

A condition doesn't make sense to me here, because it is supposed to give a user information on the state of the object. The user has that information anyway, without this new condition.

Instead, we can just add a second annotation that is set on the hbmm and which tells the remediation controller not to reboot.

IMO that is all we need and enough to remove the new condition.

Collaborator Author (guettli):

This is the general way two CRDs/controllers interact. Imagine we have resources 1 and 2.

Reconcile pseudocode for CRD1, done by controller1:

  • reads the spec of (own) resource1
  • reads the status of (other) resource2
  • tries to implement the desired state
  • updates the status of (own) resource1

Annotations are usually not used for that.

An example of an exception: the LB Service on HCloud needs additional information, but the resource is a generic Kubernetes resource. So the implementation expects some extra data in annotations, so that the particular controller can be used via the generic resource (like Service).

Look at how the remediation controller signals CAPI that the work is finished: via a condition.

We should follow best practices and use a condition, not an annotation.
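A minimal sketch of that pattern in Go (all type and condition names here are illustrative, not from this codebase):

	// Controller1 reconciles resource1 and consumes resource2's state only
	// through its status conditions; the outcome is published on resource1.
	func (r *Controller1Reconciler) reconcileNormal(res1 *v1.Resource1, res2 *v1.Resource2) {
		if conditions.IsFalse(res2, v1.SomethingReadyCondition) {
			conditions.MarkFalse(res1, v1.ReadyCondition, "WaitingForResource2",
				clusterv1.ConditionSeverityInfo, "resource2 is not ready yet")
			return
		}
		conditions.MarkTrue(res1, v1.ReadyCondition)
	}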

Contributor (janiskemper):

You don't want to have anything in the status though; you want to give a direct order to the remediation controller. This is done in the same way via the annotation on the Machine object. The problem with this condition is that it is very opaque: it is not understandable for users or readers of the code what it is supposed to do. I didn't understand it even while reviewing your PR the first few times. That's why I would like to have something more explicit and easier to understand. If you don't like the annotation, then I propose to not do anything new but to re-use the existing information.

For example, we have the permanent errors. With them, we can easily integrate the logic to not restart any server where the hbmh has a permanent error. Same with the maintenance mode: we can also skip the reboot in that case.

The only cases that I see in your PR are maintenance mode and permanent errors. Both pieces of information are already available on the host.

What we do right now is first propagating the information to the hbmm, where we don't really use it, and then bringing it back to the hbmh, where it comes from, and using it there in the remediation controller.

However, it would be very easy to use that information directly in the remediation controller (which is already the case for the maintenance mode).

Collaborator Author (guettli):

I am fine with not mirroring the condition to the host. This is not needed.

The remediation controller reads the condition from the hbmm, not the host.

Remediation must not use hbmh, because hbmm can exist without having a hbmh. Same for hcloud.

I currently see no alternative to using a new condition on hbmm.

If you want to, I can remove the mirroring to hbmh. What do you think?

Contributor (janiskemper):

We currently fetch the hbmh in the remediation because that is necessary in the process. Therefore we can use the information in the hbmh. I don't understand why you say that you don't see an alternative to a new condition on the hbmm. Can you explain how you want to get rid of fetching the hbmh in the remediation controller? Right now that is done. And as soon as we fetch the hbmh, we have all the information available.

Contributor (janiskemper):

Look at this:

We have the Machine object already in the hcloud remediation. We can therefore easily read the annotation. This would be a few lines of code, and we would have what we want to achieve, without adding new conditions. Same as with the hbmh code.
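Roughly what that would look like (the annotation key is hypothetical; it is not an existing CAPI or provider annotation):

	// Hypothetical annotation that tells the remediation controller to skip the
	// reboot and hand the machine straight back to CAPI for deletion.
	const skipRebootAnnotation = "infrastructure.cluster.x-k8s.io/skip-reboot-remediation"

	// The CAPI Machine is already available in the remediation controller.
	if _, ok := capiMachine.Annotations[skipRebootAnnotation]; ok {
		// No reboot; let CAPI delete the Machine.
		return reconcile.Result{}, nil
	}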

Collaborator Author (guettli), Dec 10, 2025:

@janiskemper, yes, the CAPI machine is in the local cache of the controller, so getting it is easy.

But the annotation on the CAPI machine is about remediation. It is a general-purpose annotation which we use to trigger CAPI. We need to differentiate between normal remediation and "please delete that machine".

We could "invent" our own annotation and add it to the CAPI Machine. But I would prefer to use a condition.

I don't understand why you do not want to add a Condition. Can you please elaborate on why you think it is a bad idea?

Contributor (janiskemper):

Yes, sorry, I got confused with the bare-metal case, where we have that information ready in the host. You agree that we don't need a condition there, right?

Contributor (janiskemper):

For hcloud I propose to use the existing ServerAvailableCondition, which is used for very similar cases, and to introduce a specific reason explaining that the server has a failure and will get deleted. This explains better to users what is happening and has the same effect as the new condition.
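Roughly what that proposal would look like (the reason constant is an assumption, not an existing one in the codebase):

	// Mark the existing ServerAvailableCondition with a dedicated reason so
	// users see why the server will not come back and the machine gets deleted.
	conditions.MarkFalse(hcloudMachine, infrav1.ServerAvailableCondition,
		"ServerHasFatalError", // hypothetical reason constant
		clusterv1.ConditionSeverityError,
		"server has a permanent failure; the machine will be deleted")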

Collaborator Author (guettli):

You agree that we don't need a condition there, right?

No. I would like to handle hcloud and bare metal in the same way. A new condition would solve that. I do not understand why you do not want that.

janiskemper (Contributor) commented:

Also: please update the PR description. Write down your approach and the idea behind it, not only the individual small code changes. The PR description also seems to be outdated and shows changes that have been done in other PRs.

@guettli guettli requested a review from janiskemper December 8, 2025 15:08
guettli (Collaborator Author) commented Dec 8, 2025:

Also: please update the PR description. Write down your approach and the idea behind it, not only the individual small code changes. The PR description also seems to be outdated and shows changes that have been done in other PRs.

@janiskemper the PR description was updated. Please check again.

