Add event recording and status conditions for worker deployments#203

Open

thearcticwatch wants to merge 14 commits into main from add_events

Conversation

@thearcticwatch

What changed: Added Kubernetes events and status conditions

(TemporalConnectionHealthy, RolloutReady) to the worker controller
reconciliation loop.

Why: Reconciliation failures were only visible in controller logs —
events and conditions let users diagnose issues directly via kubectl.

  1. Closes #28: Add events to the TemporalWorkerDeployment CRD when there is a problem

  2. How was this tested:
    added unit tests

  3. Any docs updates needed?
    N/A

@thearcticwatch thearcticwatch requested review from a team and jlegrone as code owners February 21, 2026 00:39
@CLAassistant
CLAassistant commented Feb 21, 2026

CLA assistant check
All committers have signed the CLA.

Collaborator

@carlydf carlydf left a comment

Also, `make fmt-imports` will solve some of your lint errors.

@thearcticwatch thearcticwatch enabled auto-merge (squash) February 21, 2026 14:14
Collaborator

@carlydf carlydf left a comment

looking good! just did initial review, we should still add a functional test once these comments are addressed.

I found https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#events and https://book.kubebuilder.io/reference/raising-events#creating-events helpful while reviewing.

Comment on lines +283 to +284
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionRolloutReady, metav1.ConditionTrue,
"RolloutSucceeded", "Target version rollout complete "+workerDeploy.Status.TargetVersion.BuildID)
Collaborator

It feels inconsistent to me to be saying "rollout ready", "rollout succeeded", and "rollout complete" all together. Maybe just "rollout complete" for all of them?

Also, I just realized, the condition as written will cause us to emit a ConditionRolloutReady event at the end of the first reconcile loop that meets this condition (which is correct), but also at the end of every subsequent reconcile loop, which means we will emit the same event every 30s.

I think if we instead emit this event in the actual updateVersionConfig function where we successfully set the Current Version, that would ensure that we only emit it once per completed rollout
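The once-per-rollout idea above can be sketched in a few lines. This is a self-contained illustration, not the controller's real code: `updateVersionConfig`, the recorder interface, and the version strings are all stand-ins. The point is to emit the event only on the transition, so a steady-state reconcile every 30s stays silent.

```go
package main

import "fmt"

// eventRecorder is a stand-in for the Kubernetes EventRecorder interface.
type eventRecorder interface {
	Eventf(reason, messageFmt string, args ...any)
}

type printRecorder struct{ count int }

func (p *printRecorder) Eventf(reason, messageFmt string, args ...any) {
	p.count++
	fmt.Printf("%s: %s\n", reason, fmt.Sprintf(messageFmt, args...))
}

// updateVersionConfig is a hypothetical helper: it sets the current version
// and emits RolloutComplete only when the version actually changes, so
// subsequent reconciles of an already-complete rollout emit nothing.
func updateVersionConfig(rec eventRecorder, current *string, target string) {
	if *current == target {
		return // already rolled out; nothing to emit
	}
	*current = target
	rec.Eventf("RolloutComplete", "Target version rollout complete %s", target)
}

func main() {
	rec := &printRecorder{}
	current := "v1"
	updateVersionConfig(rec, &current, "v2") // transition: emits once
	updateVersionConfig(rec, &current, "v2") // steady state: no event
	updateVersionConfig(rec, &current, "v2") // still no event
	fmt.Println("events emitted:", rec.count)
}
```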

Collaborator

I standardized this all to be "RolloutComplete" in the changes I pushed. Leaving comment open for visibility

Collaborator

Also, I just realized, the condition as written will cause us to emit a ConditionRolloutReady event at the end of the first reconcile loop that meets this condition (which is correct), but also at the end of every subsequent reconcile loop, which means we will emit the same event every 30s.

I think if we instead emit this event in the actual updateVersionConfig function where we successfully set the Current Version, that would ensure that we only emit it once per completed rollout

@thearcticwatch When I said the above, I was wrong about where the "write" happens. I thought moving setCondition within executePlan was meaningful for reducing API writes, but setCondition only modifies an in-memory Go struct and makes no network call at all. The actual write to the Kubernetes API server happens exclusively when r.Status().Update() is called. So what matters for write frequency isn't where setCondition is placed, but how many times Update() is called per reconcile. The real fix was to consolidate from two Status().Update() calls per successful reconcile down to one, by removing the intermediate write after generateStatus and doing a single write at the very end of the loop after executePlan completes.
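The write-consolidation fix described above can be sketched with a stand-in status writer (the `statusWriter`, `condition`, and `reconcile` shapes here are illustrative, not the controller's actual types): mutate conditions in memory as often as the loop needs, and pay for exactly one API write at the end.

```go
package main

import "fmt"

// condition mirrors the shape of a metav1.Condition, in-memory only.
type condition struct {
	Type, Status, Reason, Message string
}

// statusWriter is a stand-in for client.Status(); it counts API writes.
type statusWriter struct{ updates int }

func (w *statusWriter) Update(conds []condition) { w.updates++ }

// setCondition only mutates the in-memory slice; it makes no network call.
func setCondition(conds *[]condition, c condition) {
	for i := range *conds {
		if (*conds)[i].Type == c.Type {
			(*conds)[i] = c
			return
		}
	}
	*conds = append(*conds, c)
}

func reconcile(w *statusWriter) {
	var conds []condition
	// Set as many conditions as the loop needs, for free...
	setCondition(&conds, condition{Type: "TemporalConnectionHealthy", Status: "True"})
	setCondition(&conds, condition{Type: "RolloutComplete", Status: "True"})
	// ...then write to the API server exactly once, at the very end.
	w.Update(conds)
}

func main() {
	w := &statusWriter{}
	reconcile(w)
	fmt.Println("API writes per reconcile:", w.updates)
}
```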

Comment on lines +153 to +157
r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "TemporalConnectionNotFound",
"Unable to fetch TemporalConnection %q: %v", workerDeploy.Spec.WorkerOptions.TemporalConnectionRef.Name, err)
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy, metav1.ConditionFalse,
"TemporalConnectionNotFound", fmt.Sprintf("TemporalConnection %q not found: %v", workerDeploy.Spec.WorkerOptions.TemporalConnectionRef.Name, err))
_ = r.Status().Update(ctx, &workerDeploy)
Member

seems like this pattern of setting and recording events is repeated throughout the codebase; think we should clean this up by having a nice helper which could also make it look neater

wdyt

Collaborator

yes, but I wouldn't block this PR on that personally
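For reference, one shape such a helper could take. This is a sketch with stand-in types (the real signature would take the event recorder, the CRD object, and metav1 types); the name matches the `recordWarningAndSetCondition` helper that appears later in the thread, but the body here is illustrative only.

```go
package main

import "fmt"

type condition struct{ Type, Status, Reason, Message string }

type reconciler struct {
	events []string
	conds  []condition
}

// eventf stands in for record.EventRecorder.Eventf.
func (r *reconciler) eventf(eventType, reason, format string, args ...any) {
	r.events = append(r.events, reason+": "+fmt.Sprintf(format, args...))
}

// recordWarningAndSetCondition collapses the repeated
// "emit Warning event + set condition to False" pattern into one call.
func (r *reconciler) recordWarningAndSetCondition(condType, reason, format string, args ...any) {
	msg := fmt.Sprintf(format, args...)
	r.eventf("Warning", reason, "%s", msg)
	r.conds = append(r.conds, condition{
		Type: condType, Status: "False", Reason: reason, Message: msg,
	})
}

func main() {
	r := &reconciler{}
	r.recordWarningAndSetCondition(
		"TemporalConnectionHealthy", "TemporalConnectionNotFound",
		"TemporalConnection %q not found", "my-conn")
	fmt.Println(r.events[0])
	fmt.Println(r.conds[0].Status)
}
```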

)

func (r *TemporalWorkerDeploymentReconciler) executePlan(ctx context.Context, l logr.Logger, temporalClient sdkclient.Client, p *plan) error {
func (r *TemporalWorkerDeploymentReconciler) executeK8sOperations(ctx context.Context, l logr.Logger, workerDeploy *temporaliov1alpha1.TemporalWorkerDeployment, p *plan) error {
Member

any specific reason to rename this? i quite liked the approach of having these three files/steps where we first generate a status, make a plan and then act on it.

Collaborator

`func (r *TemporalWorkerDeploymentReconciler) executePlan(ctx context.Context, l logr.Logger, workerDeploy *temporaliov1alpha1.TemporalWorkerDeployment, temporalClient sdkclient.Client, p *plan) error` still exists (see below); it now just calls multiple helper functions in sequence, starting with executeK8sOperations. Apparently this refactor gets rid of the cognitive complexity linter error.

ConflictToken: vcfg.ConflictToken,
Identity: getControllerIdentity(),
}); err != nil {
r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "VersionRegistrationFailed",
Member

We should also log the error here, in addition to recording the event, imo. We have followed this elsewhere in the same file and in the code, and I think it could be good practice: if we do have a persistence failure, the logs would still be an option for an operator.

Collaborator

We are logging the error on line 166, no?

Comment on lines +268 to +276
r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanGenerationFailed",
"Unable to generate reconciliation plan: %v", err)
return ctrl.Result{}, err
}

// Execute the plan, handling any errors
if err := r.executePlan(ctx, l, temporalClient, plan); err != nil {
if err := r.executePlan(ctx, l, &workerDeploy, temporalClient, plan); err != nil {
r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanExecutionFailed",
"Unable to execute reconciliation plan: %v", err)
Member

any specific reason you chose to not persist failures for plan generation and execution phases but persisted when we do generate the status from the server?

Collaborator

do you mean, why just record an event and not also set a condition?
The only conditions we have right now are ConditionTemporalConnectionHealthy and ConditionRolloutComplete. ConditionRolloutComplete is set in executePlan right after the SetCurrent call succeeds (so that we don't do it on every iteration).

Ah actually, I'm not sure if doing

	// Mark TemporalConnection as valid since we fetched it and resolved auth
	r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy, metav1.ConditionTrue,
		temporaliov1alpha1.ReasonTemporalConnectionHealthy, "TemporalConnection is healthy and auth secret is resolved")

unconditionally on every iteration is good or bad. Will ask Claude

Collaborator

turns out, setCondition doesn't do the write, r.Status().Update() does, so my concern about setting ConditionTemporalConnectionHealthy on every iteration being potentially wasteful was wrong

Collaborator

@carlydf carlydf left a comment

really close from my perspective. will push a commit showing what I mean about the stricter string types for EventType and ConditionType.

Identity: getControllerIdentity(),
}); err != nil {
l.Error(err, "unable to set ramping deployment version", "buildID", vcfg.BuildID, "percentage", vcfg.RampPercentage)
r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "VersionRegistrationFailed",
Collaborator

Sorry to nitpick, but "version registration" sounds to me like what happens when a versioned poller first arrives, polling the server and creating a version record. Then, you could say the version has been "registered" because it exists. Anyway, "VersionRoutingConfigUpdateFailed" or "VersionPromotionFailed" or "RoutingConfigUpdateFailed" makes more sense to me. But open to hearing reasons for "VersionRegistrationFailed" being better!

I also don't want to "bikeshed" naming, but at the same time I want to choose names that reinforce existing terminology that we use elsewhere in the versioning world (i.e. "routing config" or "version promotion").

The other note I have is, we should consider making these event strings constants:

  • "VersionRegistrationFailed"
  • "MetadataUpdateFailed"
  • "TestWorkflowStartFailed"
  • "DeploymentXXXFailed"
  • "PlanXXXFailed"

so that it's easy to see a list of all the possible event types in one place, and so that in the cases where we emit the same event type in multiple places, there is no risk of spelling errors.
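The "stricter string type" suggestion could look like this (a sketch; `EventReason` and `recordWarning` are hypothetical names, and the constants use the strings listed above): a dedicated string type gives one place to list every reason, and function signatures taking that type reject misspelled bare strings at compile time.

```go
package main

import "fmt"

// EventReason restricts event reason strings to a known set of constants.
type EventReason string

const (
	ReasonVersionRegistrationFailed EventReason = "VersionRegistrationFailed"
	ReasonMetadataUpdateFailed      EventReason = "MetadataUpdateFailed"
	ReasonTestWorkflowStartFailed   EventReason = "TestWorkflowStartFailed"
)

// recordWarning accepts only EventReason, so a misspelled bare string
// (e.g. "MetdataUpdateFailed") fails to compile instead of shipping silently.
func recordWarning(reason EventReason, msg string) string {
	return fmt.Sprintf("Warning %s: %s", reason, msg)
}

func main() {
	fmt.Println(recordWarning(ReasonMetadataUpdateFailed, "unable to update deployment metadata"))
}
```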

Collaborator

@carlydf carlydf Mar 4, 2026

}, &temporalConnection); err != nil {
l.Error(err, "unable to fetch TemporalConnection")
r.recordWarningAndSetCondition(ctx, &workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy,
"TemporalConnectionNotFound",
Collaborator

similar note as above. I'd recommend using constants for these conditions: "TemporalConnectionNotFound", "AuthSecretInvalid", "TemporalConnectionHealthy", "TemporalClientCreationFailed", "TemporalStateFetchFailed".

What I see a lot in the Go SDK which I look to as a well-designed code base is if a string value is actually a strict subset of valid strings, make a special type for it, and then use that type to define the acceptable constants as well as the input values of functions. I can check out this branch and push a commit to show what I mean, if that's cool with you!

Collaborator

@carlydf carlydf Mar 4, 2026

carlydf and others added 3 commits March 3, 2026 15:33
"Registration" already has a meaning in Temporal versioning (a worker
polling for the first time creates a version record). "Promotion" better
describes setting a version as current or ramping, which moves it forward
in the rollout lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
}

// Persist any conditions set during plan execution (e.g. RolloutComplete).
_ = r.Status().Update(ctx, &workerDeploy)
Collaborator

@carlydf carlydf Mar 4, 2026

this was added because without it, the RolloutComplete condition was never being written back to the status, but it's wasteful, so I'm fixing that now

Collaborator

Comment on lines +163 to +164
// Note: as things are now, this will never happen, because getAPIKeySecretName only errors when secretRef == nil,
// but resolveAuthSecretName only calls it inside the tc.Spec.APIKeySecretRef != nil branch
Collaborator

@carlydf carlydf Mar 4, 2026

we should decide what to do here. We can refactor resolveAuthSecretName to not return an error, and then completely get rid of ReasonAuthSecretInvalid, or we could just leave it "just in case".

One issue is that because temporaliov1alpha1.ReasonAuthSecretInvalid is in the public API, even if it never happens, it could be weird or feel risky to remove later. Because of that, refactoring this stack now is probably best.

One idea for the refactor: instead of refactoring to never error, we could require people to opt-in to "No Auth Mode." Then, if people don't put any auth mode in, there would be an error that we would surface here. Tbh, that's probably the safest solution.
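A sketch of that opt-in idea (field names are illustrative: only APIKeySecretRef appears in the thread, and `NoAuth` is a hypothetical new field): require exactly one auth mode, so a spec with neither surfaces an error on this branch instead of the branch being unreachable.

```go
package main

import (
	"errors"
	"fmt"
)

type secretRef struct{ Name string }

// connectionSpec sketches the relevant slice of a TemporalConnection spec.
// NoAuth is a hypothetical explicit opt-in for unauthenticated connections.
type connectionSpec struct {
	APIKeySecretRef *secretRef
	NoAuth          bool
}

var errNoAuthMode = errors.New("no auth mode configured: set apiKeySecretRef or opt in with noAuth")

// resolveAuthSecretName errors unless the user picked an auth mode,
// making the previously "never happens" error branch reachable again.
func resolveAuthSecretName(spec connectionSpec) (string, error) {
	switch {
	case spec.APIKeySecretRef != nil:
		return spec.APIKeySecretRef.Name, nil
	case spec.NoAuth:
		return "", nil // explicitly unauthenticated
	default:
		return "", errNoAuthMode
	}
}

func main() {
	_, err := resolveAuthSecretName(connectionSpec{})
	fmt.Println(err != nil) // neither mode set: surfaced as an error
	name, _ := resolveAuthSecretName(connectionSpec{APIKeySecretRef: &secretRef{Name: "api-key"}})
	fmt.Println(name)
}
```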
