Add event recording and status conditions for worker deployments#203

Open

thearcticwatch wants to merge 14 commits into main from add_events

Conversation

@thearcticwatch

What changed: Added Kubernetes events and status conditions

(TemporalConnectionHealthy, RolloutReady) to the worker controller
reconciliation loop.

Why: Reconciliation failures were only visible in controller logs —
events and conditions let users diagnose issues directly via kubectl.

  1. Closes #28: Add events to the TemporalWorkerDeployment CRD when there is a problem

  2. How was this tested:
    added unit tests

  3. Any docs updates needed?
    N/A

@thearcticwatch thearcticwatch requested review from a team and jlegrone as code owners February 21, 2026 00:39
@CLAassistant
CLAassistant commented Feb 21, 2026

CLA assistant check
All committers have signed the CLA.

Collaborator

@carlydf carlydf left a comment

Also, `make fmt-imports` will solve some of your lint errors.

@thearcticwatch thearcticwatch enabled auto-merge (squash) February 21, 2026 14:14
Collaborator

@carlydf carlydf left a comment

looking good! just did initial review, we should still add a functional test once these comments are addressed.

I found https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#events and https://book.kubebuilder.io/reference/raising-events#creating-events helpful while reviewing.

Comment on lines +283 to +284
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionRolloutReady, metav1.ConditionTrue,
"RolloutSucceeded", "Target version rollout complete "+workerDeploy.Status.TargetVersion.BuildID)
Collaborator

It feels inconsistent to me to be saying "rollout ready", "rollout succeeded", and "rollout complete" all together. Maybe just "rollout complete" for all of them?

Also, I just realized, the condition as written will cause us to emit a ConditionRolloutReady event at the end of the first reconcile loop that meets this condition (which is correct), but also at the end of every subsequent reconcile loop, which means we will emit the same event every 30s.

I think if we instead emit this event in the actual updateVersionConfig function where we successfully set the Current Version, that would ensure that we only emit it once per completed rollout
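The once-per-rollout idea above can be sketched in a few lines. This is a self-contained illustration, not the controller's real code: `updateVersionConfig`, the recorder interface, and the version strings are all stand-ins. The point is to emit the event only on the transition, so a steady-state reconcile every 30s stays silent.

```go
package main

import "fmt"

// eventRecorder is a stand-in for the Kubernetes EventRecorder interface.
type eventRecorder interface {
	Eventf(reason, messageFmt string, args ...any)
}

type printRecorder struct{ count int }

func (p *printRecorder) Eventf(reason, messageFmt string, args ...any) {
	p.count++
	fmt.Printf("%s: %s\n", reason, fmt.Sprintf(messageFmt, args...))
}

// updateVersionConfig is a hypothetical helper: it sets the current version
// and emits RolloutComplete only when the version actually changes, so
// subsequent reconciles of an already-complete rollout emit nothing.
func updateVersionConfig(rec eventRecorder, current *string, target string) {
	if *current == target {
		return // already rolled out; nothing to emit
	}
	*current = target
	rec.Eventf("RolloutComplete", "Target version rollout complete %s", target)
}

func main() {
	rec := &printRecorder{}
	current := "v1"
	updateVersionConfig(rec, &current, "v2") // transition: emits once
	updateVersionConfig(rec, &current, "v2") // steady state: no event
	updateVersionConfig(rec, &current, "v2") // still no event
	fmt.Println("events emitted:", rec.count)
}
```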

Collaborator

I standardized this all to be "RolloutComplete" in the changes I pushed. Leaving comment open for visibility

Collaborator

Also, I just realized, the condition as written will cause us to emit a ConditionRolloutReady event at the end of the first reconcile loop that meets this condition (which is correct), but also at the end of every subsequent reconcile loop, which means we will emit the same event every 30s.

I think if we instead emit this event in the actual updateVersionConfig function where we successfully set the Current Version, that would ensure that we only emit it once per completed rollout

@thearcticwatch When I said the above, I was wrong about where the "write" happens. I thought moving setCondition within executePlan was meaningful for reducing API writes, but setCondition only modifies an in-memory Go struct and makes no network call at all. The actual write to the Kubernetes API server happens exclusively when r.Status().Update() is called. So what matters for write frequency isn't where setCondition is placed, but how many times Update() is called per reconcile. The real fix was to consolidate from two Status().Update() calls per successful reconcile down to one, by removing the intermediate write after generateStatus and doing a single write at the very end of the loop after executePlan completes.
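The write-consolidation fix described above can be sketched with a stand-in status writer (the `statusWriter`, `condition`, and `reconcile` shapes here are illustrative, not the controller's actual types): mutate conditions in memory as often as the loop needs, and pay for exactly one API write at the end.

```go
package main

import "fmt"

// condition mirrors the shape of a metav1.Condition, in-memory only.
type condition struct {
	Type, Status, Reason, Message string
}

// statusWriter is a stand-in for client.Status(); it counts API writes.
type statusWriter struct{ updates int }

func (w *statusWriter) Update(conds []condition) { w.updates++ }

// setCondition only mutates the in-memory slice; it makes no network call.
func setCondition(conds *[]condition, c condition) {
	for i := range *conds {
		if (*conds)[i].Type == c.Type {
			(*conds)[i] = c
			return
		}
	}
	*conds = append(*conds, c)
}

func reconcile(w *statusWriter) {
	var conds []condition
	// Set as many conditions as the loop needs, for free...
	setCondition(&conds, condition{Type: "TemporalConnectionHealthy", Status: "True"})
	setCondition(&conds, condition{Type: "RolloutComplete", Status: "True"})
	// ...then write to the API server exactly once, at the very end.
	w.Update(conds)
}

func main() {
	w := &statusWriter{}
	reconcile(w)
	fmt.Println("API writes per reconcile:", w.updates)
}
```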

Comment on lines +153 to +157
r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "TemporalConnectionNotFound",
"Unable to fetch TemporalConnection %q: %v", workerDeploy.Spec.WorkerOptions.TemporalConnectionRef.Name, err)
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy, metav1.ConditionFalse,
"TemporalConnectionNotFound", fmt.Sprintf("TemporalConnection %q not found: %v", workerDeploy.Spec.WorkerOptions.TemporalConnectionRef.Name, err))
_ = r.Status().Update(ctx, &workerDeploy)
Member

seems like this pattern of setting and recording events is repeated throughout the codebase; think we should clean this up by having a nice helper which could also make it look neater

wdyt

Collaborator

yes, but I wouldn't block this PR on that personally
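For reference, one shape such a helper could take. This is a sketch with stand-in types (the real signature would take the event recorder, the CRD object, and metav1 types); the name matches the `recordWarningAndSetCondition` helper that appears later in the thread, but the body here is illustrative only.

```go
package main

import "fmt"

type condition struct{ Type, Status, Reason, Message string }

type reconciler struct {
	events []string
	conds  []condition
}

// eventf stands in for record.EventRecorder.Eventf.
func (r *reconciler) eventf(eventType, reason, format string, args ...any) {
	r.events = append(r.events, reason+": "+fmt.Sprintf(format, args...))
}

// recordWarningAndSetCondition collapses the repeated
// "emit Warning event + set condition to False" pattern into one call.
func (r *reconciler) recordWarningAndSetCondition(condType, reason, format string, args ...any) {
	msg := fmt.Sprintf(format, args...)
	r.eventf("Warning", reason, "%s", msg)
	r.conds = append(r.conds, condition{
		Type: condType, Status: "False", Reason: reason, Message: msg,
	})
}

func main() {
	r := &reconciler{}
	r.recordWarningAndSetCondition(
		"TemporalConnectionHealthy", "TemporalConnectionNotFound",
		"TemporalConnection %q not found", "my-conn")
	fmt.Println(r.events[0])
	fmt.Println(r.conds[0].Status)
}
```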

)

func (r *TemporalWorkerDeploymentReconciler) executePlan(ctx context.Context, l logr.Logger, temporalClient sdkclient.Client, p *plan) error {
func (r *TemporalWorkerDeploymentReconciler) executeK8sOperations(ctx context.Context, l logr.Logger, workerDeploy *temporaliov1alpha1.TemporalWorkerDeployment, p *plan) error {
Member

any specific reason to rename this? i quite liked the approach of having these three files/steps where we first generate a status, make a plan and then act on it.

Collaborator

`func (r *TemporalWorkerDeploymentReconciler) executePlan(ctx context.Context, l logr.Logger, workerDeploy *temporaliov1alpha1.TemporalWorkerDeployment, temporalClient sdkclient.Client, p *plan) error` still exists (see below); it now just calls multiple helper functions in sequence, starting with executeK8sOperations. Apparently this refactor gets rid of the cognitive complexity linter error.

ConflictToken: vcfg.ConflictToken,
Identity: getControllerIdentity(),
}); err != nil {
r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "VersionRegistrationFailed",
Member

We should also log the error here, in addition to recording the event, imo. We have followed this elsewhere in the same file and in the code, and I think it could be good practice: if we do have a persistence failure, the logs would still be an option for an operator.

Collaborator

We are logging the error on line 166, no?

Comment on lines +268 to +276
r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanGenerationFailed",
"Unable to generate reconciliation plan: %v", err)
return ctrl.Result{}, err
}

// Execute the plan, handling any errors
if err := r.executePlan(ctx, l, temporalClient, plan); err != nil {
if err := r.executePlan(ctx, l, &workerDeploy, temporalClient, plan); err != nil {
r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanExecutionFailed",
"Unable to execute reconciliation plan: %v", err)
Member

any specific reason you chose to not persist failures for plan generation and execution phases but persisted when we do generate the status from the server?

Collaborator

do you mean, why just record an event and not also set a condition?
The only conditions we have right now are ConditionTemporalConnectionHealthy and ConditionRolloutComplete. ConditionRolloutComplete is set in executePlan right after the SetCurrent call succeeds (so that we don't do it on every iteration).

Ah actually, I'm not sure if doing

	// Mark TemporalConnection as valid since we fetched it and resolved auth
	r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy, metav1.ConditionTrue,
		temporaliov1alpha1.ReasonTemporalConnectionHealthy, "TemporalConnection is healthy and auth secret is resolved")

unconditionally on every iteration is good or bad. Will ask Claude

Collaborator

turns out, setCondition doesn't do the write, r.Status().Update() does, so my concern about setting ConditionTemporalConnectionHealthy on every iteration being potentially wasteful was wrong

Collaborator

@carlydf carlydf left a comment

really close from my perspective. will push a commit showing what I mean about the stricter string types for EventType and ConditionType.

Identity: getControllerIdentity(),
}); err != nil {
l.Error(err, "unable to set ramping deployment version", "buildID", vcfg.BuildID, "percentage", vcfg.RampPercentage)
r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "VersionRegistrationFailed",
Collaborator

Sorry to nitpick, but "version registration" sounds to me like what happens when a versioned poller first arrives, polling the server and creating a version record. Then, you could say the version has been "registered" because it exists. Anyway, "VersionRoutingConfigUpdateFailed" or "VersionPromotionFailed" or "RoutingConfigUpdateFailed" makes more sense to me. But open to hearing reasons for "VersionRegistrationFailed" being better!

I also don't want to "bikeshed" naming, but at the same time I want to choose names that reinforce existing terminology that we use elsewhere in the versioning world (i.e. "routing config" or "version promotion").

The other note I have is, we should consider making these event strings constants:

  • "VersionRegistrationFailed"
  • "MetadataUpdateFailed"
  • "TestWorkflowStartFailed"
  • "DeploymentXXXFailed"
  • "PlanXXXFailed"

so that it's easy to see a list of all the possible event types in one place, and so that in the cases where we emit the same event type in multiple places, there is no risk of spelling errors.
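The "stricter string type" suggestion could look like this (a sketch; `EventReason` and `recordWarning` are hypothetical names, and the constants use the strings listed above): a dedicated string type gives one place to list every reason, and function signatures taking that type reject misspelled bare strings at compile time.

```go
package main

import "fmt"

// EventReason restricts event reason strings to a known set of constants.
type EventReason string

const (
	ReasonVersionRegistrationFailed EventReason = "VersionRegistrationFailed"
	ReasonMetadataUpdateFailed      EventReason = "MetadataUpdateFailed"
	ReasonTestWorkflowStartFailed   EventReason = "TestWorkflowStartFailed"
)

// recordWarning accepts only EventReason, so a misspelled bare string
// (e.g. "MetdataUpdateFailed") fails to compile instead of shipping silently.
func recordWarning(reason EventReason, msg string) string {
	return fmt.Sprintf("Warning %s: %s", reason, msg)
}

func main() {
	fmt.Println(recordWarning(ReasonMetadataUpdateFailed, "unable to update deployment metadata"))
}
```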

Collaborator

@carlydf carlydf Mar 4, 2026

}, &temporalConnection); err != nil {
l.Error(err, "unable to fetch TemporalConnection")
r.recordWarningAndSetCondition(ctx, &workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy,
"TemporalConnectionNotFound",
Collaborator

similar note as above. I'd recommend using constants for these conditions: "TemporalConnectionNotFound", "AuthSecretInvalid", "TemporalConnectionHealthy", "TemporalClientCreationFailed", "TemporalStateFetchFailed".

What I see a lot in the Go SDK which I look to as a well-designed code base is if a string value is actually a strict subset of valid strings, make a special type for it, and then use that type to define the acceptable constants as well as the input values of functions. I can check out this branch and push a commit to show what I mean, if that's cool with you!

Collaborator

@carlydf carlydf Mar 4, 2026

carlydf and others added 3 commits March 3, 2026 15:33
"Registration" already has a meaning in Temporal versioning (a worker
polling for the first time creates a version record). "Promotion" better
describes setting a version as current or ramping, which moves it forward
in the rollout lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
}

// Persist any conditions set during plan execution (e.g. RolloutComplete).
_ = r.Status().Update(ctx, &workerDeploy)
Collaborator

@carlydf carlydf Mar 4, 2026

this was added because without it, the RolloutComplete condition was never being written back to the status, but it's wasteful, so I'm fixing that now

Collaborator

Comment on lines +163 to +164
// Note: as things are now, this will never happen, because getAPIKeySecretName only errors when secretRef == nil,
// but resolveAuthSecretName only calls it inside the tc.Spec.APIKeySecretRef != nil branch
Collaborator

@carlydf carlydf Mar 4, 2026

we should decide what to do here. We can refactor resolveAuthSecretName to not return an error, and then completely get rid of ReasonAuthSecretInvalid, or we could just leave it "just in case".

One issue is that because temporaliov1alpha1.ReasonAuthSecretInvalid is in the public API, even if it never happens, it could be weird or feel risky to remove later. Because of that, refactoring this stack now is probably best.

One idea for the refactor: instead of refactoring to never error, we could require people to opt-in to "No Auth Mode." Then, if people don't put any auth mode in, there would be an error that we would surface here. Tbh, that's probably the safest solution.
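A sketch of that opt-in idea (field names are illustrative: only APIKeySecretRef appears in the thread, and `NoAuth` is a hypothetical new field): require exactly one auth mode, so a spec with neither surfaces an error on this branch instead of the branch being unreachable.

```go
package main

import (
	"errors"
	"fmt"
)

type secretRef struct{ Name string }

// connectionSpec sketches the relevant slice of a TemporalConnection spec.
// NoAuth is a hypothetical explicit opt-in for unauthenticated connections.
type connectionSpec struct {
	APIKeySecretRef *secretRef
	NoAuth          bool
}

var errNoAuthMode = errors.New("no auth mode configured: set apiKeySecretRef or opt in with noAuth")

// resolveAuthSecretName errors unless the user picked an auth mode,
// making the previously "never happens" error branch reachable again.
func resolveAuthSecretName(spec connectionSpec) (string, error) {
	switch {
	case spec.APIKeySecretRef != nil:
		return spec.APIKeySecretRef.Name, nil
	case spec.NoAuth:
		return "", nil // explicitly unauthenticated
	default:
		return "", errNoAuthMode
	}
}

func main() {
	_, err := resolveAuthSecretName(connectionSpec{})
	fmt.Println(err != nil) // neither mode set: surfaced as an error
	name, _ := resolveAuthSecretName(connectionSpec{APIKeySecretRef: &secretRef{Name: "api-key"}})
	fmt.Println(name)
}
```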
