# Add event recording and status conditions for worker deployments (#203)
thearcticwatch wants to merge 14 commits into `main`
Conversation
carlydf left a comment:
Also, `make fmt-imports` will solve some of your lint errors.
carlydf left a comment:
Looking good! Just did an initial review; we should still add a functional test once these comments are addressed.
I found https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#events and https://book.kubebuilder.io/reference/raising-events#creating-events helpful while reviewing.
```go
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionRolloutReady, metav1.ConditionTrue,
	"RolloutSucceeded", "Target version rollout complete "+workerDeploy.Status.TargetVersion.BuildID)
```
It feels inconsistent to me to be saying "rollout ready", "rollout succeeded", and "rollout complete" all together. Maybe just "rollout complete" for all of them?
Also, I just realized: the condition as written will cause us to emit a ConditionRolloutReady event at the end of the first reconcile loop that meets this condition (which is correct), but also at the end of every subsequent reconcile loop, which means we would emit the same event every 30s.
I think if we instead emit this event in the actual updateVersionConfig function, where we successfully set the Current Version, that would ensure we only emit it once per completed rollout.
I standardized this all to be "RolloutComplete" in the changes I pushed. Leaving comment open for visibility
> Also, I just realized: the condition as written will cause us to emit a ConditionRolloutReady event at the end of the first reconcile loop that meets this condition (which is correct), but also at the end of every subsequent reconcile loop, which means we would emit the same event every 30s.
> I think if we instead emit this event in the actual updateVersionConfig function, where we successfully set the Current Version, that would ensure we only emit it once per completed rollout.
@thearcticwatch When I said the above, I was wrong about where the "write" happens. I thought moving setCondition within executePlan was meaningful for reducing API writes, but setCondition only modifies an in-memory Go struct and makes no network call at all. The actual write to the Kubernetes API server happens exclusively when r.Status().Update() is called. So what matters for write frequency isn't where setCondition is placed, but how many times Update() is called per reconcile. The real fix was to consolidate from two Status().Update() calls per successful reconcile down to one, by removing the intermediate write after generateStatus and doing a single write at the very end of the loop after executePlan completes.
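To make the distinction above concrete, here is a minimal, hypothetical model of the write pattern: condition setters only mutate an in-memory struct, and the stand-in for `r.Status().Update()` is the only simulated API-server write, issued once at the end of the loop. All type and function names here are illustrative, not the controller's real API.

```go
package main

import "fmt"

// condition and workerDeployment are simplified stand-ins for the real
// metav1.Condition and TemporalWorkerDeployment types.
type condition struct {
	Type, Reason string
	True         bool
}

type workerDeployment struct {
	conditions []condition
}

var apiWrites int // counts simulated API-server writes

// setCondition is purely in-memory: no network call happens here.
func setCondition(wd *workerDeployment, condType string, status bool, reason string) {
	for i := range wd.conditions {
		if wd.conditions[i].Type == condType {
			wd.conditions[i].True = status
			wd.conditions[i].Reason = reason
			return
		}
	}
	wd.conditions = append(wd.conditions, condition{Type: condType, Reason: reason, True: status})
}

// statusUpdate stands in for r.Status().Update(ctx, wd), the only real write.
func statusUpdate(wd *workerDeployment) { apiWrites++ }

func reconcile(wd *workerDeployment) {
	// generateStatus phase: conditions set in memory, no write yet.
	setCondition(wd, "TemporalConnectionHealthy", true, "TemporalConnectionHealthy")
	// executePlan phase: more in-memory changes, still no write.
	setCondition(wd, "RolloutComplete", true, "RolloutComplete")
	// Single write at the very end of the loop.
	statusUpdate(wd)
}

func main() {
	wd := &workerDeployment{}
	reconcile(wd)
	fmt.Println(apiWrites) // prints 1: one API write per reconcile
}
```

However many conditions change during a reconcile, the write count stays at one, which is the consolidation described above.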
```go
r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "TemporalConnectionNotFound",
	"Unable to fetch TemporalConnection %q: %v", workerDeploy.Spec.WorkerOptions.TemporalConnectionRef.Name, err)
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy, metav1.ConditionFalse,
	"TemporalConnectionNotFound", fmt.Sprintf("TemporalConnection %q not found: %v", workerDeploy.Spec.WorkerOptions.TemporalConnectionRef.Name, err))
_ = r.Status().Update(ctx, &workerDeploy)
```
Seems like this pattern of setting conditions and recording events is repeated throughout the codebase; I think we should clean it up with a helper, which would also make it look neater. Wdyt?
yes, but I wouldn't block this PR on that personally
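A minimal sketch of what such a helper could look like, using stand-in types for the real EventRecorder and the deployment's `status.Conditions` slice (the helper name mirrors `recordWarningAndSetCondition`, which later appears in the diff; everything else here is illustrative):

```go
package main

import "fmt"

type event struct{ reason, message string }

// fakeRecorder stands in for record.EventRecorder.
type fakeRecorder struct{ events []event }

func (r *fakeRecorder) Warning(reason, format string, args ...any) {
	r.events = append(r.events, event{reason, fmt.Sprintf(format, args...)})
}

// condition stands in for metav1.Condition.
type condition struct {
	Type, Reason, Message string
	True                  bool
}

// recordWarningAndSetCondition collapses the repeated four-line pattern
// (record event, set condition with the same reason/message) into one call.
func recordWarningAndSetCondition(rec *fakeRecorder, conds *[]condition,
	condType, reason, format string, args ...any) {
	rec.Warning(reason, format, args...)
	*conds = append(*conds, condition{
		Type:    condType,
		Reason:  reason,
		Message: fmt.Sprintf(format, args...),
		True:    false,
	})
}

func main() {
	rec := &fakeRecorder{}
	var conds []condition
	recordWarningAndSetCondition(rec, &conds,
		"TemporalConnectionHealthy", "TemporalConnectionNotFound",
		"TemporalConnection %q not found", "default/conn")
	fmt.Println(rec.events[0].reason, len(conds))
}
```

Each call site then shrinks to one statement, and the event reason and condition reason can never drift apart.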
```diff
-func (r *TemporalWorkerDeploymentReconciler) executePlan(ctx context.Context, l logr.Logger, temporalClient sdkclient.Client, p *plan) error {
+func (r *TemporalWorkerDeploymentReconciler) executeK8sOperations(ctx context.Context, l logr.Logger, workerDeploy *temporaliov1alpha1.TemporalWorkerDeployment, p *plan) error {
```
Any specific reason to rename this? I quite liked the approach of having these three files/steps where we first generate a status, make a plan, and then act on it.
`func (r *TemporalWorkerDeploymentReconciler) executePlan(ctx context.Context, l logr.Logger, workerDeploy *temporaliov1alpha1.TemporalWorkerDeployment, temporalClient sdkclient.Client, p *plan) error` still exists (see below); it now just calls multiple helper functions in sequence, starting with `executeK8sOperations`. Apparently this refactor also gets rid of the cognitive-complexity linter error.
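The shape of that refactor can be sketched as a thin orchestrator calling focused helpers in order, which is also what keeps each function under the cognitive-complexity lint threshold. `executeK8sOperations` and `updateVersionConfig` are named in this thread; the remaining step name is hypothetical, and the real signatures carry more parameters.

```go
package main

import "fmt"

type plan struct{}

// Focused helpers, each small enough to stay under the complexity threshold.
func executeK8sOperations(p *plan) error { return nil }
func updateVersionConfig(p *plan) error  { return nil }
func startTestWorkflows(p *plan) error   { return nil } // hypothetical step name

// executePlan survives as a thin orchestrator over the helpers.
func executePlan(p *plan) error {
	steps := []func(*plan) error{
		executeK8sOperations,
		updateVersionConfig,
		startTestWorkflows,
	}
	for _, step := range steps {
		if err := step(p); err != nil {
			return fmt.Errorf("execute plan: %w", err)
		}
	}
	return nil
}

func main() {
	if err := executePlan(&plan{}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("plan executed")
}
```

The generate-status / make-plan / act-on-it structure is preserved; only the "act on it" step is split internally.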
internal/controller/execplan.go (outdated diff):
```go
	ConflictToken: vcfg.ConflictToken,
	Identity:      getControllerIdentity(),
}); err != nil {
	r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "VersionRegistrationFailed",
```
We should also log the error here, in addition to recording the event, imo. We follow this pattern elsewhere in the same file and in the codebase, and I think it's good practice: if we do have a persistence failure, the logs are still an option for an operator.
We are logging the error on line 166, no?
```diff
 	r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanGenerationFailed",
 		"Unable to generate reconciliation plan: %v", err)
 	return ctrl.Result{}, err
 }

 // Execute the plan, handling any errors
-if err := r.executePlan(ctx, l, temporalClient, plan); err != nil {
+if err := r.executePlan(ctx, l, &workerDeploy, temporalClient, plan); err != nil {
 	r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanExecutionFailed",
 		"Unable to execute reconciliation plan: %v", err)
```
Any specific reason you chose not to persist failures for the plan generation and execution phases, but did persist when we generate the status from the server?
do you mean, why just record an event and not also set a condition?
The only conditions we have right now are ConditionTemporalConnectionHealthy and ConditionRolloutComplete. ConditionRolloutComplete is set in execplan right after the SetCurrent call succeeds (so that we don't do it on every iteration).
Ah actually, I'm not sure whether doing

```go
// Mark TemporalConnection as valid since we fetched it and resolved auth
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy, metav1.ConditionTrue,
	temporaliov1alpha1.ReasonTemporalConnectionHealthy, "TemporalConnection is healthy and auth secret is resolved")
```

unconditionally on every iteration is good or bad. Will ask Claude.
Turns out setCondition doesn't do the write; r.Status().Update() does. So my concern about setting ConditionTemporalConnectionHealthy on every iteration being potentially wasteful was wrong.
carlydf left a comment:
Really close from my perspective. Will push a commit showing what I mean about the stricter string types for EventType and ConditionType.
internal/controller/execplan.go (outdated diff):
```go
	Identity: getControllerIdentity(),
}); err != nil {
	l.Error(err, "unable to set ramping deployment version", "buildID", vcfg.BuildID, "percentage", vcfg.RampPercentage)
	r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "VersionRegistrationFailed",
```
Sorry to nitpick, but "version registration" sounds to me like what happens when a versioned poller first arrives, polling the server and creating a version record. Then you could say the version has been "registered" because it exists. Anyway, "VersionRoutingConfigUpdateFailed" or "VersionPromotionFailed" or "RoutingConfigUpdateFailed" makes more sense to me. But open to hearing reasons for "VersionRegistrationFailed" being better! And I also don't want to "bikeshed" naming, but at the same time want to choose names that reinforce existing terminology we use elsewhere in the versioning world (i.e. "routing config" or "version promotion").
The other note I have is, we should consider making these event strings:

- "VersionRegistrationFailed"
- "MetadataUpdateFailed"
- "TestWorkflowStartFailed"
- "DeploymentXXXFailed"
- "PlanXXXFailed"

constants, so that it's easy to see a list of all the possible event types in one place, and so that in the cases where we emit the same event type in multiple places, there is no risk of spelling errors.
```go
}, &temporalConnection); err != nil {
	l.Error(err, "unable to fetch TemporalConnection")
	r.recordWarningAndSetCondition(ctx, &workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy,
		"TemporalConnectionNotFound",
```
similar note as above. I'd recommend using constants for these conditions: "TemporalConnectionNotFound", "AuthSecretInvalid", "TemporalConnectionHealthy", "TemporalClientCreationFailed", "TemporalStateFetchFailed".
What I see a lot in the Go SDK which I look to as a well-designed code base is if a string value is actually a strict subset of valid strings, make a special type for it, and then use that type to define the acceptable constants as well as the input values of functions. I can check out this branch and push a commit to show what I mean, if that's cool with you!
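A small sketch of that strict-string-subset pattern, under the assumption that reasons get their own type with the acceptable constants defined alongside it; `recordWarning` is a hypothetical helper, not the controller's real API:

```go
package main

import "fmt"

// EventReason restricts event/condition reason strings to a known set.
type EventReason string

// All valid reasons live in one place, so call sites can't drift or typo.
// The names are drawn from the strings discussed in this thread.
const (
	ReasonVersionPromotionFailed     EventReason = "VersionPromotionFailed"
	ReasonMetadataUpdateFailed       EventReason = "MetadataUpdateFailed"
	ReasonTestWorkflowStartFailed    EventReason = "TestWorkflowStartFailed"
	ReasonTemporalConnectionNotFound EventReason = "TemporalConnectionNotFound"
)

// recordWarning accepts only the typed reason, not a bare string variable.
func recordWarning(reason EventReason, message string) string {
	return fmt.Sprintf("Warning %s: %s", reason, message)
}

func main() {
	fmt.Println(recordWarning(ReasonMetadataUpdateFailed, "conflict token mismatch"))
	// Note: Go still implicitly converts untyped string literals to EventReason,
	// so a literal typo would compile; the wins are one canonical list,
	// grep-ability, and compile errors when a plain string variable is passed.
}
```

This mirrors how the Go SDK defines typed string enums and then uses the type in function signatures.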
"Registration" already has a meaning in Temporal versioning (a worker polling for the first time creates a version record). "Promotion" better describes setting a version as current or ramping, which moves it forward in the rollout lifecycle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```go
}

// Persist any conditions set during plan execution (e.g. RolloutComplete).
_ = r.Status().Update(ctx, &workerDeploy)
```
This was added because, without it, the RolloutComplete condition was never written back to the status. But it's wasteful, so I'm fixing that now.
Addressed in "only write to status once per reconcile loop".
```go
// Note: as things are now, this will never happen, because getAPIKeySecretName only errors when secretRef == nil,
// but resolveAuthSecretName only calls it inside the tc.Spec.APIKeySecretRef != nil branch
```
We should decide what to do here. We can either refactor resolveAuthSecretName to not return an error, and then completely get rid of ReasonAuthSecretInvalid, or just leave it "just in case".
One issue is that because temporaliov1alpha1.ReasonAuthSecretInvalid is in the public API, even if it never happens, it could be weird or feel risky to remove later. Because of that, refactoring this stack now is probably best.
One idea for the refactor: instead of refactoring to never error, we could require people to opt-in to "No Auth Mode." Then, if people don't put any auth mode in, there would be an error that we would surface here. Tbh, that's probably the safest solution.
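A sketch of that opt-in idea, with illustrative field names (`NoAuth`, `APIKeySecretRef` here are assumptions, not the actual TemporalConnection spec): instead of silently treating a missing secret ref as "no auth", require an explicit marker and surface an error when neither is set, which is exactly the case that would use ReasonAuthSecretInvalid.

```go
package main

import (
	"errors"
	"fmt"
)

// ConnectionSpec is a simplified stand-in for the TemporalConnection spec.
type ConnectionSpec struct {
	APIKeySecretRef *string // nil when no API key secret is referenced
	NoAuth          bool    // explicit opt-in to running without auth
}

var errNoAuthMode = errors.New("no auth configured: set apiKeySecretRef or opt in to noAuth")

// resolveAuthSecretName now has a reachable error branch: neither an API key
// secret nor an explicit no-auth opt-in was provided.
func resolveAuthSecretName(spec ConnectionSpec) (string, error) {
	switch {
	case spec.APIKeySecretRef != nil:
		return *spec.APIKeySecretRef, nil
	case spec.NoAuth:
		return "", nil // explicitly unauthenticated
	default:
		return "", errNoAuthMode
	}
}

func main() {
	// Misconfigured: no secret ref and no explicit no-auth opt-in.
	if _, err := resolveAuthSecretName(ConnectionSpec{}); err != nil {
		fmt.Println(err)
	}
}
```

With this shape, ReasonAuthSecretInvalid stays meaningful in the public API instead of guarding dead code.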
**What changed:** Added Kubernetes events and status conditions (TemporalConnectionHealthy, RolloutReady) to the worker controller reconciliation loop.

**Why:** Reconciliation failures were only visible in controller logs; events and conditions let users diagnose issues directly via kubectl.

Closes #28 (Add events to the TemporalWorkerDeployment CRD when there is a problem)

**How was this tested:** added unit tests

**Any docs updates needed?** N/A