feat: add validating admission webhook mechanism for WekaCluster (OP-83)#2468
Conversation
Dependency Review — ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
Claude finished @caspx's task in 3m 14s

Code Review: Validating Admission Webhook for WekaCluster (OP-83)
Overall this is a well-structured skeleton. The cert management is solid, the multi-replica conflict handling in VWC is thoughtful, the fail-open default is appropriate, and the update-filtering design (allowing pre-existing violations) is a good UX decision. A few real issues to address before this goes to production.

Issues

1. SAN mismatch not detected on restart

```go
// isCertValid should also verify the cert covers the current service name:
sanOK := false
for _, san := range cert.DNSNames {
	if san == fmt.Sprintf("%s.%s.svc", m.config.ServiceName, m.namespace) {
		sanOK = true // SANs match
		break
	}
}
```
2. `Create` fails hard on `AlreadyExists`

```go
if !secretExists {
	if createErr := m.client.Create(ctx, newSecret); createErr != nil {
		return nil, fmt.Errorf("failed to create cert secret: %w", createErr) // ← AlreadyExists crashes the setup
	}
}
```

The operator currently runs with 1 replica, so this is low-risk today. But given that …
3. A more robust approach compares errors on a structural key rather than the message string:

```go
type errKey struct{ field, errType, detail string }
// key := errKey{e.Field, string(e.Type), e.Detail}
```

Suggestions

4. No runtime cert rotation — certs are checked/rotated only at startup. A deployment that restarts near expiry rotates fine; the real gap is that a cert issued today expires in 1 year, and without a restart the cert goes invalid while the operator is running. Consider a background goroutine that periodically re-runs the startup certificate check (`EnsureCertificates`).

5. An unknown `failurePolicy` value is only logged at Info level; it should be logged as an error:

```go
// current:
m.logger.Info("unknown webhook failurePolicy value, falling back to Ignore", ...)
// should be:
m.logger.Error(nil, "unknown webhook failurePolicy value, falling back to Ignore", ...)
```

6. …

Nitpicks

7. The comment warns this marker is not used by controller-gen for VWC generation. Consider prefixing it differently (e.g., …).

8. Acknowledged in the comment ("follow-up PRs"). Fine for a skeleton, but worth linking to the follow-up ticket so reviewers can track when actual rules land.

9. Makefile RBAC generation:

```make
$(CONTROLLER_GEN) rbac:roleName=... paths="./internal/controllers/..." paths="./internal/webhooks/..."
```
What's well done
Graphite Automations: "Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR (04/20/26) — 3 reviewers were added based on Anton Bykov's automation.
Pull request overview
Adds the foundational plumbing for a validating admission webhook for WekaCluster, including runtime-managed certificates/Secret, runtime-created ValidatingWebhookConfiguration, and a controller-runtime validator wired into the manager and Helm chart (feature-flagged off by default).
Changes:
- Introduces `internal/webhooks/` with a `WebhookManager` (cert/Secret + VWC lifecycle) and a `WekaCluster` validating webhook scaffold (including update-tolerant "new violations only" filtering).
- Wires conditional webhook server creation and webhook bootstrap into `cmd/manager/main.go`.
- Extends the Helm chart with webhook values, Service, env wiring, and RBAC generation updates.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/webhooks/wekacluster_webhook.go | Adds WekaCluster validating webhook implementation and error/warning filtering helpers. |
| internal/webhooks/wekacluster_webhook_test.go | Unit tests for filterNewErrors / filterNewWarnings. |
| internal/webhooks/webhook_manager.go | Implements cert generation, Secret reconcile, VWC create/update, and cleanup logic. |
| internal/webhooks/webhook_manager_test.go | Unit tests for cert validity logic, cert generation, and webhook path constant. |
| internal/config/env.go | Adds webhook-related configuration and env parsing defaults. |
| cmd/manager/main.go | Conditionally configures webhook server and runs webhook bootstrap during manager startup. |
| charts/weka-operator/values.yaml | Adds Helm values for webhook enablement and basic settings. |
| charts/weka-operator/templates/webhook-service.yaml | Adds a conditional Service to expose the webhook server. |
| charts/weka-operator/templates/role.yaml | Grants RBAC for managing validatingwebhookconfigurations. |
| charts/weka-operator/templates/manager.yaml | Exposes webhook env vars and container port when enabled. |
| Makefile | Includes webhook package in RBAC generation inputs. |
```go
func setupWebhook(ctx context.Context, mgr ctrl.Manager, webhookEnabled bool, logger logr.Logger) {
	directClient, err := client.New(mgr.GetConfig(), client.Options{Scheme: mgr.GetScheme()})
	if err != nil {
		logger.Error(err, "Failed to create direct client for webhook setup")
		os.Exit(1)
	}
	webhookMgr := webhooks.NewWebhookManager(directClient, config.Config.Webhook, config.Config.OperatorPodNamespace, logger)

	if !webhookEnabled {
		cleanupCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
		defer cancel()
		webhookMgr.CleanupIfExists(cleanupCtx)
		return
	}

	if err := webhookMgr.EnsureCertificates(ctx); err != nil {
		logger.Error(err, "Failed to ensure webhook certificates")
		os.Exit(1)
	}

	if err := webhooks.SetupWekaClusterWebhookWithManager(mgr); err != nil {
		logger.Error(err, "unable to create webhook", "webhook", "WekaCluster")
		os.Exit(1)
```
setupWebhook runs before mgr.Start(), so it will execute in every replica before leader election is acquired. That means multiple pods can concurrently create/regenerate the cert Secret and update the ValidatingWebhookConfiguration, which can lead to CA bundle / serving cert mismatches (and even operator exit if a create hits AlreadyExists). Consider ensuring only the elected leader performs Secret/VWC mutation (and non-leaders only read/write certs to disk), or otherwise make the Secret/VWC reconcile fully race-safe under concurrent startup.
```go
	now := time.Now()
	caTemplate := &x509.Certificate{
		SerialNumber: caSerial,
		Subject: pkix.Name{
			CommonName:   "weka-operator-webhook-ca",
			Organization: []string{"weka.io"},
		},
		NotBefore:             now,
		NotAfter:              now.Add(365 * 24 * time.Hour),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
		BasicConstraintsValid: true,
	}

	caCertDER, err := x509.CreateCertificate(rand.Reader, caTemplate, caTemplate, &caKey.PublicKey, caKey)
	if err != nil {
		return nil, nil, nil, fmt.Errorf("failed to create CA certificate: %w", err)
	}

	caCert, err := x509.ParseCertificate(caCertDER)
	if err != nil {
		return nil, nil, nil, fmt.Errorf("failed to parse CA certificate: %w", err)
	}

	// Generate leaf key
	leafKey, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, nil, fmt.Errorf("failed to generate leaf key: %w", err)
	}

	leafSerial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
	if err != nil {
		return nil, nil, nil, fmt.Errorf("failed to generate leaf serial: %w", err)
	}

	svcName := m.config.ServiceName
	ns := m.namespace
	dnsNames := []string{
		svcName,
		fmt.Sprintf("%s.%s.svc", svcName, ns),
		fmt.Sprintf("%s.%s.svc.cluster.local", svcName, ns),
	}

	leafTemplate := &x509.Certificate{
		SerialNumber: leafSerial,
		Subject: pkix.Name{
			CommonName:   fmt.Sprintf("%s.%s.svc", svcName, ns),
			Organization: []string{"weka.io"},
		},
		DNSNames:  dnsNames,
		NotBefore: now,
		NotAfter:  now.Add(365 * 24 * time.Hour),
		KeyUsage:  x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
```
generateCertificates sets NotBefore to time.Now() for both CA and leaf certificates. In clusters with clock skew (API server vs. webhook pod), the cert can be considered "not yet valid" and TLS handshakes may fail until clocks converge. Backdate NotBefore by a small buffer (e.g., a few minutes or an hour) to make startup resilient to skew.
```go
			return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
		}
	} else {
		secret.Data = newSecret.Data
		secret.Type = newSecret.Type
		if secret.Labels == nil {
			secret.Labels = map[string]string{}
		}
		for k, v := range newSecret.Labels {
			secret.Labels[k] = v
		}
		if updateErr := m.client.Update(ctx, secret); updateErr != nil {
			return nil, fmt.Errorf("failed to update cert secret: %w", updateErr)
		}
```
ensureCertSecret isn't safe under concurrent startup: if two pods observe NotFound, one Create will succeed and the other will fail with AlreadyExists; similarly, Update can hit resourceVersion conflicts. Since setup runs before leader election, this can crash non-leader pods and/or rotate certs unexpectedly. Handle AlreadyExists by re-Get/reuse the existing Secret, and wrap Update in RetryOnConflict (or use Patch with optimistic locking) to make the reconcile idempotent under contention.
Suggested change:

```go
		if !apierrors.IsAlreadyExists(createErr) {
			return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
		}
		existingSecret := &corev1.Secret{}
		if regetErr := m.client.Get(ctx, secretKey, existingSecret); regetErr != nil {
			return nil, fmt.Errorf("failed to get concurrently created cert secret: %w", regetErr)
		}
		existingTLSCert, ok1 := existingSecret.Data["tls.crt"]
		existingTLSKey, ok2 := existingSecret.Data["tls.key"]
		existingCACert, ok3 := existingSecret.Data["ca.crt"]
		if ok1 && ok2 && ok3 {
			valid, reason := m.isCertValid(existingTLSCert)
			if valid {
				m.logger.Info("Using concurrently created webhook cert secret", "secret", m.config.SecretName)
				if writeErr := m.writeCertsToDisk(existingTLSCert, existingTLSKey); writeErr != nil {
					return nil, writeErr
				}
				return existingCACert, nil
			}
			return nil, fmt.Errorf("concurrently created cert secret is invalid: %s", reason)
		}
		return nil, fmt.Errorf("concurrently created cert secret is missing required keys")
	}
} else {
	var reusedCA []byte
	var reusedTLSCert []byte
	var reusedTLSKey []byte
	updateErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		currentSecret := &corev1.Secret{}
		if err := m.client.Get(ctx, secretKey, currentSecret); err != nil {
			return err
		}
		currentTLSCert, ok1 := currentSecret.Data["tls.crt"]
		currentTLSKey, ok2 := currentSecret.Data["tls.key"]
		currentCACert, ok3 := currentSecret.Data["ca.crt"]
		if ok1 && ok2 && ok3 {
			valid, _ := m.isCertValid(currentTLSCert)
			if valid {
				reusedCA = currentCACert
				reusedTLSCert = currentTLSCert
				reusedTLSKey = currentTLSKey
				return nil
			}
		}
		currentSecret.Data = newSecret.Data
		currentSecret.Type = newSecret.Type
		if currentSecret.Labels == nil {
			currentSecret.Labels = map[string]string{}
		}
		for k, v := range newSecret.Labels {
			currentSecret.Labels[k] = v
		}
		return m.client.Update(ctx, currentSecret)
	})
	if updateErr != nil {
		return nil, fmt.Errorf("failed to update cert secret: %w", updateErr)
	}
	if reusedCA != nil {
		m.logger.Info("Using concurrently updated webhook cert secret", "secret", m.config.SecretName)
		if writeErr := m.writeCertsToDisk(reusedTLSCert, reusedTLSKey); writeErr != nil {
			return nil, writeErr
		}
		return reusedCA, nil
	}
```
```yaml
            - name: WEBHOOK_SERVICE_NAME
              value: "{{ .Values.prefix }}-webhook-service"
            - name: WEBHOOK_NAME
              value: "{{ .Values.prefix }}-validating-webhook-configuration"
```
Helm sets WEBHOOK_SERVICE_NAME and WEBHOOK_NAME based on .Values.prefix, but WEBHOOK_SECRET_NAME remains the hard-coded default ("weka-operator-webhook-server-cert") from env.go. This can cause name collisions across multiple Helm releases in the same namespace and makes the Secret name inconsistent with the prefixed Service/VWC names. Consider adding .Values.webhook.secretName (defaulting to a prefix-based name) and wiring it into WEBHOOK_SECRET_NAME here.
Suggested change:

```yaml
              value: "{{ .Values.prefix }}-validating-webhook-configuration"
            - name: WEBHOOK_SECRET_NAME
              value: "{{ default (printf "%s-webhook-server-cert" .Values.prefix) .Values.webhook.secretName }}"
```
Force-pushed e8ecb06 to 512f17e.
Force-pushed 512f17e to 6e96fa6.
Claude finished @caspx's task in 1m 49s

Code Review Update

Good fixes landed since the last review. Here's the updated status:

Fixed ✅
- Issue #1 — SAN validation in `isCertValid`
- Issue #3 — structural error comparison
- Issue #5 — `failurePolicy` logging level
- Copilot clock-skew comment (backdated `NotBefore`)

Still Open

1. This is unchanged from the previous review. With one replica it won't trigger, but it's inconsistent with the `AlreadyExists` handling attempted in `ensureWebhookConfiguration`:

```go
if createErr := m.client.Create(ctx, newSecret); createErr != nil {
	if !apierrors.IsAlreadyExists(createErr) {
		return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
	}
	// Another replica won the race — re-fetch and reuse its cert
	if regetErr := m.client.Get(ctx, secretKey, secret); regetErr != nil {
		return nil, fmt.Errorf("failed to get concurrently created cert secret: %w", regetErr)
	}
	// ... validate and use secret.Data
}
```

2. …

Non-blocking reminder

No runtime cert rotation — certs are checked/rotated only at startup. A deployment running continuously will hit cert expiry after ~1 year without a restart. Worth a follow-up ticket for a background rotation goroutine.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
```go
	valid, reason := m.isCertValid(tlsCert)
	if valid {
		m.logger.Info("Using existing webhook cert secret", "secret", m.config.SecretName)
		if writeErr := m.writeCertsToDisk(tlsCert, tlsKey); writeErr != nil {
			return nil, writeErr
```
When reusing an existing cert Secret, isCertValid() only validates the leaf cert (expiry + SAN) and then returns the stored ca.crt without checking that the CA actually signs the leaf. If ca.crt and tls.crt ever get out of sync (manual edits, partial updates), the webhook TLS handshake will fail while this code still considers the Secret “valid”. Consider verifying the leaf against ca.crt (x509.Verify with a pool) before accepting the Secret, and regenerate if verification fails.
```go
		// Non-field error in the new set. If the full message didn't appear
		// in old, return the whole newErr to avoid silently dropping a
		// genuinely new violation.
		if _, existed := oldMessages[e.Error()]; !existed {
			return newErr
```
filterNewErrors() is documented as returning only the errors newly introduced by the update, but in the non-*field.Error branch it returns the entire newErr when it finds a new non-field error. If newErr is an aggregate containing both pre-existing field errors and a new non-field error, this will re-include the pre-existing errors in the rejection message (and makes the behavior differ from the doc). Consider collecting and returning only the new non-field error(s) (plus any new field errors) instead of returning newErr wholesale.
```yaml
            - name: WEBHOOK_ENABLED
              value: "{{ .Values.webhook.enabled }}"
            - name: WEBHOOK_PORT
              value: "{{ .Values.webhook.port }}"
            - name: WEBHOOK_CERT_DIR
```
The chart wires several webhook env vars unconditionally, but the PR description calls out “conditional env vars in manager.yaml”. If these are intended to be conditional, wrap this block in if .Values.webhook.enabled to avoid advertising webhook config when it’s off. Also, WEBHOOK_SECRET_NAME is supported in config/env.go but isn’t set here; without it, custom .Values.prefix installs in the same namespace will still share the same Secret name (risking collisions).
```yaml
  # -- Webhook server port
  port: 9443
  # -- Directory for TLS certificates
  certDir: "/tmp/k8s-webhook-server/serving-certs"
```
values.yaml exposes webhook.enabled/port/certDir/failurePolicy, but the operator also supports WEBHOOK_SECRET_NAME and uses a cluster-scoped WEBHOOK_NAME. Consider adding these (and possibly serviceName) to values.yaml so multi-release installs can avoid Secret/name collisions and users can customize all webhook-related resource names consistently via Helm.
Suggested change:

```yaml
  certDir: "/tmp/k8s-webhook-server/serving-certs"
  # -- Validating webhook configuration name
  name: "weka-operator-webhook"
  # -- Service name backing the webhook server
  serviceName: "weka-operator-webhook-service"
  # -- Secret name containing the webhook serving certificates
  secretName: "weka-operator-webhook-server-cert"
```
```go
// 1. Create a direct (non-cached) K8s client — mgr.GetClient() is cache-backed
//    and only usable after mgr.Start().
// 2. EnsureCertificates: check/generate cert Secret, write certs to disk, create
//    or update the ValidatingWebhookConfiguration with the CA bundle.
// 3. Register the WekaCluster validator with the manager.
//
// When disabled:
//   - Best-effort delete any leftover VWC from a previous enabled run. Runs on
//     STARTUP only, never on shutdown (deleting on every shutdown would wipe
//     validation during rolling updates).
func setupWebhook(ctx context.Context, mgr ctrl.Manager, webhookEnabled bool, logger logr.Logger) {
	directClient, err := client.New(mgr.GetConfig(), client.Options{Scheme: mgr.GetScheme()})
	if err != nil {
		logger.Error(err, "Failed to create direct client for webhook setup")
		os.Exit(1)
	}
	webhookMgr := webhooks.NewWebhookManager(directClient, config.Config.Webhook, config.Config.OperatorPodNamespace, logger)

	if !webhookEnabled {
		cleanupCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
		defer cancel()
		webhookMgr.CleanupIfExists(cleanupCtx)
		return
	}

	if err := webhookMgr.EnsureCertificates(ctx); err != nil {
		logger.Error(err, "Failed to ensure webhook certificates")
		os.Exit(1)
	}

	if err := webhooks.SetupWekaClusterWebhookWithManager(mgr); err != nil {
		logger.Error(err, "unable to create webhook", "webhook", "WekaCluster")
		os.Exit(1)
	}

	logger.Info("WekaCluster validating webhook enabled")
```
setupWebhook() is invoked before mgr.Start(), so it runs before controller-runtime leader election takes effect. In a multi-replica deployment, every pod will attempt Secret/VWC reconciliation and cert generation concurrently at startup, which can cause races and potentially mismatched CA bundles vs serving certs across pods. Consider gating EnsureCertificates/VWC reconciliation to the elected leader (or a dedicated Lease lock), while letting all replicas serve using the shared Secret.
Suggested change:

```go
// 1. Register the WekaCluster validator with the manager on every replica so
//    each pod can serve webhook traffic.
// 2. Reconcile webhook certificates and the ValidatingWebhookConfiguration only
//    on the elected leader, after manager leader election is active.
//
// When disabled:
//   - Best-effort delete any leftover VWC from a previous enabled run. This is
//     also gated by leader election so only one replica performs the cleanup.
type leaderElectionRunnable struct {
	start func(context.Context) error
}

func (r leaderElectionRunnable) Start(ctx context.Context) error {
	return r.start(ctx)
}

func (leaderElectionRunnable) NeedLeaderElection() bool {
	return true
}

func setupWebhook(ctx context.Context, mgr ctrl.Manager, webhookEnabled bool, logger logr.Logger) {
	if webhookEnabled {
		if err := webhooks.SetupWekaClusterWebhookWithManager(mgr); err != nil {
			logger.Error(err, "unable to create webhook", "webhook", "WekaCluster")
			os.Exit(1)
		}
	}
	if err := mgr.Add(leaderElectionRunnable{
		start: func(runCtx context.Context) error {
			directClient, err := client.New(mgr.GetConfig(), client.Options{Scheme: mgr.GetScheme()})
			if err != nil {
				logger.Error(err, "Failed to create direct client for webhook setup")
				return err
			}
			webhookMgr := webhooks.NewWebhookManager(directClient, config.Config.Webhook, config.Config.OperatorPodNamespace, logger)
			if !webhookEnabled {
				cleanupCtx, cancel := context.WithTimeout(runCtx, 10*time.Second)
				defer cancel()
				webhookMgr.CleanupIfExists(cleanupCtx)
				<-runCtx.Done()
				return nil
			}
			if err := webhookMgr.EnsureCertificates(runCtx); err != nil {
				logger.Error(err, "Failed to ensure webhook certificates")
				return err
			}
			logger.Info("WekaCluster validating webhook enabled")
			<-runCtx.Done()
			return nil
		},
	}); err != nil {
		logger.Error(err, "failed to register leader-elected webhook setup runnable")
		os.Exit(1)
	}
}
```
```go
	if !secretExists {
		if createErr := m.client.Create(ctx, newSecret); createErr != nil {
			return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
```
ensureCertSecret() has a startup race when multiple operator replicas start at once: both can observe NotFound, generate different certs, and one Create() will fail with AlreadyExists (currently returned as an error). Even worse, if you later change this to Update-on-exists, it could overwrite the secret with a different CA/cert and break other pods. Handle IsAlreadyExists on create by re-GETing the Secret and using the stored certs (or retry) instead of failing/overwriting.
Suggested change:

```go
		if apierrors.IsAlreadyExists(createErr) {
			m.logger.Info("Webhook cert secret was created concurrently; reloading existing secret", "secret", m.config.SecretName)
			secret = &corev1.Secret{}
			if getErr := m.client.Get(ctx, client.ObjectKey{Name: m.config.SecretName, Namespace: m.namespace}, secret); getErr != nil {
				return nil, fmt.Errorf("cert secret was created concurrently but failed to re-fetch it: %w", getErr)
			}
		} else {
			return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
		}
	} else {
		secret = newSecret
```
```go
		if apierrors.IsAlreadyExists(createErr) {
			return createErr // trigger retry; another replica created it first
		}
		return fmt.Errorf("failed to create ValidatingWebhookConfiguration: %w", createErr)
	}
	m.logger.Info("Created ValidatingWebhookConfiguration", "name", m.config.WebhookName)
	return nil
```
ensureWebhookConfiguration() expects RetryOnConflict to retry when Create() returns AlreadyExists (another replica created the VWC first), but RetryOnConflict only retries on conflict errors; AlreadyExists typically won’t be retried and will bubble up, potentially crashing a replica on startup. Consider handling IsAlreadyExists explicitly by re-GETing the VWC and proceeding to the update path (or use a retry loop that retries on AlreadyExists as well).
Suggested change:

```go
		if !apierrors.IsAlreadyExists(createErr) {
			return fmt.Errorf("failed to create ValidatingWebhookConfiguration: %w", createErr)
		}
		if getErr := m.client.Get(ctx, client.ObjectKey{Name: m.config.WebhookName}, existing); getErr != nil {
			return fmt.Errorf("failed to get ValidatingWebhookConfiguration after concurrent create: %w", getErr)
		}
	} else {
		m.logger.Info("Created ValidatingWebhookConfiguration", "name", m.config.WebhookName)
		return nil
	}
```
Force-pushed 6e96fa6 to 1a076ac.
Claude finished @caspx's task in 2m 10s

No new commits since the previous review — the branch still has one commit (…).

Still Open

1. Unchanged. If two replicas both observe `NotFound`, one `Create` will fail with `AlreadyExists` and crash setup.

2. Unchanged.

3. The comment says "trigger retry; another replica created it first", but `RetryOnConflict` only retries conflict errors, so `AlreadyExists` bubbles up. Handle it explicitly:

```go
if createErr := m.client.Create(ctx, desired); createErr != nil {
	if !apierrors.IsAlreadyExists(createErr) {
		return fmt.Errorf("failed to create ValidatingWebhookConfiguration: %w", createErr)
	}
	// Another replica won the race — re-fetch and fall through to update
	if getErr := m.client.Get(ctx, client.ObjectKey{Name: m.config.WebhookName}, existing); getErr != nil {
		return fmt.Errorf("failed to get VWC after concurrent create: %w", getErr)
	}
}
```

Suggestion

4. When reusing an existing Secret, verify the stored leaf is actually signed by the stored CA:

```go
pool := x509.NewCertPool()
pool.AppendCertsFromPEM(caCert)
if _, err := cert.Verify(x509.VerifyOptions{Roots: pool}); err != nil {
	return false, fmt.Sprintf("cert not signed by stored CA: %v", err)
}
```

Non-blocking reminder

No runtime cert rotation — certs rotate on restart only. A deployment running continuously will hit expiry after ~1 year. Worth a follow-up ticket for a background rotation goroutine (e.g., every 12 hours).

TL;DR

Ships the infrastructure for a validating admission webhook on WekaCluster. Hand-rolled crypto, zero external deps, VWC created at runtime, feature-flagged off by default, `failurePolicy: Ignore`.

What changed?

- `internal/webhooks/` — cert generation + Secret reconcile + VWC CRUD + cleanup, plus the validator with update-tolerant error filtering (pre-existing violations don't block operator updates)
- `cmd/manager/main.go` — `setupWebhook` helper, registers the validator synchronously before `mgr.Start()`
- Helm chart — webhook values, Service, and env wiring in `manager.yaml`, defaults `webhook.enabled=false` / `failurePolicy=Ignore`

How to test?

```
go test ./internal/webhooks/...
```