Skip to content

feat: add validating admission webhook mechanism for WekaCluster (OP-83)#2468

Open
caspx wants to merge 1 commit intomainfrom
admission_controller_shared_drives_clean
Open

feat: add validating admission webhook mechanism for WekaCluster (OP-83)#2468
caspx wants to merge 1 commit intomainfrom
admission_controller_shared_drives_clean

Conversation

@caspx
Copy link
Copy Markdown
Contributor

@caspx caspx commented Apr 20, 2026

TL;DR

Ships the infrastructure for a validating admission webhook on WekaCluster.
Hand-rolled crypto, zero external deps, VWC created at runtime, feature-flagged off by default, failurePolicy: Ignore.

What changed?

  • New internal/webhooks/ — cert generation + Secret reconcile + VWC CRUD + cleanup, plus the validator with update-tolerant error filtering (pre-existing violations don't block operator updates)
  • cmd/manager/main.gosetupWebhook helper, registers the validator synchronously before mgr.Start()
  • Chart: new webhook Service (443→9443), conditional env vars in manager.yaml, defaults webhook.enabled=false / failurePolicy=Ignore

How to test?

  • go test ./internal/webhooks/...

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 20, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copy link
Copy Markdown
Contributor Author

caspx commented Apr 20, 2026


How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has required the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude bot commented Apr 20, 2026

Claude finished @caspx's task in 3m 14s —— View job


Code Review: Validating Admission Webhook for WekaCluster (OP-83)

  • Gather context and read changed files
  • Review webhook manager implementation
  • Review WekaCluster webhook logic
  • Review tests
  • Review Helm chart changes
  • Review main.go and config changes
  • Post final review

Overall this is a well-structured skeleton. The cert management is solid, the multi-replica conflict handling in VWC is thoughtful, the fail-open default is appropriate, and the update-filtering design (allow pre-existing violations) is a good UX decision. A few real issues to address before this goes to production.


Issues

1. SAN mismatch not detected on restartwebhook_manager.go:271-287

isCertValid only checks expiry. If WEBHOOK_SERVICE_NAME changes between deployments (e.g., prefix rename), the stored cert's SANs won't match the new service name, but the cert passes validation and gets reused — causing TLS handshake failures that are hard to diagnose.

// isCertValid should also verify the cert covers the current service name:
for _, san := range cert.DNSNames {
    if san == fmt.Sprintf("%s.%s.svc", m.config.ServiceName, m.namespace) {
        // SANs match
    }
}

Fix this →


2. ensureCertSecret doesn't handle AlreadyExists on Createwebhook_manager.go:243-246

ensureWebhookConfiguration explicitly handles the AlreadyExists race (two replicas creating simultaneously), but ensureCertSecret does not:

if !secretExists {
    if createErr := m.client.Create(ctx, newSecret); createErr != nil {
        return nil, fmt.Errorf("failed to create cert secret: %w", createErr) // ← AlreadyExists crashes the setup
    }
}

The operator currently runs with 1 replica, so this is low-risk today. But given that ensureWebhookConfiguration already shows awareness of multi-replica races (its comment mentions "another replica created it first"), the cert secret create should be consistent. If AlreadyExists, fall back to a Get and use the winner's cert.


3. filterNewErrors string-matching can misclassify same-rule violations with different valueswekacluster_webhook.go:86-117

field.Error.Error() includes the field value: spec.drives: Invalid value: 0: .... If a user changes an invalid field value (0 → -1), the message changes, so the filter treats it as a new violation and rejects the update — even though the same rule was already violated. This means users can't change an invalid value at all, which contradicts the stated intent of "allow pre-existing violations through."

A more robust approach compares on (Field, Type, Detail) rather than the full error string:

type errKey struct{ field, errType, detail string }
// key := errKey{e.Field, string(e.Type), e.Detail}

Fix this →


Suggestions

4. No runtime cert rotation

Certs are checked/rotated only at startup. If the operator runs for ~364 days, the cert gets within 24 hours of expiry on the next startup and rotates fine. But a long-lived deployment that restarts exactly when the cert is between 24h-expired and 24h-from-expiry is fine too. The real gap is: a cert issued today expires in 1 year — without a restart, the cert goes invalid while the operator is running. Consider a background goroutine that periodically re-runs EnsureCertificates (e.g., every 12 hours). This doesn't need to be in this PR but is worth a follow-up ticket.


5. resolveFailurePolicy logs misconfiguration at Info levelwebhook_manager.go:393

An unknown failurePolicy value is a misconfiguration. It should log at Warn/Error to be visible in production:

// current:
m.logger.Info("unknown webhook failurePolicy value, falling back to Ignore", ...)
// should be:
m.logger.Error(nil, "unknown webhook failurePolicy value, falling back to Ignore", ...)

Fix this →


6. WEBHOOK_SECRET_NAME is not Helm-configurable

manager.yaml exposes WEBHOOK_SERVICE_NAME, WEBHOOK_NAME, and WEBHOOK_FAILURE_POLICY, but not WEBHOOK_SECRET_NAME. If an operator needs to reference a pre-provisioned secret (e.g., cert-manager managed), they can't override the secret name via Helm. Minor for now, but worth adding webhook.secretName to values.yaml and the env block in manager.yaml.


Nitpicks

7. kubebuilder:webhook marker is "informational only" but could confuse make generatewekacluster_webhook.go:31

The comment warns this marker is not used by controller-gen for VWC generation. Consider prefixing it differently (e.g., // DOC: or removing it) to prevent future contributors from accidentally removing the "informational" comment and wondering why make generate doesn't produce a VWC manifest. At minimum, the existing comment is clear enough, but a short // +nocontrollergen guard would be cleaner.

8. validateWekaCluster is currently a pass-throughwekacluster_webhook.go:72-74

Acknowledged in the comment ("follow-up PRs"). Fine for a skeleton, but worth linking to the follow-up ticket so reviewers can track when actual rules land.

9. Makefile: paths= duplication in rbac targetMakefile:116

$(CONTROLLER_GEN) rbac:roleName=... paths="./internal/controllers/..." paths="./internal/webhooks/..."

controller-gen supports comma-separated paths: paths="{./internal/controllers/...,./internal/webhooks/...}" — minor style preference, current form also works.


What's well done

  • The CleanupIfExists-on-startup-only design avoids the rolling update VWC deletion gap — this is a subtle correctness detail that's often missed.
  • ensureWebhookConfiguration correctly handles the AlreadyExists race for the VWC itself with RetryOnConflict.
  • failurePolicy=Ignore default is the right fail-open choice for a new webhook.
  • The direct (non-cached) client in setupWebhook is the correct pattern for pre-mgr.Start() K8s writes.
  • Test coverage for cert validity, cert generation, VWC path derivation, and filter logic is solid.

@caspx caspx marked this pull request as ready for review April 20, 2026 06:42
@caspx caspx requested a review from a team as a code owner April 20, 2026 06:42
Copilot AI review requested due to automatic review settings April 20, 2026 06:42
@claude
Copy link
Copy Markdown

claude bot commented Apr 20, 2026

Claude finished @caspx's task in 27s —— View job


No changes since the previous review — see the full analysis above.

@graphite-app
Copy link
Copy Markdown

graphite-app bot commented Apr 20, 2026

Graphite Automations

"Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR • (04/20/26)

3 reviewers were added to this PR based on Anton Bykov's automation.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the foundational plumbing for a validating admission webhook for WekaCluster, including runtime-managed certificates/Secret, runtime-created ValidatingWebhookConfiguration, and a controller-runtime validator wired into the manager and Helm chart (feature-flagged off by default).

Changes:

  • Introduces internal/webhooks/ with a WebhookManager (cert/Secret + VWC lifecycle) and a WekaCluster validating webhook scaffold (including update-tolerant “new violations only” filtering).
  • Wires conditional webhook server creation and webhook bootstrap into cmd/manager/main.go.
  • Extends the Helm chart with webhook values, Service, env wiring, and RBAC generation updates.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
internal/webhooks/wekacluster_webhook.go Adds WekaCluster validating webhook implementation and error/warning filtering helpers.
internal/webhooks/wekacluster_webhook_test.go Unit tests for filterNewErrors / filterNewWarnings.
internal/webhooks/webhook_manager.go Implements cert generation, Secret reconcile, VWC create/update, and cleanup logic.
internal/webhooks/webhook_manager_test.go Unit tests for cert validity logic, cert generation, and webhook path constant.
internal/config/env.go Adds webhook-related configuration and env parsing defaults.
cmd/manager/main.go Conditionally configures webhook server and runs webhook bootstrap during manager startup.
charts/weka-operator/values.yaml Adds Helm values for webhook enablement and basic settings.
charts/weka-operator/templates/webhook-service.yaml Adds a conditional Service to expose the webhook server.
charts/weka-operator/templates/role.yaml Grants RBAC for managing validatingwebhookconfigurations.
charts/weka-operator/templates/manager.yaml Exposes webhook env vars and container port when enabled.
Makefile Includes webhook package in RBAC generation inputs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cmd/manager/main.go
Comment on lines +422 to +444
func setupWebhook(ctx context.Context, mgr ctrl.Manager, webhookEnabled bool, logger logr.Logger) {
directClient, err := client.New(mgr.GetConfig(), client.Options{Scheme: mgr.GetScheme()})
if err != nil {
logger.Error(err, "Failed to create direct client for webhook setup")
os.Exit(1)
}
webhookMgr := webhooks.NewWebhookManager(directClient, config.Config.Webhook, config.Config.OperatorPodNamespace, logger)

if !webhookEnabled {
cleanupCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
webhookMgr.CleanupIfExists(cleanupCtx)
return
}

if err := webhookMgr.EnsureCertificates(ctx); err != nil {
logger.Error(err, "Failed to ensure webhook certificates")
os.Exit(1)
}

if err := webhooks.SetupWekaClusterWebhookWithManager(mgr); err != nil {
logger.Error(err, "unable to create webhook", "webhook", "WekaCluster")
os.Exit(1)
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setupWebhook runs before mgr.Start(), so it will execute in every replica before leader election is acquired. That means multiple pods can concurrently create/regenerate the cert Secret and update the ValidatingWebhookConfiguration, which can lead to CA bundle / serving cert mismatches (and even operator exit if a create hits AlreadyExists). Consider ensuring only the elected leader performs Secret/VWC mutation (and non-leaders only read/write certs to disk), or otherwise make the Secret/VWC reconcile fully race-safe under concurrent startup.

Copilot uses AI. Check for mistakes.
Comment on lines +113 to +165
now := time.Now()
caTemplate := &x509.Certificate{
SerialNumber: caSerial,
Subject: pkix.Name{
CommonName: "weka-operator-webhook-ca",
Organization: []string{"weka.io"},
},
NotBefore: now,
NotAfter: now.Add(365 * 24 * time.Hour),
IsCA: true,
KeyUsage: x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
BasicConstraintsValid: true,
}

caCertDER, err := x509.CreateCertificate(rand.Reader, caTemplate, caTemplate, &caKey.PublicKey, caKey)
if err != nil {
return nil, nil, nil, fmt.Errorf("failed to create CA certificate: %w", err)
}

caCert, err := x509.ParseCertificate(caCertDER)
if err != nil {
return nil, nil, nil, fmt.Errorf("failed to parse CA certificate: %w", err)
}

// Generate leaf key
leafKey, err := rsa.GenerateKey(rand.Reader, 2048)
if err != nil {
return nil, nil, nil, fmt.Errorf("failed to generate leaf key: %w", err)
}

leafSerial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
if err != nil {
return nil, nil, nil, fmt.Errorf("failed to generate leaf serial: %w", err)
}

svcName := m.config.ServiceName
ns := m.namespace
dnsNames := []string{
svcName,
fmt.Sprintf("%s.%s.svc", svcName, ns),
fmt.Sprintf("%s.%s.svc.cluster.local", svcName, ns),
}

leafTemplate := &x509.Certificate{
SerialNumber: leafSerial,
Subject: pkix.Name{
CommonName: fmt.Sprintf("%s.%s.svc", svcName, ns),
Organization: []string{"weka.io"},
},
DNSNames: dnsNames,
NotBefore: now,
NotAfter: now.Add(365 * 24 * time.Hour),
KeyUsage: x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generateCertificates sets NotBefore to time.Now() for both CA and leaf certificates. In clusters with clock skew (API server vs. webhook pod), the cert can be considered "not yet valid" and TLS handshakes may fail until clocks converge. Backdate NotBefore by a small buffer (e.g., a few minutes or an hour) to make startup resilient to skew.

Copilot uses AI. Check for mistakes.
Comment on lines +244 to +257
return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
}
} else {
secret.Data = newSecret.Data
secret.Type = newSecret.Type
if secret.Labels == nil {
secret.Labels = map[string]string{}
}
for k, v := range newSecret.Labels {
secret.Labels[k] = v
}
if updateErr := m.client.Update(ctx, secret); updateErr != nil {
return nil, fmt.Errorf("failed to update cert secret: %w", updateErr)
}
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensureCertSecret isn't safe under concurrent startup: if two pods observe NotFound, one Create will succeed and the other will fail with AlreadyExists; similarly, Update can hit resourceVersion conflicts. Since setup runs before leader election, this can crash non-leader pods and/or rotate certs unexpectedly. Handle AlreadyExists by re-Get/reuse the existing Secret, and wrap Update in RetryOnConflict (or use Patch with optimistic locking) to make the reconcile idempotent under contention.

Suggested change
return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
}
} else {
secret.Data = newSecret.Data
secret.Type = newSecret.Type
if secret.Labels == nil {
secret.Labels = map[string]string{}
}
for k, v := range newSecret.Labels {
secret.Labels[k] = v
}
if updateErr := m.client.Update(ctx, secret); updateErr != nil {
return nil, fmt.Errorf("failed to update cert secret: %w", updateErr)
}
if !apierrors.IsAlreadyExists(createErr) {
return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
}
existingSecret := &corev1.Secret{}
if regetErr := m.client.Get(ctx, secretKey, existingSecret); regetErr != nil {
return nil, fmt.Errorf("failed to get concurrently created cert secret: %w", regetErr)
}
existingTLSCert, ok1 := existingSecret.Data["tls.crt"]
existingTLSKey, ok2 := existingSecret.Data["tls.key"]
existingCACert, ok3 := existingSecret.Data["ca.crt"]
if ok1 && ok2 && ok3 {
valid, reason := m.isCertValid(existingTLSCert)
if valid {
m.logger.Info("Using concurrently created webhook cert secret", "secret", m.config.SecretName)
if writeErr := m.writeCertsToDisk(existingTLSCert, existingTLSKey); writeErr != nil {
return nil, writeErr
}
return existingCACert, nil
}
return nil, fmt.Errorf("concurrently created cert secret is invalid: %s", reason)
}
return nil, fmt.Errorf("concurrently created cert secret is missing required keys")
}
} else {
var reusedCA []byte
var reusedTLSCert []byte
var reusedTLSKey []byte
updateErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
currentSecret := &corev1.Secret{}
if err := m.client.Get(ctx, secretKey, currentSecret); err != nil {
return err
}
currentTLSCert, ok1 := currentSecret.Data["tls.crt"]
currentTLSKey, ok2 := currentSecret.Data["tls.key"]
currentCACert, ok3 := currentSecret.Data["ca.crt"]
if ok1 && ok2 && ok3 {
valid, _ := m.isCertValid(currentTLSCert)
if valid {
reusedCA = currentCACert
reusedTLSCert = currentTLSCert
reusedTLSKey = currentTLSKey
return nil
}
}
currentSecret.Data = newSecret.Data
currentSecret.Type = newSecret.Type
if currentSecret.Labels == nil {
currentSecret.Labels = map[string]string{}
}
for k, v := range newSecret.Labels {
currentSecret.Labels[k] = v
}
return m.client.Update(ctx, currentSecret)
})
if updateErr != nil {
return nil, fmt.Errorf("failed to update cert secret: %w", updateErr)
}
if reusedCA != nil {
m.logger.Info("Using concurrently updated webhook cert secret", "secret", m.config.SecretName)
if writeErr := m.writeCertsToDisk(reusedTLSCert, reusedTLSKey); writeErr != nil {
return nil, writeErr
}
return reusedCA, nil
}

Copilot uses AI. Check for mistakes.
- name: WEBHOOK_SERVICE_NAME
value: "{{ .Values.prefix }}-webhook-service"
- name: WEBHOOK_NAME
value: "{{ .Values.prefix }}-validating-webhook-configuration"
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helm sets WEBHOOK_SERVICE_NAME and WEBHOOK_NAME based on .Values.prefix, but WEBHOOK_SECRET_NAME remains the hard-coded default ("weka-operator-webhook-server-cert") from env.go. This can cause name collisions across multiple Helm releases in the same namespace and makes the Secret name inconsistent with the prefixed Service/VWC names. Consider adding .Values.webhook.secretName (defaulting to a prefix-based name) and wiring it into WEBHOOK_SECRET_NAME here.

Suggested change
value: "{{ .Values.prefix }}-validating-webhook-configuration"
value: "{{ .Values.prefix }}-validating-webhook-configuration"
- name: WEBHOOK_SECRET_NAME
value: "{{ default (printf "%s-webhook-server-cert" .Values.prefix) .Values.webhook.secretName }}"

Copilot uses AI. Check for mistakes.
@caspx caspx force-pushed the admission_controller_shared_drives_clean branch from e8ecb06 to 512f17e Compare April 20, 2026 07:36
@claude
Copy link
Copy Markdown

claude bot commented Apr 20, 2026

Claude finished @caspx's task in 27s —— View job


No changes since the previous review — see the full analysis above.

Copilot AI review requested due to automatic review settings April 20, 2026 08:19
@caspx caspx force-pushed the admission_controller_shared_drives_clean branch from 512f17e to 6e96fa6 Compare April 20, 2026 08:19
@claude
Copy link
Copy Markdown

claude bot commented Apr 20, 2026

Claude finished @caspx's task in 1m 49s —— View job


  • Gather context and check for changes
  • Read current file state
  • Compare against previous review issues
  • Post findings

Code Review Update

Good fixes landed since the last review. Here's the updated status:


Fixed ✅

Issue #1 — SAN validation in isCertValid (webhook_manager.go:293-299)
SAN check is now present and correct — returns false with a descriptive reason if the expected DNS name is absent.

Issue #3filterNewErrors struct-key comparison (wekacluster_webhook.go:80-119)
fieldErrKey{field, errType, detail} is now used instead of e.Error() string matching. The fix is correct and handles the case where only the invalid value changes.

Issue #5resolveFailurePolicy log level (webhook_manager.go:406)
Now uses m.logger.Error(nil, ...) for unknown policy values.

Copilot clock-skew comment (webhook_manager.go:117)
notBefore is backdated by 1 hour — matches the OPA cert-controller pattern.


Still Open

1. ensureCertSecret doesn't handle AlreadyExists on Createwebhook_manager.go:246-248

This is unchanged from the previous review. With one replica it won't trigger, but it's inconsistent with ensureWebhookConfiguration's race handling. If AlreadyExists, fall back to a Get and use the winner's cert.

if createErr := m.client.Create(ctx, newSecret); createErr != nil {
    if !apierrors.IsAlreadyExists(createErr) {
        return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
    }
    // Another replica won the race — re-fetch and reuse its cert
    if regetErr := m.client.Get(ctx, secretKey, secret); regetErr != nil {
        return nil, fmt.Errorf("failed to get concurrently created cert secret: %w", regetErr)
    }
    // ... validate and use secret.Data
}

Fix this →


2. WEBHOOK_SECRET_NAME not Helm-configurablecharts/weka-operator/templates/manager.yaml, values.yaml

WEBHOOK_SERVICE_NAME, WEBHOOK_NAME, and WEBHOOK_FAILURE_POLICY are all Helm-wired, but WEBHOOK_SECRET_NAME still uses the hard-coded default from env.go. Multiple Helm releases in the same namespace will collide on the secret name. Adding webhook.secretName to values.yaml (defaulting to {{ .Values.prefix }}-webhook-server-cert) and wiring it into manager.yaml would make this consistent with the other prefixed names.
Fix this →


Non-blocking reminder

No runtime cert rotation — certs are checked/rotated only at startup. A deployment running continuously will hit cert expiry after ~1 year without a restart. Worth a follow-up ticket for a background rotation goroutine.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +207 to +211
valid, reason := m.isCertValid(tlsCert)
if valid {
m.logger.Info("Using existing webhook cert secret", "secret", m.config.SecretName)
if writeErr := m.writeCertsToDisk(tlsCert, tlsKey); writeErr != nil {
return nil, writeErr
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reusing an existing cert Secret, isCertValid() only validates the leaf cert (expiry + SAN) and then returns the stored ca.crt without checking that the CA actually signs the leaf. If ca.crt and tls.crt ever get out of sync (manual edits, partial updates), the webhook TLS handshake will fail while this code still considers the Secret “valid”. Consider verifying the leaf against ca.crt (x509.Verify with a pool) before accepting the Secret, and regenerate if verification fails.

Copilot uses AI. Check for mistakes.
Comment on lines +123 to +127
// Non-field error in the new set. If the full message didn't appear
// in old, return the whole newErr to avoid silently dropping a
// genuinely new violation.
if _, existed := oldMessages[e.Error()]; !existed {
return newErr
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filterNewErrors() is documented as returning only the errors newly introduced by the update, but in the non-*field.Error branch it returns the entire newErr when it finds a new non-field error. If newErr is an aggregate containing both pre-existing field errors and a new non-field error, this will re-include the pre-existing errors in the rejection message (and makes the behavior differ from the doc). Consider collecting and returning only the new non-field error(s) (plus any new field errors) instead of returning newErr wholesale.

Copilot uses AI. Check for mistakes.
Comment on lines +427 to +431
- name: WEBHOOK_ENABLED
value: "{{ .Values.webhook.enabled }}"
- name: WEBHOOK_PORT
value: "{{ .Values.webhook.port }}"
- name: WEBHOOK_CERT_DIR
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chart wires several webhook env vars unconditionally, but the PR description calls out “conditional env vars in manager.yaml”. If these are intended to be conditional, wrap this block in if .Values.webhook.enabled to avoid advertising webhook config when it’s off. Also, WEBHOOK_SECRET_NAME is supported in config/env.go but isn’t set here; without it, custom .Values.prefix installs in the same namespace will still share the same Secret name (risking collisions).

Copilot uses AI. Check for mistakes.
# -- Webhook server port
port: 9443
# -- Directory for TLS certificates
certDir: "/tmp/k8s-webhook-server/serving-certs"
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values.yaml exposes webhook.enabled/port/certDir/failurePolicy, but the operator also supports WEBHOOK_SECRET_NAME and uses a cluster-scoped WEBHOOK_NAME. Consider adding these (and possibly serviceName) to values.yaml so multi-release installs can avoid Secret/name collisions and users can customize all webhook-related resource names consistently via Helm.

Suggested change
certDir: "/tmp/k8s-webhook-server/serving-certs"
certDir: "/tmp/k8s-webhook-server/serving-certs"
# -- Validating webhook configuration name
name: "weka-operator-webhook"
# -- Service name backing the webhook server
serviceName: "weka-operator-webhook-service"
# -- Secret name containing the webhook serving certificates
secretName: "weka-operator-webhook-server-cert"

Copilot uses AI. Check for mistakes.
Comment thread cmd/manager/main.go
Comment on lines +412 to +447
// 1. Create a direct (non-cached) K8s client — mgr.GetClient() is cache-backed
// and only usable after mgr.Start().
// 2. EnsureCertificates: check/generate cert Secret, write certs to disk, create
// or update the ValidatingWebhookConfiguration with the CA bundle.
// 3. Register the WekaCluster validator with the manager.
//
// When disabled:
// - Best-effort delete any leftover VWC from a previous enabled run. Runs on
// STARTUP only, never on shutdown (deleting on every shutdown would wipe
// validation during rolling updates).
func setupWebhook(ctx context.Context, mgr ctrl.Manager, webhookEnabled bool, logger logr.Logger) {
directClient, err := client.New(mgr.GetConfig(), client.Options{Scheme: mgr.GetScheme()})
if err != nil {
logger.Error(err, "Failed to create direct client for webhook setup")
os.Exit(1)
}
webhookMgr := webhooks.NewWebhookManager(directClient, config.Config.Webhook, config.Config.OperatorPodNamespace, logger)

if !webhookEnabled {
cleanupCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
webhookMgr.CleanupIfExists(cleanupCtx)
return
}

if err := webhookMgr.EnsureCertificates(ctx); err != nil {
logger.Error(err, "Failed to ensure webhook certificates")
os.Exit(1)
}

if err := webhooks.SetupWekaClusterWebhookWithManager(mgr); err != nil {
logger.Error(err, "unable to create webhook", "webhook", "WekaCluster")
os.Exit(1)
}

logger.Info("WekaCluster validating webhook enabled")
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setupWebhook() is invoked before mgr.Start(), so it runs before controller-runtime leader election takes effect. In a multi-replica deployment, every pod will attempt Secret/VWC reconciliation and cert generation concurrently at startup, which can cause races and potentially mismatched CA bundles vs serving certs across pods. Consider gating EnsureCertificates/VWC reconciliation to the elected leader (or a dedicated Lease lock), while letting all replicas serve using the shared Secret.

Suggested change
// 1. Create a direct (non-cached) K8s client — mgr.GetClient() is cache-backed
// and only usable after mgr.Start().
// 2. EnsureCertificates: check/generate cert Secret, write certs to disk, create
// or update the ValidatingWebhookConfiguration with the CA bundle.
// 3. Register the WekaCluster validator with the manager.
//
// When disabled:
// - Best-effort delete any leftover VWC from a previous enabled run. Runs on
// STARTUP only, never on shutdown (deleting on every shutdown would wipe
// validation during rolling updates).
func setupWebhook(ctx context.Context, mgr ctrl.Manager, webhookEnabled bool, logger logr.Logger) {
directClient, err := client.New(mgr.GetConfig(), client.Options{Scheme: mgr.GetScheme()})
if err != nil {
logger.Error(err, "Failed to create direct client for webhook setup")
os.Exit(1)
}
webhookMgr := webhooks.NewWebhookManager(directClient, config.Config.Webhook, config.Config.OperatorPodNamespace, logger)
if !webhookEnabled {
cleanupCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
webhookMgr.CleanupIfExists(cleanupCtx)
return
}
if err := webhookMgr.EnsureCertificates(ctx); err != nil {
logger.Error(err, "Failed to ensure webhook certificates")
os.Exit(1)
}
if err := webhooks.SetupWekaClusterWebhookWithManager(mgr); err != nil {
logger.Error(err, "unable to create webhook", "webhook", "WekaCluster")
os.Exit(1)
}
logger.Info("WekaCluster validating webhook enabled")
// 1. Register the WekaCluster validator with the manager on every replica so
// each pod can serve webhook traffic.
// 2. Reconcile webhook certificates and the ValidatingWebhookConfiguration only
// on the elected leader, after manager leader election is active.
//
// When disabled:
// - Best-effort delete any leftover VWC from a previous enabled run. This is
// also gated by leader election so only one replica performs the cleanup.
type leaderElectionRunnable struct {
start func(context.Context) error
}
func (r leaderElectionRunnable) Start(ctx context.Context) error {
return r.start(ctx)
}
func (leaderElectionRunnable) NeedLeaderElection() bool {
return true
}
func setupWebhook(ctx context.Context, mgr ctrl.Manager, webhookEnabled bool, logger logr.Logger) {
if webhookEnabled {
if err := webhooks.SetupWekaClusterWebhookWithManager(mgr); err != nil {
logger.Error(err, "unable to create webhook", "webhook", "WekaCluster")
os.Exit(1)
}
}
if err := mgr.Add(leaderElectionRunnable{
start: func(runCtx context.Context) error {
directClient, err := client.New(mgr.GetConfig(), client.Options{Scheme: mgr.GetScheme()})
if err != nil {
logger.Error(err, "Failed to create direct client for webhook setup")
return err
}
webhookMgr := webhooks.NewWebhookManager(directClient, config.Config.Webhook, config.Config.OperatorPodNamespace, logger)
if !webhookEnabled {
cleanupCtx, cancel := context.WithTimeout(runCtx, 10*time.Second)
defer cancel()
webhookMgr.CleanupIfExists(cleanupCtx)
<-runCtx.Done()
return nil
}
if err := webhookMgr.EnsureCertificates(runCtx); err != nil {
logger.Error(err, "Failed to ensure webhook certificates")
return err
}
logger.Info("WekaCluster validating webhook enabled")
<-runCtx.Done()
return nil
},
}); err != nil {
logger.Error(err, "failed to register leader-elected webhook setup runnable")
os.Exit(1)
}

Copilot uses AI. Check for mistakes.

if !secretExists {
if createErr := m.client.Create(ctx, newSecret); createErr != nil {
return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensureCertSecret() has a startup race when multiple operator replicas start at once: both can observe NotFound, generate different certs, and one Create() will fail with AlreadyExists (currently returned as an error). Even worse, if you later change this to Update-on-exists, it could overwrite the secret with a different CA/cert and break other pods. Handle IsAlreadyExists on create by re-GETing the Secret and using the stored certs (or retry) instead of failing/overwriting.

Suggested change
return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
if apierrors.IsAlreadyExists(createErr) {
m.logger.Info("Webhook cert secret was created concurrently; reloading existing secret", "secret", m.config.SecretName)
secret = &corev1.Secret{}
if getErr := m.client.Get(ctx, client.ObjectKey{Name: m.config.SecretName, Namespace: m.namespace}, secret); getErr != nil {
return nil, fmt.Errorf("cert secret was created concurrently but failed to re-fetch it: %w", getErr)
}
} else {
return nil, fmt.Errorf("failed to create cert secret: %w", createErr)
}
} else {
secret = newSecret

Copilot uses AI. Check for mistakes.
Comment on lines +335 to +341
if apierrors.IsAlreadyExists(createErr) {
return createErr // trigger retry; another replica created it first
}
return fmt.Errorf("failed to create ValidatingWebhookConfiguration: %w", createErr)
}
m.logger.Info("Created ValidatingWebhookConfiguration", "name", m.config.WebhookName)
return nil
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensureWebhookConfiguration() expects RetryOnConflict to retry when Create() returns AlreadyExists (another replica created the VWC first), but RetryOnConflict only retries on conflict errors; AlreadyExists typically won’t be retried and will bubble up, potentially crashing a replica on startup. Consider handling IsAlreadyExists explicitly by re-GETing the VWC and proceeding to the update path (or use a retry loop that retries on AlreadyExists as well).

Suggested change
if apierrors.IsAlreadyExists(createErr) {
return createErr // trigger retry; another replica created it first
}
return fmt.Errorf("failed to create ValidatingWebhookConfiguration: %w", createErr)
}
m.logger.Info("Created ValidatingWebhookConfiguration", "name", m.config.WebhookName)
return nil
if !apierrors.IsAlreadyExists(createErr) {
return fmt.Errorf("failed to create ValidatingWebhookConfiguration: %w", createErr)
}
if getErr := m.client.Get(ctx, client.ObjectKey{Name: m.config.WebhookName}, existing); getErr != nil {
return fmt.Errorf("failed to get ValidatingWebhookConfiguration after concurrent create: %w", getErr)
}
} else {
m.logger.Info("Created ValidatingWebhookConfiguration", "name", m.config.WebhookName)
return nil
}

Copilot uses AI. Check for mistakes.
@caspx caspx force-pushed the admission_controller_shared_drives_clean branch from 6e96fa6 to 1a076ac Compare April 20, 2026 10:13
@claude
Copy link
Copy Markdown

claude bot commented Apr 20, 2026

Claude finished @caspx's task in 2m 10s —— View job


  • Check git diff for changes since last review
  • Read current file state for open issues
  • Compare against previous review findings
  • Post updated review

No new commits since the previous review — the branch still has one commit (1a076aca). Two issues from the last review remain open, and I've confirmed a third bug (originally flagged by Copilot) that warrants attention.


Still Open

1. ensureCertSecret doesn't handle AlreadyExists on Createwebhook_manager.go:247-249

Unchanged. If two replicas both observe NotFound and race to Create, one will return AlreadyExists which propagates up to setupWebhookos.Exit(1). The operator currently runs 1 replica so this won't trigger, but it's inconsistent with the ensureWebhookConfiguration race handling. On AlreadyExists: re-Get and use the winner's cert.


2. WEBHOOK_SECRET_NAME not Helm-configurablecharts/weka-operator/templates/manager.yaml, values.yaml

Unchanged. env.go:523 reads WEBHOOK_SECRET_NAME with default "weka-operator-webhook-server-cert", but manager.yaml only wires WEBHOOK_SERVICE_NAME, WEBHOOK_NAME, and WEBHOOK_FAILURE_POLICY. Multiple Helm releases in the same namespace will collide on the secret name.

Fix this →


3. RetryOnConflict won't actually retry on AlreadyExistswebhook_manager.go:344-346

The comment says "trigger retry; another replica created it first" but retry.RetryOnConflict uses apierrors.IsConflict as its retryable predicate — AlreadyExists fails that check and falls through to the default: return false, err arm of OnError, causing the error to bubble up unretried. In a multi-replica race this exits the operator. Fix: handle IsAlreadyExists explicitly by re-Get-ting the VWC and falling through to the update path, same as the existing update branch:

if createErr := m.client.Create(ctx, desired); createErr != nil {
    if !apierrors.IsAlreadyExists(createErr) {
        return fmt.Errorf("failed to create ValidatingWebhookConfiguration: %w", createErr)
    }
    // Another replica won the race — re-fetch and fall through to update
    if getErr := m.client.Get(ctx, client.ObjectKey{Name: m.config.WebhookName}, existing); getErr != nil {
        return fmt.Errorf("failed to get VWC after concurrent create: %w", getErr)
    }
}

Fix this →


Suggestion

4. isCertValid doesn't verify the CA signs the leafwebhook_manager.go:201-213

When reusing an existing Secret, isCertValid checks expiry and SANs but not that ca.crt actually signed tls.crt. If the two keys ever get out of sync (manual edit, partial write during a crash), the Secret looks valid but the webhook TLS handshake will fail. Consider adding an x509.Verify call with a pool built from ca.crt:

pool := x509.NewCertPool()
pool.AppendCertsFromPEM(caCert)
if _, err := cert.Verify(x509.VerifyOptions{Roots: pool}); err != nil {
    return false, fmt.Sprintf("cert not signed by stored CA: %v", err)
}

Non-blocking reminder

No runtime cert rotation — certs rotate on restart only. A deployment running continuously will hit expiry after ~1 year. Worth a follow-up ticket for a background rotation goroutine (e.g., every 12 hours).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants