Skip to content

[BUGFIX] Sync resources to all pods instead of service#412

Open
rickardsjp wants to merge 3 commits into
perses:mainfrom
rickardsjp:fix-multiple-replicas
Open

[BUGFIX] Sync resources to all pods instead of service#412
rickardsjp wants to merge 3 commits into
perses:mainfrom
rickardsjp:fix-multiple-replicas

Conversation

@rickardsjp
Copy link
Copy Markdown
Contributor

Description

For file-based storage with multiple replicas, push resources to each pod individually rather than through the ClusterIP Service. The operator now lists ready pods by label selector, creates an HTTP client per pod IP, and syncs to all of them with best-effort error aggregation.

Closes: #409

Type of change

  • FEATURE (non-breaking change which adds functionality)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • BREAKINGCHANGE (fix or feature that would cause existing functionality to not work as expected)
  • DOC (documentation only)
  • IGNORE (tooling, build system, CI, etc.)

Verification

  • Unit tests added/updated
  • Integration tests added/updated
  • E2E tests added/updated
  • Manual testing performed

Checklist

  • Pull request has a descriptive title and context useful to a reviewer
  • Code follows project conventions and passes linting
  • All commits have DCO signoffs

For file-based storage with multiple replicas, the operator must push
resources to each pod individually rather than through the Service.
This adds the factory method that lists ready pods and creates a client
per pod IP.

Signed-off-by: Jeremy Rickards <jeremy.rickards@sap.com>
Refactor dashboard, datasource, and globaldatasource controllers to use
CreateClientsForAllPods, pushing resources to each ready pod individually
rather than through the round-robin Service.

Signed-off-by: Jeremy Rickards <jeremy.rickards@sap.com>
Signed-off-by: Jeremy Rickards <jeremy.rickards@sap.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes multi-replica file-based Perses deployments by syncing dashboards/datasources/globaldatasources to each ready Perses pod directly, rather than sending requests through the ClusterIP Service (which load-balances to a single replica).

Changes:

  • Add a CreateClientsForAllPods API in the Perses client factory to list ready pods and create one REST client per pod endpoint.
  • Update Dashboard/Datasource/GlobalDatasource controllers to sync/delete resources across all returned clients with best-effort error aggregation.
  • Expand controller RBAC markers to allow listing pods.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
internal/perses/common/perses_client_factory.go Adds multi-pod client creation by listing ready pods and building per-pod REST clients.
controllers/dashboards/persesdashboard_controller.go Adds pod list/watch RBAC marker for dashboard controller.
controllers/dashboards/dashboard_controller.go Syncs/deletes dashboards across all pod clients instead of a single service client.
controllers/datasources/persesdatasource_controller.go Adds pod list/watch RBAC marker for datasource controller.
controllers/datasources/datasource_controller.go Syncs/deletes datasources across all pod clients instead of a single service client.
controllers/globaldatasources/persesglobaldatasource_controller.go Adds pod list/watch RBAC marker for global datasource controller.
controllers/globaldatasources/globaldatasource_controller.go Syncs/deletes global datasources across all pod clients instead of a single service client.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if !isPodReady(pod) {
continue
}
urlStr := fmt.Sprintf("%s://%s:%d%s", httpProtocol, pod.Status.PodIP, containerPort, perses.Spec.Config.APIPrefix)
Comment on lines 16 to 23
import (
"context"
"flag"
"fmt"
"os"

corev1 "k8s.io/api/core/v1"
"sigs.k8s.io/controller-runtime/pkg/client"
Comment on lines +105 to +107
if len(errs) > 0 {
return subreconciler.RequeueWithErrorAndReason(errors.Join(errs...), persescommon.ReasonBackendError)
}
Comment on lines 142 to 147
if persescommon.HasSecretConfig(datasource.Spec.Client) {
_, reason, err := r.syncPersesSecret(ctx, persesClient, datasource)
_, _, err := r.syncPersesSecret(ctx, persesClient, datasource)
if err != nil {
dlog.WithError(err).Errorf("Failed to create datasource secret: %s", datasource.Name)
return subreconciler.RequeueWithErrorAndReason(err, reason)
return err
}
Comment on lines +103 to 105
if len(errs) > 0 {
return subreconciler.RequeueWithErrorAndReason(errors.Join(errs...), persescommon.ReasonBackendError)
}
Comment on lines 112 to 117
if persescommon.HasSecretConfig(globaldatasource.Spec.Client) {
_, reason, err := r.syncPersesGlobalSecret(ctx, persesClient, globaldatasource)
_, _, err := r.syncPersesGlobalSecret(ctx, persesClient, globaldatasource)
if err != nil {
gdlog.WithError(err).Errorf("Failed to create globaldatasource secret: %s", globaldatasource.Name)
return subreconciler.RequeueWithErrorAndReason(err, reason)
return err
}
Comment on lines +160 to +164
func (f *PersesClientFactoryWithConfig) CreateClientsForAllPods(ctx context.Context, k8sClient client.Reader, perses persesv1alpha2.Perses) ([]v1.ClientInterface, error) {
if perses.Spec.Config.Database.SQL != nil {
c, err := f.CreateClient(ctx, k8sClient, perses)
if err != nil {
return nil, err
@slashpai
Copy link
Copy Markdown
Member

slashpai commented Jun 4, 2026

How would this work with TLS enabled? If Perses TLS certs are issued for the Service DNS, direct pod ip calls might fail validation IIUC.

Also if we scale the replicas till a reconciliation is triggered new replica won't get the updates right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Resources not synced to all pods when using file-based storage with multiple replicas

3 participants