Important

This repository is not intended to be used in production environments. It only contains files to evaluate the PSI integration into the descheduler. For a production-level integration, monitor the OpenShift documentation to understand when this feature reaches general availability (GA).

Using the descheduler with PSI metrics

PSI (Pressure Stall Information) metrics expose pressure at the node level (and even at the cgroup level). The descheduler can leverage these metrics to rebalance workloads according to actual node usage.
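
For reference, the kernel exposes node-level PSI data under /proc/pressure once PSI is enabled. A quick check on a node looks like this (sample values are illustrative):

$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.15 avg300=0.40 total=11220657
full avg10=0.00 avg60=0.03 avg300=0.12 total=8158987

The avg10/avg60/avg300 fields are rolling averages, in percent, of the time tasks were stalled on the resource; the some line covers intervals where at least one task stalled, while full covers intervals where all runnable tasks stalled.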

User stories

  • As a cluster administrator, I want to ensure that all my nodes are equally utilized in order to avoid pressure on individual workloads

Scope & Limitations

  • Limited to Virtual Machines run with OpenShift Virtualization
  • Limited to worker machine pools (PSI metrics are nevertheless also needed on control plane nodes)

Usage

Note

  • Use a cluster with at least 6 worker nodes
  • The cluster should have no other workloads running

Note

Two alternative test scenarios are available; replace TEST_SCENARIO=1 with TEST_SCENARIO=2 to switch to the second one.

$ bash to.sh deploy
...
$ bash to.sh apply
...
$ TEST_SCENARIO=1 bash e2e-test.sh
...
$

Deployment walkthrough

Running bash to.sh deploy will deploy all necessary parts. This section looks at what exactly gets deployed.

  1. Reconfiguration of the machine pools to enable PSI metrics at the kernel level and expose them via the node_exporter (a sketch of such a MachineConfig follows this list)

    oc apply -f manifests/10-mc-psi-controlplane.yaml
    oc apply -f manifests/11-mc-psi-worker.yaml
    oc apply -f manifests/12-mc-schedstats-worker.yaml
    
  2. Deploy the Descheduler Operator and OpenShift Virtualization

    oc apply -f manifests/20-namespaces.yaml
    oc apply -f manifests/30-operatorgroup.yaml
    oc apply -f manifests/31-subscriptions.yaml
    
    until oc apply -f manifests/40-cnv-operator-cr.yaml ; do echo -n . ; sleep 6 ; done
    until oc apply -f manifests/41-descheduler-operator-cr.yaml ; do echo -n . ; sleep 6 ; done
    
    oc adm policy add-cluster-role-to-user cluster-monitoring-view -z $SA -n $NS  # for desched metrics
    
  3. Deploy the node tainting component (a sketch of its core logic follows this list)

    oc create -n $NS configmap desched-taint --from-file contrib/desched-taint.sh
    oc apply -n $NS -f manifests/50-desched-taint.yaml
    oc adm policy add-cluster-role-to-user system:controller:node-controller -z $SA -n $NS  # for tainter
    
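For illustration, enabling PSI on a machine pool essentially comes down to adding the psi=1 kernel argument via a MachineConfig. A minimal sketch, assuming a worker pool (the name is illustrative; see the actual manifests in this repository for the authoritative version):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 11-worker-enable-psi
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - psi=1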

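Likewise, the core of the node tainting component is just applying and removing a taint with oc adm taint. A rough sketch, with a hypothetical taint key (the real logic lives in contrib/desched-taint.sh):

# Hypothetical taint key; the actual key is defined in contrib/desched-taint.sh.
TAINT_KEY="descheduler.openshift.io/evicting"

# Taint an overloaded node so no new pods are scheduled onto it ...
oc adm taint nodes "$NODE" "${TAINT_KEY}=true:NoSchedule" --overwrite

# ... and remove the taint once the pressure subsides.
oc adm taint nodes "$NODE" "${TAINT_KEY}:NoSchedule-"
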
Monitoring

Two dashboards are available for monitoring the descheduler behaviour.

Load Aware ReBalancing (Grafana)

A Grafana dashboard for the PSI-based load-aware rebalancing profile. See monitoring/README.md for deployment instructions.

Memory Aware Rebalancing (Perses — on-cluster via COO)

The dashboard can be deployed directly onto the cluster using the Cluster Observability Operator (COO) PersesDashboard CRD. This makes it available inside the OpenShift Console under Observe → Dashboards.

Prerequisites: COO ≥ v1.3 installed (Perses CRDs present), jq.

Step 1 — Enable Perses in the COO UIPlugin

If no UIPlugin exists yet, create one:

$ oc apply -f - <<'EOF'
apiVersion: observability.openshift.io/v1alpha1
kind: UIPlugin
metadata:
  name: monitoring
spec:
  type: Monitoring
  monitoring:
    perses:
      enabled: true
EOF

If a monitoring UIPlugin already exists (e.g. created by another component), patch it instead:

$ oc patch uiplugins.observability.openshift.io monitoring \
    --type=merge \
    -p '{"spec":{"monitoring":{"perses":{"enabled":true}}}}'

Wait until the Perses server is up:

$ oc rollout status statefulset/perses \
    -n openshift-cluster-observability-operator

Step 2 — Create a namespace and grant Thanos access

$ oc new-project descheduler-monitoring

$ oc create serviceaccount perses-datasource-sa -n descheduler-monitoring

$ oc adm policy add-cluster-role-to-user cluster-monitoring-view \
    -z perses-datasource-sa -n descheduler-monitoring

Step 3 — Create the bearer-token secret

$ oc create secret generic thanos-querier-datasource-secret \
    -n descheduler-monitoring \
    --from-literal=token="$(
        oc create token perses-datasource-sa \
            -n descheduler-monitoring \
            --duration=8760h
    )"

Note

oc create token issues an expiring token bound to the ServiceAccount. Rotate it by re-running Step 3 and Step 5 below.

Step 4 — Create the PersesDatasource

$ oc apply -f - <<'EOF'
apiVersion: perses.dev/v1alpha1
kind: PersesDatasource
metadata:
  name: thanos-querier
  namespace: descheduler-monitoring
spec:
  config:
    default: true
    display:
      name: Thanos Querier
    plugin:
      kind: PrometheusDatasource
      spec:
        proxy:
          kind: HTTPProxy
          spec:
            url: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
            secret: thanos-querier-datasource-secret
  client:
    tls:
      enable: true
      caCert:
        type: file
        certPath: /ca/service-ca.crt
EOF
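
To confirm that the operator picked up the datasource, assuming the resource name from the CR above:

$ oc get persesdatasource thanos-querier -n descheduler-monitoring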

Step 5 — Register the bearer token with the Perses server

The perses-operator reconciles the PersesDatasource CR into the Perses server's internal file database, but it only stores the TLS configuration there. The bearer token must be pushed directly to the Perses API so that the server can attach it to outgoing Thanos requests.

$ oc port-forward -n openshift-cluster-observability-operator \
    svc/perses 18443:8080 &
$ PF_PID=$!

$ OC_TOKEN=$(oc whoami -t)
$ BEARER=$(oc get secret thanos-querier-datasource-secret \
    -n descheduler-monitoring \
    -o jsonpath='{.data.token}' | base64 -d)

$ curl -sk -X POST \
    -H "Authorization: Bearer $OC_TOKEN" \
    -H "Content-Type: application/json" \
    https://localhost:18443/api/v1/projects/descheduler-monitoring/secrets \
    -d "{
      \"kind\": \"Secret\",
      \"metadata\": {
        \"name\": \"thanos-querier-datasource-secret\",
        \"project\": \"descheduler-monitoring\"
      },
      \"spec\": {
        \"authorization\": {\"type\": \"Bearer\", \"credentials\": \"$BEARER\"},
        \"tlsConfig\": {\"caFile\": \"/ca/service-ca.crt\"}
      }
    }"

$ kill $PF_PID

Note

The Perses server stores its database on a PersistentVolumeClaim, so this secret survives pod restarts. Re-run this step whenever the bearer token is rotated (Step 3).
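
To double-check that the secret reached the Perses database, assuming the list endpoint mirrors the POST endpoint used above, and with the port-forward from Step 5 re-established:

$ curl -sk -H "Authorization: Bearer $OC_TOKEN" \
    https://localhost:18443/api/v1/projects/descheduler-monitoring/secrets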

Step 6 — Deploy the dashboard

$ jq '{
    apiVersion: "perses.dev/v1alpha1",
    kind: "PersesDashboard",
    metadata: {
      name: "memory-aware-rebalancing",
      namespace: "descheduler-monitoring"
    },
    spec: .spec
  }' monitoring/perses/provisioning/memory_aware_rebalancing.json \
  | oc apply -f -

The dashboard appears in the OpenShift Console under Observe → Dashboards → memory-aware-rebalancing within a few seconds.

To update the dashboard after editing the provisioning JSON, re-run Step 6.
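
To verify the dashboard resource on the cluster:

$ oc get persesdashboard memory-aware-rebalancing -n descheduler-monitoring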

Memory Aware Rebalancing (Perses — local)

A Perses dashboard focused on the memory-aware aspects: synthetic utilization values, dynamic thresholds, PSI pressure, node classification over time, and evictions.

The stack runs locally in containers and proxies to the remote cluster's Thanos querier, so no cluster-side deployment is required.

Prerequisites: podman (or docker) with compose support, and an active oc login to the target cluster.

$ cd monitoring/perses
$ KUBECONFIG=/path/to/kubeconfig ./start.sh start

Open http://localhost:8080 and navigate to Projects → descheduler → Dashboards → memory-aware-rebalancing.

To stop the stack:

$ ./start.sh stop

To refresh an expired token (tokens are short-lived), simply re-run ./start.sh start — it regenerates nginx.conf with the new token and restarts the proxy container.

What the dashboard shows

  • Utilization: CPU utilization per node + fleet average (★); memory utilization per node + fleet average
  • Pressure: CPU PSI pressure per node + fleet average; memory PSI pressure per node + fleet average
  • Synthetic utilization value & thresholds: per-node descheduler score with dynamic high (red dashed) and low (orange dashed) threshold bands
  • Node classification: StatusHistoryChart — Underutilized / Normal / Overutilized per node over time
  • Evictions: total counter + per-node time series for KubeVirtRelieveAndMigrate
  • Detailed metrics (collapsed): CPU & memory pressure and utilization, each per node with fleet average overlaid

All panels are filterable by node via the Node variable at the top of the dashboard.
