Return empty metrics for unavailable BYO CNI cluster #8060
mgoltzsche wants to merge 3 commits into main
Conversation
Force-pushed from b686fdb to 24666c9
Force-pushed from 24666c9 to 247748e
Let the API return empty node/cluster metrics when the corresponding user cluster is unavailable and configured to use a custom CNI provider. This avoids bothering KKP admins with alerts that are not actionable for them, since user cluster admins need to set up CNI themselves if they disabled managed CNI for their user cluster. Fixes kubermatic/kubermatic#15801 Signed-off-by: Max Goltzsche <max.goltzsche@kubermatic.com>
Force-pushed from 247748e to 769fa8c
Hi @mgoltzsche, can you provide screenshots showing the actual issue, or something we can reproduce locally or in the dev environment?
Pull request overview
This PR adjusts the API’s cluster and machine deployment metrics endpoints to return empty metrics results (HTTP 200) when the user cluster is unavailable and is configured with the none CNI plugin (BYO CNI), reducing noisy/unactionable error alerts for KKP admins.
Changes:
- Treat `ServiceUnavailable` from the metrics API as a non-fatal condition for BYO CNI (`CNIPluginTypeNone`) clusters by returning empty metrics.
- Downgrade logging from error behavior (via HTTP error propagation) to a warning in this specific scenario.
- Add unit tests covering the new BYO CNI + unavailable cluster behavior for both cluster metrics and machine deployment metrics endpoints.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| modules/api/pkg/handler/common/machine.go | Return an empty node-metrics list (instead of an error) when metrics listing fails with ServiceUnavailable for BYO CNI clusters. |
| modules/api/pkg/handler/common/machine_test.go | Add a unit test verifying machine deployment metrics return an empty list for BYO CNI + ServiceUnavailable. |
| modules/api/pkg/handler/common/cluster.go | Return a minimal ClusterMetrics object (name only / zero values) when metrics listing fails with ServiceUnavailable for BYO CNI clusters. |
| modules/api/pkg/handler/common/cluster_test.go | Add a unit test verifying cluster metrics returns empty metrics for BYO CNI + ServiceUnavailable, plus supporting fakes. |
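As a rough illustration of the scenario these tests cover (not the PR's actual test code), here is a minimal sketch exercising the hypothetical `isUnavailableBYOCNICluster` helper sketched further down in the PR description, assuming the `k8c.io/kubermatic/v2` API types:

```go
package common

import (
	"testing"

	apierrors "k8s.io/apimachinery/pkg/api/errors"

	kubermaticv1 "k8c.io/kubermatic/v2/pkg/apis/kubermatic/v1" // assumed import path
)

func TestIsUnavailableBYOCNICluster(t *testing.T) {
	// The error the metrics API returns while the user cluster is down.
	unavailable := apierrors.NewServiceUnavailable("the server is currently unable to handle the request")

	byoCNI := &kubermaticv1.Cluster{Spec: kubermaticv1.ClusterSpec{
		CNIPlugin: &kubermaticv1.CNIPluginSettings{Type: kubermaticv1.CNIPluginTypeNone},
	}}
	managedCNI := &kubermaticv1.Cluster{Spec: kubermaticv1.ClusterSpec{
		CNIPlugin: &kubermaticv1.CNIPluginSettings{Type: kubermaticv1.CNIPluginTypeCilium},
	}}

	if !isUnavailableBYOCNICluster(unavailable, byoCNI) {
		t.Error("expected 503 + CNI plugin 'none' to be treated as non-fatal")
	}
	if isUnavailableBYOCNICluster(unavailable, managedCNI) {
		t.Error("expected 503 on a managed-CNI cluster to remain an error")
	}
}
```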
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Max Goltzsche <mgoltzsche@users.noreply.github.com>
LGTM label has been added. Git tree hash: 84a101cb19d5cf1a0ad48a906e66921bad427aeb
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: kron4eg
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
@KhizerRehan I updated the PR description now, providing evidence of the problem (see the "Behaviour without this fix" section). |
What this PR does / why we need it:
Let the API return empty node/cluster metrics when the corresponding user cluster is unavailable and configured to use a custom CNI provider.
This avoids bothering KKP admins with alerts that are not actionable for them, since user cluster admins need to set up CNI themselves if they disabled managed CNI for their user cluster.
In case of an unavailable user cluster that has the `none` CNI plugin configured, the metrics endpoints behave as follows:

- The cluster metrics endpoint returns a minimal `ClusterMetrics` object (name only, with every other field at its `0` default value).
- The machine deployment node metrics endpoint returns an empty list.

Also, the metrics endpoints don't log an error for every request anymore in that case but a warning (see the sketch below).
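A minimal sketch of the guard described above, assuming the `k8c.io/kubermatic/v2` import path and a hypothetical `isUnavailableBYOCNICluster` helper name; the actual changes live in `modules/api/pkg/handler/common/machine.go` and `cluster.go`:

```go
package common

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"

	kubermaticv1 "k8c.io/kubermatic/v2/pkg/apis/kubermatic/v1" // assumed import path
)

// isUnavailableBYOCNICluster (hypothetical name) reports whether a failed
// metrics request should be answered with empty metrics instead of an error:
// the metrics API returned 503 and the cluster owner opted out of managed CNI.
func isUnavailableBYOCNICluster(err error, cluster *kubermaticv1.Cluster) bool {
	return apierrors.IsServiceUnavailable(err) &&
		cluster.Spec.CNIPlugin != nil &&
		cluster.Spec.CNIPlugin.Type == kubermaticv1.CNIPluginTypeNone
}
```

At the call sites, the handlers would then log a warning and return an empty `ClusterMetrics` object or node-metrics list instead of propagating the HTTP error.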
Behaviour without this fix
Without this fix, the Dashboard API server continuously logs errors like the following as soon as there is a user cluster with CNI set to `none` and somebody browses the detail view of that cluster within the KKP Dashboard:

KKP Dashboard API server error log
```
$ kubectl -n kubermatic logs kubermatic-api-86dd4c47dd-5fgxm | grep "the server is currently unable to handle the request"
{"level":"error","time":"2026-05-07T14:50:44.661Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/machinedeployments/k8c-fba7bc-4-e3af0284/nodes/metrics"}
{"level":"error","time":"2026-05-07T14:50:59.642Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:09.599Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:29.831Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/v79gv9jx5s/metrics"}
{"level":"error","time":"2026-05-07T14:51:35.626Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:39.720Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:49.926Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-08T07:57:07.800Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:57:22.986Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:03.505Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:16.844Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:23.055Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:34.974Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:51.464Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:59:06.994Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:59:37.974Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:59:50.537Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T08:00:23.350Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T08:00:54.005Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
```

Correspondingly, there is an error notification shown within the Dashboard UI periodically:

As a consequence of the Dashboard API server's request errors, the kubermatic-api error rate eventually exceeds the 0.1 threshold (I had to create 3 of those BYO CNI clusters and keep a Dashboard UI tab open for each in parallel to get there), triggering the KubermaticAPITooManyErrors alert:


Which issue(s) this PR fixes:
Fixes kubermatic/kubermatic#15801
What type of PR is this?
/kind bug
Special notes for your reviewer:
Does this PR introduce a user-facing change? Then add your Release Note here:
Documentation:
Test issue: