
Return empty metrics for unavailable BYO CNI cluster #8060

Open
mgoltzsche wants to merge 3 commits into main from return-empty-metrics-for-unavailable-byocni-cluster

Conversation

@mgoltzsche
Contributor

@mgoltzsche mgoltzsche commented May 7, 2026

What this PR does / why we need it:
Let the API return empty node/cluster metrics when the corresponding user cluster is unavailable and configured to use a custom CNI provider.
This avoids bothering KKP admins with alerts that are not actionable for them, since user cluster admins must set up the CNI themselves when they have disabled managed CNI for their user cluster.

In case of an unavailable user cluster that has the none CNI plugin configured, the metrics endpoints behave as follows:

  • The Cluster metrics endpoint returns an object with the cluster name but no metrics (all metric fields at their 0 default values).
  • The MachineDeployment metrics endpoint returns an empty array instead of listing the available Machines with their metrics.

Also, in that case the metrics endpoints no longer log an error for every request, but only a warning.
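The behavior above can be sketched as follows. This is a simplified, self-contained illustration, not the actual handler code: the type and function names (`getClusterMetrics`, `errServiceUnavailable`, the trimmed `ClusterMetrics` struct) are hypothetical stand-ins; only `CNIPluginTypeNone` matches a real identifier from the change.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins only; not the actual Kubermatic API surface.
type CNIPluginType string

const CNIPluginTypeNone CNIPluginType = "none"

// errServiceUnavailable stands in for the Kubernetes ServiceUnavailable
// API error returned while the user cluster's metrics API is unreachable.
var errServiceUnavailable = errors.New("the server is currently unable to handle the request")

// ClusterMetrics is a minimal stand-in for the API's cluster metrics object.
type ClusterMetrics struct {
	Name string
	// Metric fields omitted; they keep their zero values in the empty result.
}

// getClusterMetrics returns empty metrics instead of an error when the
// metrics API is unavailable and the cluster uses BYO CNI (plugin "none").
func getClusterMetrics(name string, cni CNIPluginType, fetch func() (*ClusterMetrics, error)) (*ClusterMetrics, error) {
	m, err := fetch()
	if err != nil {
		if errors.Is(err, errServiceUnavailable) && cni == CNIPluginTypeNone {
			// Expected until the user cluster admin installs a CNI:
			// log a warning instead of propagating an HTTP error.
			fmt.Printf("warning: metrics unavailable for BYO CNI cluster %s\n", name)
			return &ClusterMetrics{Name: name}, nil
		}
		return nil, err
	}
	return m, nil
}

func main() {
	unavailable := func() (*ClusterMetrics, error) { return nil, errServiceUnavailable }

	m, err := getClusterMetrics("k8c-fba7bc", CNIPluginTypeNone, unavailable)
	fmt.Println(m.Name, err == nil) // k8c-fba7bc true

	// A cluster with managed CNI still propagates the error.
	_, err = getClusterMetrics("managed", CNIPluginType("cilium"), unavailable)
	fmt.Println(err != nil) // true
}
```

The same pattern applies to the MachineDeployment metrics endpoint, which returns an empty list instead of a name-only object.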

Behaviour without this fix
Without this fix, the Dashboard API server continuously logs errors like the following as soon as there is a user cluster with CNI set to none and somebody is browsing the detail view of that cluster within the KKP Dashboard:

KKP Dashboard API server error log:

```
$ kubectl -n kubermatic logs kubermatic-api-86dd4c47dd-5fgxm | grep "the server is currently unable to handle the request"
{"level":"error","time":"2026-05-07T14:50:44.661Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/machinedeployments/k8c-fba7bc-4-e3af0284/nodes/metrics"}
{"level":"error","time":"2026-05-07T14:50:59.642Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:09.599Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:29.831Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/v79gv9jx5s/metrics"}
{"level":"error","time":"2026-05-07T14:51:35.626Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:39.720Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:49.926Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-08T07:57:07.800Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:57:22.986Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:03.505Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:16.844Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:23.055Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:34.974Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:51.464Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:59:06.994Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:59:37.974Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:59:50.537Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T08:00:23.350Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T08:00:54.005Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
```

Correspondingly, an error notification is shown periodically within the Dashboard UI:

[screenshot: Dashboard UI error notification]

As a consequence of the Dashboard API server's request errors, the kubermatic-api error rate eventually exceeds the 0.1 threshold (I had to create three of those BYO CNI clusters and keep a Dashboard UI tab open for each in parallel to get there), triggering the KubermaticAPITooManyErrors alert:

[screenshots: kubermatic-api error rate graph and firing KubermaticAPITooManyErrors alert]

Which issue(s) this PR fixes:

Fixes kubermatic/kubermatic#15801

What type of PR is this?
/kind bug

Special notes for your reviewer:

Does this PR introduce a user-facing change? Then add your Release Note here:

Cluster/machine metrics endpoints return an empty result for unavailable BYO CNI user clusters to avoid triggering KubermaticAPITooManyErrors alerts

Documentation:

NONE

Test issue:

https://github.com/kubermatic/dashboard/issues/8063

@kubermatic-bot added the following labels on May 7, 2026:

  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • test-issue/tbd: Denotes a PR that needs a test issue (change) that will be created later.
  • docs/none: Denotes a PR that doesn't need documentation (changes).
  • kind/bug: Categorizes issue or PR as related to a bug.
  • dco-signoff: yes: Denotes that all commits in the pull request have the valid DCO signoff message.
  • sig/api: Denotes a PR or issue as being assigned to SIG API.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
@mgoltzsche force-pushed the return-empty-metrics-for-unavailable-byocni-cluster branch 2 times, most recently from b686fdb to 24666c9, on May 7, 2026 at 15:58
@mgoltzsche changed the title from "Return empty metrics for unavailable BYOCNI clustr" to "Return empty metrics for unavailable BYOCNI cluster" on May 7, 2026
@mgoltzsche changed the title from "Return empty metrics for unavailable BYOCNI cluster" to "Return empty metrics for unavailable BYO CNI cluster" on May 7, 2026
@mgoltzsche force-pushed the return-empty-metrics-for-unavailable-byocni-cluster branch from 24666c9 to 247748e on May 7, 2026 at 16:04
Let the API return empty node/cluster metrics when the corresponding user cluster is unavailable and configured to use a custom CNI provider.
This is to avoid bothering KKP admins with alerts that are not actionable for them since user cluster admins need to set up CNI themselves if they disabled managed CNI for their user cluster.

Fixes kubermatic/kubermatic#15801

Signed-off-by: Max Goltzsche <max.goltzsche@kubermatic.com>
@mgoltzsche force-pushed the return-empty-metrics-for-unavailable-byocni-cluster branch from 247748e to 769fa8c on May 7, 2026 at 16:54
@KhizerRehan
Contributor

Hi, @mgoltzsche

Can you provide screenshots of the actual issue, or something we can reproduce locally or in a dev environment?


Copilot AI left a comment


Pull request overview

This PR adjusts the API’s cluster and machine deployment metrics endpoints to return empty metrics results (HTTP 200) when the user cluster is unavailable and is configured with the none CNI plugin (BYO CNI), reducing noisy/unactionable error alerts for KKP admins.

Changes:

  • Treat ServiceUnavailable from the metrics API as a non-fatal condition for BYO CNI (CNIPluginTypeNone) clusters by returning empty metrics.
  • Downgrade logging from error behavior (via HTTP error propagation) to a warning in this specific scenario.
  • Add unit tests covering the new BYO CNI + unavailable cluster behavior for both cluster metrics and machine deployment metrics endpoints.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Changed files:

  • modules/api/pkg/handler/common/machine.go: Return an empty node-metrics list (instead of an error) when metrics listing fails with ServiceUnavailable for BYO CNI clusters.
  • modules/api/pkg/handler/common/machine_test.go: Add a unit test verifying machine deployment metrics return an empty list for BYO CNI + ServiceUnavailable.
  • modules/api/pkg/handler/common/cluster.go: Return a minimal ClusterMetrics object (name only / zero values) when metrics listing fails with ServiceUnavailable for BYO CNI clusters.
  • modules/api/pkg/handler/common/cluster_test.go: Add a unit test verifying cluster metrics returns empty metrics for BYO CNI + ServiceUnavailable, plus supporting fakes.


(Outdated review comment threads on modules/api/pkg/handler/common/machine.go and modules/api/pkg/handler/common/cluster.go)
mgoltzsche and others added 2 commits May 8, 2026 09:50
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Max Goltzsche <mgoltzsche@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Max Goltzsche <mgoltzsche@users.noreply.github.com>
Member

@kron4eg kron4eg left a comment


/approve

@kubermatic-bot added the lgtm label (Indicates that a PR is ready to be merged.) on May 8, 2026
@kubermatic-bot
Contributor

LGTM label has been added.

Git tree hash: 84a101cb19d5cf1a0ad48a906e66921bad427aeb

@kubermatic-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kron4eg
Once this PR has been reviewed and has the lgtm label, please assign simontheleg for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mgoltzsche
Contributor Author

mgoltzsche commented May 8, 2026

@KhizerRehan I updated the PR description now, providing evidence of the problem (see the "Behaviour without this fix" section).

@kubermatic-bot added the test-issue/provided label (Denotes a PR that has a valid test issue reference.) and removed the test-issue/tbd label on May 8, 2026


Projects

None yet

Development

Successfully merging this pull request may close these issues.

User cluster with BYOCNI triggers the "KubermaticAPITooManyErrors" alerts.

5 participants