Return empty metrics for unavailable BYO CNI cluster #8060
mgoltzsche wants to merge 3 commits into main
Conversation
Force-pushed from b686fdb to 24666c9
Force-pushed from 24666c9 to 247748e
Let the API return empty node/cluster metrics when the corresponding user cluster is unavailable and configured to use a custom CNI provider. This avoids bothering KKP admins with alerts that are not actionable for them, since user cluster admins need to set up CNI themselves if they disabled managed CNI for their user cluster. Fixes kubermatic/kubermatic#15801 Signed-off-by: Max Goltzsche <max.goltzsche@kubermatic.com>
Force-pushed from 247748e to 769fa8c
Hi @mgoltzsche, can you provide screenshots showing the actual issue, or something we can reproduce locally or in the dev environment?
Pull request overview
This PR adjusts the API’s cluster and machine deployment metrics endpoints to return empty metrics results (HTTP 200) when the user cluster is unavailable and is configured with the none CNI plugin (BYO CNI), reducing noisy/unactionable error alerts for KKP admins.
Changes:
- Treat `ServiceUnavailable` from the metrics API as a non-fatal condition for BYO CNI (`CNIPluginTypeNone`) clusters by returning empty metrics.
- Downgrade logging from error behavior (via HTTP error propagation) to a warning in this specific scenario.
- Add unit tests covering the new BYO CNI + unavailable cluster behavior for both cluster metrics and machine deployment metrics endpoints.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| modules/api/pkg/handler/common/machine.go | Return an empty node-metrics list (instead of an error) when metrics listing fails with ServiceUnavailable for BYO CNI clusters. |
| modules/api/pkg/handler/common/machine_test.go | Add a unit test verifying machine deployment metrics return an empty list for BYO CNI + ServiceUnavailable. |
| modules/api/pkg/handler/common/cluster.go | Return a minimal ClusterMetrics object (name only / zero values) when metrics listing fails with ServiceUnavailable for BYO CNI clusters. |
| modules/api/pkg/handler/common/cluster_test.go | Add a unit test verifying cluster metrics returns empty metrics for BYO CNI + ServiceUnavailable, plus supporting fakes. |
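As a rough illustration of the scenario these tests cover (not the PR's actual test code), here is a minimal sketch exercising the hypothetical `isUnavailableBYOCNICluster` helper sketched further down in the PR description, assuming the `k8c.io/kubermatic/v2` API types:

```go
package common

import (
	"testing"

	apierrors "k8s.io/apimachinery/pkg/api/errors"

	kubermaticv1 "k8c.io/kubermatic/v2/pkg/apis/kubermatic/v1" // assumed import path
)

func TestIsUnavailableBYOCNICluster(t *testing.T) {
	// The error the metrics API returns while the user cluster is down.
	unavailable := apierrors.NewServiceUnavailable("the server is currently unable to handle the request")

	byoCNI := &kubermaticv1.Cluster{Spec: kubermaticv1.ClusterSpec{
		CNIPlugin: &kubermaticv1.CNIPluginSettings{Type: kubermaticv1.CNIPluginTypeNone},
	}}
	managedCNI := &kubermaticv1.Cluster{Spec: kubermaticv1.ClusterSpec{
		CNIPlugin: &kubermaticv1.CNIPluginSettings{Type: kubermaticv1.CNIPluginTypeCilium},
	}}

	if !isUnavailableBYOCNICluster(unavailable, byoCNI) {
		t.Error("expected 503 + CNI plugin 'none' to be treated as non-fatal")
	}
	if isUnavailableBYOCNICluster(unavailable, managedCNI) {
		t.Error("expected 503 on a managed-CNI cluster to remain an error")
	}
}
```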
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Max Goltzsche <mgoltzsche@users.noreply.github.com>
LGTM label has been added. Git tree hash: 84a101cb19d5cf1a0ad48a906e66921bad427aeb
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: kron4eg
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
@KhizerRehan I updated the PR description now, providing evidence of the problem (see the "Behaviour without this fix" section). |
What this PR does / why we need it:
Let the API return empty node/cluster metrics when the corresponding user cluster is unavailable and configured to use a custom CNI provider.
This avoids bothering KKP admins with alerts that are not actionable for them, since user cluster admins need to set up CNI themselves if they disabled managed CNI for their user cluster.
In case of an unavailable user cluster that has the `none` CNI plugin configured, the metrics endpoints behave as follows:

- The cluster metrics endpoint returns a minimal `ClusterMetrics` object (name only, with every other field at its `0` default value).
- The machine deployment node metrics endpoint returns an empty list.

Also, the metrics endpoints don't log an error for every request anymore in that case but a warning (see the sketch below).
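A minimal sketch of the guard described above, assuming the `k8c.io/kubermatic/v2` import path and a hypothetical `isUnavailableBYOCNICluster` helper name; the actual changes live in `modules/api/pkg/handler/common/machine.go` and `cluster.go`:

```go
package common

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"

	kubermaticv1 "k8c.io/kubermatic/v2/pkg/apis/kubermatic/v1" // assumed import path
)

// isUnavailableBYOCNICluster (hypothetical name) reports whether a failed
// metrics request should be answered with empty metrics instead of an error:
// the metrics API returned 503 and the cluster owner opted out of managed CNI.
func isUnavailableBYOCNICluster(err error, cluster *kubermaticv1.Cluster) bool {
	return apierrors.IsServiceUnavailable(err) &&
		cluster.Spec.CNIPlugin != nil &&
		cluster.Spec.CNIPlugin.Type == kubermaticv1.CNIPluginTypeNone
}
```

At the call sites, the handlers would then log a warning and return an empty `ClusterMetrics` object or node-metrics list instead of propagating the HTTP error.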
Behaviour without this fix
Without this fix, the Dashboard API server continuously logs errors like the following as soon as there is a user cluster with CNI set to `none` and somebody browses the detail view of that cluster within the KKP Dashboard:

KKP Dashboard API server error log
```
$ kubectl -n kubermatic logs kubermatic-api-86dd4c47dd-5fgxm | grep "the server is currently unable to handle the request"
{"level":"error","time":"2026-05-07T14:50:44.661Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/machinedeployments/k8c-fba7bc-4-e3af0284/nodes/metrics"}
{"level":"error","time":"2026-05-07T14:50:59.642Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:09.599Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:29.831Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/v79gv9jx5s/metrics"}
{"level":"error","time":"2026-05-07T14:51:35.626Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:39.720Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-07T14:51:49.926Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/e2e-j9f44/clusters/k8c-fba7bc-412b0b-1-35-2/metrics"}
{"level":"error","time":"2026-05-08T07:57:07.800Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:57:22.986Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:03.505Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:16.844Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:23.055Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:34.974Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:58:51.464Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:59:06.994Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:59:37.974Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T07:59:50.537Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T08:00:23.350Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
{"level":"error","time":"2026-05-08T08:00:54.005Z","caller":"handler/routing.go:152","msg":"the server is currently unable to handle the request","request":"/api/v2/projects/qtwrpljtkm/clusters/gscmldh9c2/metrics"}
```

Correspondingly, there is an error notification shown within the Dashboard UI periodically:

As a consequence of the Dashboard API server's request errors, the kubermatic-api error rate eventually exceeds the 0.1 threshold (I had to create 3 of those BYO CNI clusters and keep a Dashboard UI tab open for each in parallel to get there), triggering the KubermaticAPITooManyErrors alert:


Which issue(s) this PR fixes:
Fixes kubermatic/kubermatic#15801
What type of PR is this?
/kind bug
Special notes for your reviewer:
Does this PR introduce a user-facing change? Then add your Release Note here:
Documentation:
Test issue: