backend/alerts: add Node Pool Management Prometheus alert rules (ARO-25967) #5112
shubhadapaithankar wants to merge 1 commit into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds a new backend Prometheus rule file and corresponding promtool tests for node pool management alerts, then registers the rule file in observability so it is deployed.
Changes:
- Adds four backend alerts intended to cover node pool resize, scaling duration, adjust, and delete workflows.
- Adds promtool test cases for fire/no-fire behavior for each new alert.
- Registers the new backend alert rule file in `observability/observability.yaml`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| `observability/observability.yaml` | Registers the new node pool alert rules for deployment. |
| `backend/alerts/nodepool-prometheusRule.yaml` | Defines the four new Prometheus alert rules for node pool management. |
| `backend/alerts/nodepool-prometheusRule_test.yaml` | Adds promtool tests for the new node pool alert rules. |
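For readers unfamiliar with promtool unit tests, the fire/no-fire cases mentioned in this review could look roughly like the sketch below. This is not the PR's actual test file. The rule file name `nodepool-rules.yaml`, the `for: 5m` hold, the `severity` value, and the exact shape of the expression (a ratio of 1h rates aggregated with `sum`, compared against 0.05) are assumptions made for illustration; only the alert name and metric names come from the PR summary.

```yaml
# Sketch of a promtool unit test; run with: promtool test rules <this file>
# Assumes a plain Prometheus rules file (top-level `groups:`) named
# nodepool-rules.yaml (hypothetical) that defines
# BackendNodePoolOperationErrorRate as
#   sum(rate(failed[1h])) / sum(rate(total[1h])) > 0.05, with `for: 5m`.
rule_files:
  - nodepool-rules.yaml

evaluation_interval: 1m

tests:
  # Fire case: 10% of poll_node_pool operations fail for two hours.
  - interval: 1m
    input_series:
      - series: 'backend_operations_total{type="poll_node_pool"}'
        values: '0+10x120'   # counter grows by 10 per minute
      - series: 'backend_failed_operations_total{type="poll_node_pool"}'
        values: '0+1x120'    # counter grows by 1 per minute
    alert_rule_test:
      - eval_time: 90m
        alertname: BackendNodePoolOperationErrorRate
        exp_alerts:
          # Labels and annotations must mirror exactly what the rule sets.
          # If the rule defines annotations, list them under exp_annotations.
          - exp_labels:
              severity: "3"  # assumed label value

  # No-fire case: operations run but none fail.
  - interval: 1m
    input_series:
      - series: 'backend_operations_total{type="poll_node_pool"}'
        values: '0+10x120'
      - series: 'backend_failed_operations_total{type="poll_node_pool"}'
        values: '0+0x120'    # counter stays at 0
    alert_rule_test:
      - eval_time: 90m
        alertname: BackendNodePoolOperationErrorRate
        exp_alerts: []       # no alert expected
```

Pairing each alert with one firing and one non-firing input keeps the threshold honest: a rule that never fires, or always fires, fails one of the two cases.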
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 16 comments.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
Replace 'TBD' placeholder with actual backend TSG runbook URL: https://eng.ms/docs/.../hcp/troubleshooting/backend-tsg.html
This follows the ARO HCP Alerting Recommendation pattern used by other backend and frontend alerts.
Addresses review feedback from Simon on PR Azure#5112.
45f8650 to 511c8b0
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
…oolPollLatency
The previous name 'ScalingDuration' was misleading: it suggests measuring end-to-end VM scaling time, but the alert actually measures backend poll cycle latency. Renamed to 'PollLatency' for clarity. Also updated the description from 'duration' to 'latency' for consistency.
Addresses Copilot review feedback on PR Azure#5112.
All review feedback addressed
✅ Fixed issues:
Current state:
The regenerated Bicep changes 16 existing alerts from severity 4 to severity 3 (frontend, service-tag-capacity, etc.). This is a generator behavior change, not related to node pool alerts. Please separate this from the PR or confirm with the alert owners that the severity bump is intentional.
Heads up: ADR-001 now includes a standard alert naming convention. User journey alerts should follow
Simon's feedback addressed - Metrics now implemented
Thanks @swiencki for the thorough review! You're absolutely right - the metrics didn't exist. I've now added them to this PR.
✅ Metrics Implementation (commits f546f9a + 8c94a20)
Created `backend/pkg/controllers/controllerutils/operations_metrics.go` which registers:
Instrumented
The alerts now reference real metrics that are actually emitted by the backend.
🔧 Alert Pattern Question
Regarding `observability-rp.yaml` and multi-window multi-burn-rate patterns: These are operational infrastructure alerts (backend poll cycle health), not user journey SLO alerts. They're similar to existing backend alerts.
The Jira structure has ARO-25943 "Define SLIs/SLOs" as a separate ticket. Once we define node pool SLOs there, we can add multi-window burn-rate alerts following the cluster-service pattern.
For now, these simple alerts catch:
Question: Is it acceptable to keep these as simple operational alerts in `observability.yaml` (alongside other backend alerts), and add SLO-based alerts later when we define the SLOs in ARO-25943?
📝 TSG Documentation
Node pool-specific troubleshooting content will be added in ARO-25945 (TSGs subtask) with escalation paths documented per ADR-001.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Co-authored-by: Cursor <cursoragent@cursor.com>
8c94a20 to bb9a787
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: shubhadapaithankar
The full list of commands accepted by this bot can be found here.
Summary
Adds Prometheus alerting rules for node pool management operations as part of ARO-25967.
Changes
This PR implements two alert rules for backend node pool operations:
- BackendNodePoolOperationErrorRate: fires when error rate > 5% over 1 hour
  `backend_failed_operations_total{type="poll_node_pool"} / backend_operations_total{type="poll_node_pool"}`
- BackendNodePoolPollLatency: fires when P95 latency > 2 seconds over 1 hour
  `backend_operations_duration_seconds_bucket{type="poll_node_pool"}`

Both alerts include:
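
As a rough illustration of how the two rules summarized above might be written, here is a minimal sketch. It is not copied from the PR. The `PrometheusRule` wrapper is inferred from the file name `nodepool-prometheusRule.yaml`, and the resource name, group name, `for` durations, `severity` values, annotation text, and the use of `rate()`, `sum`, and `histogram_quantile` are all assumptions. Only the alert names, metric names, label selector, and thresholds come from the description.

```yaml
# Illustrative sketch only; names and values marked "assumed" are not from the PR.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backend-nodepool-alerts          # assumed resource name
spec:
  groups:
    - name: backend-nodepool             # assumed group name
      rules:
        # Error ratio of poll_node_pool operations over the last hour.
        - alert: BackendNodePoolOperationErrorRate
          expr: |
            sum(rate(backend_failed_operations_total{type="poll_node_pool"}[1h]))
              /
            sum(rate(backend_operations_total{type="poll_node_pool"}[1h]))
              > 0.05
          for: 5m                        # assumed hold duration
          labels:
            severity: "3"                # assumed severity
          annotations:
            summary: Node pool poll operations are failing above 5%.  # assumed text

        # 95th percentile poll latency over the last hour, from histogram buckets.
        - alert: BackendNodePoolPollLatency
          expr: |
            histogram_quantile(
              0.95,
              sum by (le) (
                rate(backend_operations_duration_seconds_bucket{type="poll_node_pool"}[1h])
              )
            ) > 2
          for: 5m                        # assumed hold duration
          labels:
            severity: "3"                # assumed severity
          annotations:
            summary: Node pool poll P95 latency is above 2 seconds.   # assumed text
```

When no operations ran in the window, the ratio's denominator is zero and the expression yields NaN, which does not satisfy `> 0.05`, so the error-rate alert stays silent rather than firing on an idle backend.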
Testing
`make -C observability alerts`
Related