Skip to content

backend/alerts: add Node Pool Management Prometheus alert rules (ARO-25967)#5112

Open
shubhadapaithankar wants to merge 1 commit into
Azure:mainfrom
shubhadapaithankar:aro-25967-nodepool-alerting
Open

backend/alerts: add Node Pool Management Prometheus alert rules (ARO-25967)#5112
shubhadapaithankar wants to merge 1 commit into
Azure:mainfrom
shubhadapaithankar:aro-25967-nodepool-alerting

Conversation

@shubhadapaithankar
Copy link
Copy Markdown
Contributor

@shubhadapaithankar shubhadapaithankar commented May 4, 2026

Summary

Adds Prometheus alerting rules for node pool management operations as part of ARO-25967.

Changes

This PR implements two alert rules for backend node pool operations:

  1. BackendNodePoolOperationErrorRate — Fires when error rate > 5% over 1 hour

    • Monitors backend_failed_operations_total{type="poll_node_pool"} / backend_operations_total{type="poll_node_pool"}
    • Threshold: 5% error rate sustained for 5 minutes
  2. BackendNodePoolPollLatency — Fires when P95 latency > 2 seconds over 1 hour

    • Monitors backend_operations_duration_seconds_bucket{type="poll_node_pool"}
    • Measures backend poll cycle latency (not end-to-end VM scaling time)
    • Threshold: P95 > 2 seconds sustained for 15 minutes

Both alerts include:

  • Runbook URL pointing to backend TSG
  • Metric source documentation in annotations
  • Promtool unit tests with multi-pod scenarios

Testing

  • ✅ Promtool tests pass for both alerts
  • ✅ Bicep templates regenerated via make -C observability alerts
  • ✅ Alert configuration synced with latest main

Related

Copilot AI review requested due to automatic review settings May 4, 2026 18:16
@openshift-ci openshift-ci Bot requested review from janboll and mmazur May 4, 2026 18:16
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 4, 2026

Hi @shubhadapaithankar. Thanks for your PR.

I'm waiting for a Azure member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds a new backend Prometheus rule file and corresponding promtool tests for node pool management alerts, then registers the rule file in observability so it is deployed.

Changes:

  • Adds four backend alerts intended to cover node pool resize, scaling duration, adjust, and delete workflows.
  • Adds promtool test cases for fire/no-fire behavior for each new alert.
  • Registers the new backend alert rule file in observability/observability.yaml.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 14 comments.

File Description
observability/observability.yaml Registers the new node pool alert rules for deployment.
backend/alerts/nodepool-prometheusRule.yaml Defines the four new Prometheus alert rules for node pool management.
backend/alerts/nodepool-prometheusRule_test.yaml Adds promtool tests for the new node pool alert rules.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule_test.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule_test.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule_test.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule_test.yaml Outdated
Copilot AI review requested due to automatic review settings May 4, 2026 18:41
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 16 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule_test.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Copilot AI review requested due to automatic review settings May 4, 2026 21:19
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment thread observability/observability.yaml Outdated
Comment thread observability/observability.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Copilot AI review requested due to automatic review settings May 4, 2026 23:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings May 4, 2026 23:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread observability/observability.yaml
Comment thread dev-infrastructure/modules/metrics/rules/generatedPrometheusAlertingRules.bicep Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule_test.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule_test.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule_test.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule_test.yaml Outdated
shubhadapaithankar added a commit to shubhadapaithankar/ARO-HCP that referenced this pull request May 4, 2026
Replace 'TBD' placeholder with actual backend TSG runbook URL:
https://eng.ms/docs/.../hcp/troubleshooting/backend-tsg.html

This follows the ARO HCP Alerting Recommendation pattern used by
other backend and frontend alerts.

Addresses review feedback from Simon on PR Azure#5112.
@shubhadapaithankar shubhadapaithankar force-pushed the aro-25967-nodepool-alerting branch from 45f8650 to 511c8b0 Compare May 5, 2026 16:03
Copilot AI review requested due to automatic review settings May 5, 2026 16:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread dev-infrastructure/modules/metrics/rules/generatedPrometheusAlertingRules.bicep Outdated
Comment thread dev-infrastructure/modules/metrics/rules/generatedPrometheusAlertingRules.bicep Outdated
Comment thread dev-infrastructure/modules/metrics/rules/generatedPrometheusAlertingRules.bicep Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
shubhadapaithankar added a commit to shubhadapaithankar/ARO-HCP that referenced this pull request May 5, 2026
…oolPollLatency

The previous name 'ScalingDuration' was misleading - it suggests measuring
end-to-end VM scaling time, but the alert actually measures backend poll
cycle latency.

Renamed to 'PollLatency' for clarity. Also updated description from
'duration' to 'latency' for consistency.

Addresses Copilot review feedback on PR Azure#5112.
@shubhadapaithankar
Copy link
Copy Markdown
Contributor Author

shubhadapaithankar commented May 5, 2026

All review feedback addressed

✅ Fixed issues:

  1. Rebased on latest main - No longer dropping maestro, kubeContainerOOM, kubeNode, etc. alerts (commit 511c8b0)

  2. Updated runbook URLs - Replaced 'TBD' with actual backend TSG URL matching other backend alerts (commit 45f8650352e903)

  3. Renamed misleading alert - Changed BackendNodePoolScalingDurationBackendNodePoolPollLatency to clarify this measures backend poll cycles, not end-to-end VM scaling time (commit 352e903)

  4. Updated PR description - Now correctly states 2 alerts (not 4) and reflects the current implementation

  5. Added metric source documentation - All alerts include "Metrics emitted from backend/operations_scanner.go" in annotations

Current state:

  • 2 alerts: BackendNodePoolOperationErrorRate and BackendNodePoolPollLatency
  • Tests: Full promtool coverage for both alerts
  • Bicep: Regenerated and synced with main
  • observability.yaml: Only adds nodepool registration, no unrelated changes

Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread backend/alerts/nodepool-prometheusRule.yaml Outdated
Comment thread observability/observability.yaml Outdated
@swiencki
Copy link
Copy Markdown
Collaborator

swiencki commented May 5, 2026

The regenerated Bicep changes 16 existing alerts from severity 4 to severity 3 (frontend, service-tag-capacity, etc.). This is a generator behavior change, not related to node pool alerts. Please separate this from the PR or confirm with the alert owners that the severity bump is intentional.

@swiencki
Copy link
Copy Markdown
Collaborator

swiencki commented May 5, 2026

Heads up: ADR-001 now includes a standard alert naming convention. User journey alerts should follow {Scope}{Subject}{Metric}{BurnRateTier}, e.g. UJNodePoolErrors1h5m. See the "Alert naming convention" section in the ADR for the full spec and examples.

Copilot AI review requested due to automatic review settings May 5, 2026 20:47
@shubhadapaithankar
Copy link
Copy Markdown
Contributor Author

shubhadapaithankar commented May 5, 2026

Simon's feedback addressed - Metrics now implemented

Thanks @swiencki for the thorough review! You're absolutely right - the metrics didn't exist. I've now added them to this PR.

✅ Metrics Implementation (commits f546f9a + 8c94a20)

Created backend/pkg/controllers/controllerutils/operations_metrics.go which registers:

  • backend_operations_total{type="poll_node_pool"} - Counter of total operations
  • backend_failed_operations_total{type="poll_node_pool"} - Counter of failed operations
  • backend_operations_duration_seconds{type="poll_node_pool"} - Histogram of operation latency

Instrumented backend/pkg/controllers/nodepoolpropertiescontroller/node_pool_properties_sync.go:

  • Tracks each node pool sync operation
  • Records success/failure based on database and cluster service errors
  • Measures duration for latency monitoring

The alerts now reference real metrics that are actually emitted by the backend.

🔧 Alert Pattern Question

Regarding observability-rp.yaml and multi-window multi-burn-rate patterns:

These are operational infrastructure alerts (backend poll cycle health), not user journey SLO alerts. They're similar to existing backend alerts like BackendControllerQueueDepthHigh and BackendControllerPanic - simple threshold alerts for operational issues.

The Jira structure has ARO-25943 "Define SLIs/SLOs" as a separate ticket. Once we define node pool SLOs there, we can add multi-window burn-rate alerts following the cluster-service pattern.

For now, these simple alerts catch:

  • Backend failing to sync node pool state (error rate > 5%)
  • Backend poll operations getting slow (P95 > 2s)

Question: Is it acceptable to keep these as simple operational alerts in observability.yaml (alongside other backend alerts), and add SLO-based alerts later when we define the SLOs in ARO-25943?

📝 TSG Documentation

Node pool-specific troubleshooting content will be added in ARO-25945 (TSGs subtask) with escalation paths documented per ADR-001.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dev-infrastructure/modules/metrics/rules/generatedPrometheusAlertingRules.bicep Outdated
Comment thread dev-infrastructure/modules/metrics/rules/generatedPrometheusAlertingRules.bicep Outdated
Comment thread dev-infrastructure/modules/metrics/rules/generatedPrometheusAlertingRules.bicep Outdated
Co-authored-by: Cursor <cursoragent@cursor.com>
@shubhadapaithankar shubhadapaithankar force-pushed the aro-25967-nodepool-alerting branch from 8c94a20 to bb9a787 Compare May 30, 2026 00:53
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shubhadapaithankar
Once this PR has been reviewed and has the lgtm label, please assign weherdh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants