Mitigate Prometheus Agent WAL replay OOM crash loop (AROSLSRE-948) by janboll · Pull Request #5390 · Azure/ARO-HCP

janboll · 2026-05-26T13:20:20Z

https://redhat.atlassian.net/browse/AROSLSRE-948

What/Why

Tune remote_write queue: maxSamplesPerSend 2000→10000, capacity 2500→50000, maxShards 500→50
Add configurable purgeWAL init container to clear WAL before startup
Remove CPU limits for Prometheus to allow burst during replay
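
As a rough illustration of the queue tuning listed above, the fragment below shows where these fields sit in a prometheus-operator PrometheusAgent resource. The resource name and the remote-write URL are placeholders, not values from this PR; only the three queueConfig numbers come from the description.

```yaml
# Sketch only: metadata.name and url are hypothetical placeholders.
apiVersion: monitoring.coreos.com/v1alpha1
kind: PrometheusAgent
metadata:
  name: example-agent
spec:
  remoteWrite:
    - url: https://remote-write.example.invalid/api/v1/write
      queueConfig:
        maxSamplesPerSend: 10000  # was 2000
        capacity: 50000           # was 2500
        maxShards: 50             # was 500
```

With these values the per-shard capacity is 5x maxSamplesPerSend, which falls inside the 3x to 10x range that the Prometheus remote-write tuning guidance suggests.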

To promote this change, enable purgeWAL in the integration and stage environments.
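
A minimal sketch of what enabling the toggle might look like as Helm values, assuming the chart reads it from prometheusSpec.purgeWAL as the review summary describes. Where this override lives for each environment is an assumption; the PR itself only shows the default being set to false.

```yaml
# Hypothetical per-environment values override; only the key name
# prometheusSpec.purgeWAL is taken from this PR.
prometheusSpec:
  purgeWAL: true  # the default introduced by this PR is false
```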

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci · 2026-05-26T13:20:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janboll

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~config/OWNERS~~ [janboll]
~~dev-infrastructure/OWNERS~~ [janboll]
~~observability/OWNERS~~ [janboll]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR targets reducing PrometheusAgent crash loops during WAL replay by loosening CPU throttling, adding an optional WAL purge initContainer, and adjusting remote_write queue settings.

Changes:

Removed Prometheus CPU limits in svc/mgmt/opstool values and in the PrometheusAgent template rendering.
Added a new prometheusSpec.purgeWAL boolean (config + schema + rendered configs) that conditionally injects an initContainer to delete the WAL before startup.
Updated remote_write queueConfig defaults in the PrometheusAgent template and refreshed Helm test/rendered fixtures to match.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| observability/prometheus/values-svc.yaml | Adds `purgeWAL` and removes CPU limit from svc Prometheus resources. |
| observability/prometheus/values-opstool.yaml | Removes CPU limit from opstool Prometheus resources. |
| observability/prometheus/values-mgmt.yaml | Adds `purgeWAL` and removes CPU limit from mgmt Prometheus resources. |
| observability/prometheus/deploy/templates/prometheus.yaml | Stops rendering CPU limits, adds conditional WAL purge initContainer, adjusts remote_write queueConfig. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_svc_resources.yaml | Updates expected rendered output for svc (no CPU limit, new queueConfig values). |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_svc_resources_unset.yaml | Updates expected rendered output for svc (queueConfig values). |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources.yaml | Updates expected rendered output for mgmt (no CPU limit, new queueConfig values). |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources_unset.yaml | Updates expected rendered output for mgmt (queueConfig values). |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_svc_1_arohcp_monitor.yaml | Refreshes rendered dev svc fixture for updated queueConfig. |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_arohcp_monitor.yaml | Refreshes rendered dev mgmt fixture for updated queueConfig. |
| config/rendered/dev/prow/westus3.yaml | Adds `purgeWAL: false` and removes CPU limit from rendered dev config. |
| config/rendered/dev/pers/westus3.yaml | Adds `purgeWAL: false` and removes CPU limit from rendered dev config. |
| config/rendered/dev/perf/westus3.yaml | Adds `purgeWAL: false` and removes CPU limit from rendered dev config. |
| config/rendered/dev/dev/westus3.yaml | Adds `purgeWAL: false` and removes CPU limit from rendered dev config. |
| config/rendered/dev/cspr/westus3.yaml | Adds `purgeWAL: false` and removes CPU limit from rendered dev config. |
| config/rendered/dev/ci01/westus3.yaml | Adds `purgeWAL: false` and removes CPU limit from rendered dev config. |
| config/config.yaml | Removes default CPU limit setting and introduces default `purgeWAL: false` for svc/mgmt. |
| config/config.schema.json | Extends schema to include the new `prometheusSpec.purgeWAL` boolean. |

+            "purgeWAL": {
+              "type": "boolean"
            }


Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

hbhushan3 · 2026-05-26T20:52:50Z

+  initContainers:
+  - name: purge-wal
+    image: '{{ .Values.prometheusSpec.image.registry }}/{{ .Values.prometheusSpec.image.repository }}@sha256:{{ .Values.prometheusSpec.image.sha }}'
+    command: ["sh", "-c", "rm -rf /prometheus/wal && mkdir -p /prometheus/wal && echo 'WAL purged'"]


When this toggle is true, any node drain, image bump, or rolling restart silently drops all buffered samples. Was a one-shot Job or manual procedure considered instead of a permanent StatefulSet init container?

hbhushan3 · 2026-05-26T20:58:57Z

 {{- $none := "NONE" -}}
 {{- $setRequests := or (ne (.Values.prometheusSpec.resources.requests.cpu | toString) $none) (ne (.Values.prometheusSpec.resources.requests.memory | toString) $none) -}}
-{{- $setLimits := or (ne (.Values.prometheusSpec.resources.limits.cpu | toString) $none) (ne (.Values.prometheusSpec.resources.limits.memory | toString) $none) -}}
+{{- $setLimits := ne (.Values.prometheusSpec.resources.limits.memory | toString) $none -}}


Why are we removing the CPU limit? Mgmt-plane Prometheus now shares nodes with HCP/CAPI controllers with no CPU ceiling. Is there a PriorityClass or dedicated nodepool bounding the blast radius during replay, or is the assumption that memory pressure will hit first?
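
To make the template change under discussion concrete: with `$setLimits` now depending only on the memory limit, the rendered container resources would look roughly like the fragment below when a memory limit is configured. All quantities are placeholders chosen for illustration, not values from this PR.

```yaml
# Illustrative rendered output; every quantity here is a hypothetical placeholder.
resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    memory: 4Gi  # memory limit is still rendered; there is no cpu key under limits
```

A container without a CPU limit cannot be in the Guaranteed QoS class, so a pod rendered this way would be Burstable and bounded on CPU only by node capacity and scheduler-level controls.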

Copilot AI review requested due to automatic review settings May 26, 2026 13:20

openshift-ci Bot requested review from hbhushan3 and stevekuznetsov May 26, 2026 13:20

openshift-ci Bot added the approved label May 26, 2026

Copilot started reviewing on behalf of janboll May 26, 2026 13:21

Copilot AI reviewed May 26, 2026

Comment thread observability/prometheus/deploy/templates/prometheus.yaml

Comment thread config/config.schema.json

Comment on lines +782 to 784

"purgeWAL": {
  "type": "boolean"
}

Review remarks

6950b14

Copilot AI review requested due to automatic review settings May 26, 2026 13:34

Copilot started reviewing on behalf of janboll May 26, 2026 13:35

Copilot AI reviewed May 26, 2026

Comment thread observability/prometheus/deploy/templates/prometheus.yaml Outdated

Delete entire dir to make it error proof; copilot suggestion

fce222b

hbhushan3 reviewed May 26, 2026

janboll wants to merge 3 commits into main from mitigate-wal-replay-issues

3 participants
