Mitigate Prometheus Agent WAL replay OOM crash loop (AROSLSRE-948) #5390
janboll wants to merge 3 commits into
- Tune remote_write queue: maxSamplesPerSend 2000→10000, capacity 2500→50000, maxShards 500→50
- Add configurable purgeWAL init container to clear WAL before startup
- Remove CPU limits for Prometheus to allow burst during replay

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janboll
Pull request overview
This PR targets reducing PrometheusAgent crash loops during WAL replay by loosening CPU throttling, adding an optional WAL purge initContainer, and adjusting remote_write queue settings.
Changes:
- Removed Prometheus CPU limits in svc/mgmt/opstool values and in the PrometheusAgent template rendering.
- Added a new `prometheusSpec.purgeWAL` boolean (config + schema + rendered configs) that conditionally injects an initContainer to delete the WAL before startup.
- Updated remote_write `queueConfig` defaults in the PrometheusAgent template and refreshed Helm test/rendered fixtures to match.
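
For orientation, a minimal sketch of what the tuned queue settings might look like on a PrometheusAgent `remoteWrite` entry. Only the three `queueConfig` values (and their previous values in the comments) come from this PR; the `url` is a hypothetical placeholder and the surrounding layout is an assumption about the template.

```yaml
# Sketch only: field names follow the prometheus-operator QueueConfig schema.
remoteWrite:
  - url: https://remote-write.example.invalid/api/v1/write  # hypothetical endpoint
    queueConfig:
      maxSamplesPerSend: 10000  # was 2000
      capacity: 50000           # was 2500; samples buffered per shard
      maxShards: 50             # was 500
```

Because `capacity` applies per shard, the rough ceiling on queued samples is `maxShards × capacity`: 500 × 2500 = 1,250,000 before and 50 × 50000 = 2,500,000 after.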
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| observability/prometheus/values-svc.yaml | Adds purgeWAL and removes CPU limit from svc Prometheus resources. |
| observability/prometheus/values-opstool.yaml | Removes CPU limit from opstool Prometheus resources. |
| observability/prometheus/values-mgmt.yaml | Adds purgeWAL and removes CPU limit from mgmt Prometheus resources. |
| observability/prometheus/deploy/templates/prometheus.yaml | Stops rendering CPU limits, adds conditional WAL purge initContainer, adjusts remote_write queueConfig. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_svc_resources.yaml | Updates expected rendered output for svc (no CPU limit, new queueConfig values). |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_svc_resources_unset.yaml | Updates expected rendered output for svc (queueConfig values). |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources.yaml | Updates expected rendered output for mgmt (no CPU limit, new queueConfig values). |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources_unset.yaml | Updates expected rendered output for mgmt (queueConfig values). |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_svc_1_arohcp_monitor.yaml | Refreshes rendered dev svc fixture for updated queueConfig. |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_arohcp_monitor.yaml | Refreshes rendered dev mgmt fixture for updated queueConfig. |
| config/rendered/dev/prow/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/rendered/dev/pers/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/rendered/dev/perf/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/rendered/dev/dev/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/rendered/dev/cspr/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/rendered/dev/ci01/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/config.yaml | Removes default CPU limit setting and introduces default purgeWAL: false for svc/mgmt. |
| config/config.schema.json | Extends schema to include the new prometheusSpec.purgeWAL boolean. |

config/config.schema.json

    "purgeWAL": {
      "type": "boolean"
    }


observability/prometheus/deploy/templates/prometheus.yaml

    initContainers:
      - name: purge-wal
        image: '{{ .Values.prometheusSpec.image.registry }}/{{ .Values.prometheusSpec.image.repository }}@sha256:{{ .Values.prometheusSpec.image.sha }}'
        command: ["sh", "-c", "rm -rf /prometheus/wal && mkdir -p /prometheus/wal && echo 'WAL purged'"]

When this toggle is true, any node drain, image bump, or rolling restart silently drops all buffered samples. Was a one-shot Job or manual procedure considered instead of a permanent StatefulSet init container?
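
For reference, a minimal sketch of how the toggle presumably gates the init container in the Helm template. The `if` guard on `prometheusSpec.purgeWAL` follows the PR overview; the `volumeMounts` entry and the volume name `prometheus-data` are assumptions added for illustration and do not appear in the diff excerpt.

```yaml
# Sketch only: rendered only when prometheusSpec.purgeWAL is true.
{{- if .Values.prometheusSpec.purgeWAL }}
initContainers:
  - name: purge-wal
    image: '{{ .Values.prometheusSpec.image.registry }}/{{ .Values.prometheusSpec.image.repository }}@sha256:{{ .Values.prometheusSpec.image.sha }}'
    command: ["sh", "-c", "rm -rf /prometheus/wal && mkdir -p /prometheus/wal && echo 'WAL purged'"]
    volumeMounts:
      - name: prometheus-data   # hypothetical; must match the volume mounted at /prometheus
        mountPath: /prometheus
{{- end }}
```

While the value stays `true`, the init container runs on every pod start, so the purge is not limited to the first restart after the rollout.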

observability/prometheus/deploy/templates/prometheus.yaml

     {{- $none := "NONE" -}}
     {{- $setRequests := or (ne (.Values.prometheusSpec.resources.requests.cpu | toString) $none) (ne (.Values.prometheusSpec.resources.requests.memory | toString) $none) -}}
    -{{- $setLimits := or (ne (.Values.prometheusSpec.resources.limits.cpu | toString) $none) (ne (.Values.prometheusSpec.resources.limits.memory | toString) $none) -}}
    +{{- $setLimits := ne (.Values.prometheusSpec.resources.limits.memory | toString) $none -}}

Why are we removing the CPU limit? Mgmt-plane Prometheus now shares nodes with HCP/CAPI controllers with no CPU ceiling. Is there a PriorityClass or dedicated nodepool bounding the blast radius during replay, or is the assumption that memory pressure will hit first?
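
To make the question concrete, a minimal sketch of values that would exercise the new `$setLimits` expression. The key paths and the `NONE` sentinel come from the template above; every quantity is a hypothetical placeholder, not a value from this PR.

```yaml
# Sketch only: quantities are illustrative.
prometheusSpec:
  resources:
    requests:
      cpu: "2"       # hypothetical; still rendered via $setRequests
      memory: 8Gi    # hypothetical
    limits:
      memory: 8Gi    # hypothetical; the only limit $setLimits now consults
                     # setting this to NONE would drop the limits block entirely
```

With a CPU request and no CPU limit, Kubernetes divides contended CPU in proportion to requests, so the request is what bounds Prometheus relative to its neighbours during replay.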
https://redhat.atlassian.net/browse/AROSLSRE-948
What/Why
To promote this change, enable `purgeWAL` in integration and stage.