Mitigate Prometheus Agent WAL replay OOM crash loop (AROSLSRE-948) #5390

Open: janboll wants to merge 3 commits into main from mitigate-wal-replay-issues

janboll commented May 26, 2026

https://redhat.atlassian.net/browse/AROSLSRE-948

What/Why

  • Tune remote_write queue: maxSamplesPerSend 2000→10000, capacity 2500→50000, maxShards 500→50
  • Add configurable purgeWAL init container to clear WAL before startup
  • Remove CPU limits for Prometheus to allow burst during replay
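
As a rough illustration of the first bullet, the tuned queue settings might render like this in a Prometheus Operator `PrometheusAgent` resource. The `queueConfig` field names come from the Prometheus Operator API; the resource name, namespace, and remote-write URL are placeholders and are not taken from this PR.

```yaml
# Sketch only: metadata and url are hypothetical placeholders.
apiVersion: monitoring.coreos.com/v1alpha1
kind: PrometheusAgent
metadata:
  name: example-agent        # hypothetical
  namespace: monitoring      # hypothetical
spec:
  remoteWrite:
    - url: https://remote-write.example.invalid/api/v1/write  # hypothetical endpoint
      queueConfig:
        maxSamplesPerSend: 10000  # was 2000
        capacity: 50000           # was 2500; samples buffered per shard
        maxShards: 50             # was 500
```

With these values `capacity` is 5x `maxSamplesPerSend`, which sits inside the roughly 3x to 10x range suggested by the Prometheus remote-write tuning guidance; the previous pair was 1.25x.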

To promote this change, enable purgeWAL in integration and stage.
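
A minimal sketch of what that per-environment override could look like. Only the `prometheusSpec.purgeWAL` key is taken from this PR; where this fragment nests inside the environment's config file is not shown here and is left as an assumption.

```yaml
# Hypothetical override fragment for an integration or stage environment.
# The parent keys above prometheusSpec depend on the config layout and are omitted.
prometheusSpec:
  purgeWAL: true   # the default introduced by this PR is false
```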

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 13:20

openshift-ci Bot commented May 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janboll

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Pull request overview

This PR targets reducing PrometheusAgent crash loops during WAL replay by loosening CPU throttling, adding an optional WAL purge initContainer, and adjusting remote_write queue settings.

Changes:

  • Removed Prometheus CPU limits in svc/mgmt/opstool values and in the PrometheusAgent template rendering.
  • Added a new prometheusSpec.purgeWAL boolean (config + schema + rendered configs) that conditionally injects an initContainer to delete the WAL before startup.
  • Updated remote_write queueConfig defaults in the PrometheusAgent template and refreshed Helm test/rendered fixtures to match.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| observability/prometheus/values-svc.yaml | Adds purgeWAL and removes CPU limit from svc Prometheus resources. |
| observability/prometheus/values-opstool.yaml | Removes CPU limit from opstool Prometheus resources. |
| observability/prometheus/values-mgmt.yaml | Adds purgeWAL and removes CPU limit from mgmt Prometheus resources. |
| observability/prometheus/deploy/templates/prometheus.yaml | Stops rendering CPU limits, adds conditional WAL purge initContainer, adjusts remote_write queueConfig. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_svc_resources.yaml | Updates expected rendered output for svc (no CPU limit, new queueConfig values). |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_svc_resources_unset.yaml | Updates expected rendered output for svc (queueConfig values). |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources.yaml | Updates expected rendered output for mgmt (no CPU limit, new queueConfig values). |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources_unset.yaml | Updates expected rendered output for mgmt (queueConfig values). |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_svc_1_arohcp_monitor.yaml | Refreshes rendered dev svc fixture for updated queueConfig. |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_arohcp_monitor.yaml | Refreshes rendered dev mgmt fixture for updated queueConfig. |
| config/rendered/dev/prow/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/rendered/dev/pers/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/rendered/dev/perf/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/rendered/dev/dev/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/rendered/dev/cspr/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/rendered/dev/ci01/westus3.yaml | Adds purgeWAL: false and removes CPU limit from rendered dev config. |
| config/config.yaml | Removes default CPU limit setting and introduces default purgeWAL: false for svc/mgmt. |
| config/config.schema.json | Extends schema to include the new prometheusSpec.purgeWAL boolean. |

Comment thread observability/prometheus/deploy/templates/prometheus.yaml
Comment thread config/config.schema.json
Comment on lines +782 to 784
"purgeWAL": {
  "type": "boolean"
}
Copilot AI review requested due to automatic review settings May 26, 2026 13:34

Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Comment thread observability/prometheus/deploy/templates/prometheus.yaml Outdated
initContainers:
  - name: purge-wal
    image: '{{ .Values.prometheusSpec.image.registry }}/{{ .Values.prometheusSpec.image.repository }}@sha256:{{ .Values.prometheusSpec.image.sha }}'
    command: ["sh", "-c", "rm -rf /prometheus/wal && mkdir -p /prometheus/wal && echo 'WAL purged'"]

When this toggle is true, any node drain, image bump, or rolling restart silently drops all buffered samples. Was a one-shot Job or manual procedure considered instead of a permanent StatefulSet init container?
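
For reference, the toggle under discussion could be wired into the template roughly as follows. This is a sketch, not the PR's actual diff: the `if` guard is inferred from the description of `prometheusSpec.purgeWAL` as a conditional init container, and the volume name is a hypothetical placeholder that would have to match the Prometheus data volume.

```yaml
{{- if .Values.prometheusSpec.purgeWAL }}
initContainers:
  - name: purge-wal
    image: '{{ .Values.prometheusSpec.image.registry }}/{{ .Values.prometheusSpec.image.repository }}@sha256:{{ .Values.prometheusSpec.image.sha }}'
    # Runs on every pod start while the toggle is true, which is the behavior questioned above.
    command: ["sh", "-c", "rm -rf /prometheus/wal && mkdir -p /prometheus/wal && echo 'WAL purged'"]
    volumeMounts:
      - name: prometheus-data   # hypothetical volume name; must match the data volume
        mountPath: /prometheus
{{- end }}
```

Because the init container is part of the pod spec, it runs on every restart for as long as `purgeWAL` stays true, so the toggle would need to be flipped back to false after the recovery to avoid the sample loss described in the comment.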

{{- $none := "NONE" -}}
{{- $setRequests := or (ne (.Values.prometheusSpec.resources.requests.cpu | toString) $none) (ne (.Values.prometheusSpec.resources.requests.memory | toString) $none) -}}
{{- $setLimits := or (ne (.Values.prometheusSpec.resources.limits.cpu | toString) $none) (ne (.Values.prometheusSpec.resources.limits.memory | toString) $none) -}}
{{- $setLimits := ne (.Values.prometheusSpec.resources.limits.memory | toString) $none -}}

Why are we removing the CPU limit? Mgmt-plane Prometheus now shares nodes with HCP/CAPI controllers with no CPU ceiling. Is there a PriorityClass or dedicated nodepool bounding the blast radius during replay, or is the assumption that memory pressure will hit first?
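
To make the question concrete, here is a sketch of the values shape the changed `$setLimits` line implies: limits are rendered only when a memory limit is set, and no CPU limit is emitted. All quantities below are hypothetical and not taken from this PR; the `NONE` sentinel is the one the template compares against.

```yaml
# Hypothetical values; only the key structure mirrors the template above.
prometheusSpec:
  resources:
    requests:
      cpu: 500m       # hypothetical; still influences scheduling and CPU shares
      memory: 2Gi     # hypothetical
    limits:
      memory: 4Gi     # hypothetical; setting this to "NONE" would skip limits entirely
      # no cpu key: the container can burst up to the node's spare CPU
```

With a CPU request but no CPU limit, the container is no longer throttled at a fixed ceiling; under contention it competes with other pods in proportion to its request.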
