Fix recurring dcgm-exporter OOMKill on dense-GPU nodes #799
Open
huydhn wants to merge 1 commit into
dcgm-exporter pods OOMKill-loop on multi-GPU nodes, blinding GPU monitoring/alerting there. A single pod reads every GPU on its node and the native DCGM (cgo) memory scales with GPU count, so 512Mi is too low on 8-GPU nodes (g5/g6/p4d .48xlarge). 13 pods on arc-cbr-production have OOMKill history, all on high-GPU-count nodes; the smoke test caught one at 5 restarts. The prior fix (pytorch#631) did not cover the densest nodes. Raise the memory limit 512Mi -> 1Gi and GOMEMLIMIT 450MiB -> 768MiB (both must move together). 1Gi is ~0.1% of these hosts' RAM, so node packing and cost are unaffected.
huydhn
commented
Jun 19, 2026
      # allocations sit on top and scale with GPU count.
      - name: GOMEMLIMIT
-       value: "450MiB"
+       value: "768MiB"
I asked Claude about this number. Here is its reasoning:
The reasoning
GOMEMLIMIT should sit below the hard memory limit because it is a soft limit on the memory managed by the Go runtime (the heap plus
runtime-managed overhead such as goroutine stacks). The native DCGM (cgo) allocations are outside that accounting, so they sit on top
of it and are not counted against it. The gap between GOMEMLIMIT and the hard limit is the headroom left for everything GOMEMLIMIT
doesn't see.
- Hard limit: 1Gi = 1024 MiB
- GOMEMLIMIT: 768 MiB → leaves 256 MiB of headroom for native (cgo) allocations and anything else outside the Go runtime.
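The headroom arithmetic above can be checked with a small script. This is a minimal sketch, not part of the PR: `to_bytes` and `headroom_mib` are hypothetical helper names, and the parser only handles the binary suffixes that appear in this change (Kubernetes writes `Mi`/`Gi`, while GOMEMLIMIT takes `MiB`/`GiB`).

```python
# Hypothetical helpers for illustration; not part of the PR or of dcgm-exporter.
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def to_bytes(quantity: str) -> int:
    """Convert a binary-suffixed size such as '1Gi' or '768MiB' to bytes."""
    text = quantity.strip()
    # GOMEMLIMIT spells units with a trailing 'B' (MiB); Kubernetes omits it (Mi).
    if text.endswith("iB"):
        text = text[:-1]
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)  # no suffix: treat as a plain byte count


def headroom_mib(container_limit: str, gomemlimit: str) -> float:
    """MiB left between the container memory limit and GOMEMLIMIT."""
    gap = to_bytes(container_limit) - to_bytes(gomemlimit)
    if gap <= 0:
        raise ValueError("GOMEMLIMIT must be below the container memory limit")
    return gap / 1024**2


print(headroom_mib("1Gi", "768MiB"))    # new values in this PR -> 256.0
print(headroom_mib("512Mi", "450MiB"))  # previous values -> 62.0
```

By the same arithmetic, the previous pair (512Mi, 450MiB) left only 62 MiB between the two limits, against 256 MiB with the new values.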
Impact: GPU monitoring/alerting on multi-GPU nodes (dcgm-exporter DaemonSet)
Risk: low
I noticed persistent smoke test failures on the prod cluster: https://github.com/pytorch/ci-infra/actions/runs/27790998116/job/82253970946
dcgm-exporter pods OOMKill-loop on multi-GPU nodes, blinding GPU monitoring/alerting there. A single pod reads every GPU on its node and the native DCGM (cgo) memory scales with GPU count, so 512Mi is too low on 8-GPU nodes (g5/g6/p4d .48xlarge). On arc-cbr-production, 13 pods have OOMKill history — all on high-GPU-count nodes; the smoke test caught one at 5 restarts (run). The prior fix (#631) did not cover the densest nodes.
Raises the memory limit 512Mi → 1Gi and GOMEMLIMIT 450MiB → 768MiB. 1Gi is ~0.1% of these hosts' RAM, so node packing and cost are unaffected.
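As a rough check of the ~0.1% figure, here is a sketch that assumes a host with 768 GiB of RAM (the published memory size of a g5.48xlarge). The PR does not state the nodes' memory sizes, so `HOST_RAM_GIB` is an assumption, and `share_of_host_percent` is a hypothetical helper name.

```python
# Assumption for illustration: 768 GiB of host RAM; the PR does not state node memory sizes.
HOST_RAM_GIB = 768
NEW_LIMIT_GIB = 1  # the 1Gi container memory limit from this PR


def share_of_host_percent(limit_gib: float, host_gib: float) -> float:
    """Container memory limit as a percentage of host RAM."""
    return 100.0 * limit_gib / host_gib


print(f"{share_of_host_percent(NEW_LIMIT_GIB, HOST_RAM_GIB):.2f}%")  # 0.13%
```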