
Fix recurring dcgm-exporter OOMKill on dense-GPU nodes #799

Open
huydhn wants to merge 1 commit into pytorch:main from huydhn:fix-dcgm-exporter-oom-dense-gpu-nodes

@huydhn huydhn commented Jun 19, 2026

Contributor

Impact: GPU monitoring/alerting on multi-GPU nodes (dcgm-exporter DaemonSet)
Risk: low

I noticed persistent smoke test failures on the prod cluster: https://github.com/pytorch/ci-infra/actions/runs/27790998116/job/82253970946

dcgm-exporter pods OOMKill-loop on multi-GPU nodes, blinding GPU monitoring/alerting there. A single pod reads every GPU on its node and the native DCGM (cgo) memory scales with GPU count, so 512Mi is too low on 8-GPU nodes (g5/g6/p4d .48xlarge). On arc-cbr-production, 13 pods have OOMKill history — all on high-GPU-count nodes; the smoke test caught one at 5 restarts (run). The prior fix (#631) did not cover the densest nodes.

Raises the memory limit 512Mi → 1Gi and GOMEMLIMIT 450MiB → 768MiB. 1Gi is ~0.1% of these hosts' RAM, so node packing and cost are unaffected.
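
As a rough check of the ~0.1% figure, the sketch below computes a 1 GiB limit as a share of host RAM. The 768 GiB host size is an assumption chosen for illustration and is not stated in this PR; substitute the actual node memory. The helper name `limit_share_percent` is hypothetical.

```python
# Rough check: what share of host RAM does a 1 GiB container limit take?
# ASSUMPTION: 768 GiB of host RAM is an illustrative value, not taken from this PR.
ASSUMED_HOST_RAM_GIB = 768


def limit_share_percent(limit_gib: float, host_ram_gib: float) -> float:
    """Return the container limit as a percentage of host RAM."""
    return 100.0 * limit_gib / host_ram_gib


share = limit_share_percent(1, ASSUMED_HOST_RAM_GIB)
print(f"{share:.2f}% of host RAM")  # prints "0.13% of host RAM"
```

Under that assumption the limit stays near a tenth of a percent of the node, which is the basis for the claim that packing and cost are unaffected.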

dcgm-exporter pods OOMKill-loop on multi-GPU nodes, blinding GPU
monitoring/alerting there. A single pod reads every GPU on its node and
the native DCGM (cgo) memory scales with GPU count, so 512Mi is too low
on 8-GPU nodes (g5/g6/p4d .48xlarge). 13 pods on arc-cbr-production have
OOMKill history, all on high-GPU-count nodes; the smoke test caught one
at 5 restarts. The prior fix (pytorch#631) did not cover the densest nodes.

Raise the memory limit 512Mi -> 1Gi and GOMEMLIMIT 450MiB -> 768MiB
(both must move together). 1Gi is ~0.1% of these hosts' RAM, so node
packing and cost are unaffected.
@huydhn huydhn requested a review from jeanschmidt as a code owner June 19, 2026 01:37
  # allocations sit on top and scale with GPU count.
  - name: GOMEMLIMIT
-   value: "450MiB"
+   value: "768MiB"

Contributor Author

I asked Claude about this number. Here is its reasoning:

The reasoning

GOMEMLIMIT should sit below the hard memory limit, because it is a soft limit on the memory managed by the Go runtime (the Go heap plus runtime-managed overhead such as goroutine stacks). The native DCGM (cgo) allocations
sit on top of it and are not counted against it. So the gap between GOMEMLIMIT and the hard limit is the headroom left
for everything GOMEMLIMIT doesn't see.

- Hard limit: 1Gi = 1024 MiB
- GOMEMLIMIT: 768 MiB → leaves 256 MiB of headroom for native allocations and anything else outside the Go runtime.
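
The headroom arithmetic above can be sketched as a small check that the two settings move together. This is a minimal illustration, not part of the PR: the helper names `to_mib` and `headroom_mib` are hypothetical, and the parser only handles the `Mi`/`Gi` and `MiB`/`GiB` suffixes that appear in this change.

```python
# Units handled by this sketch, mapped to MiB. Kubernetes limits use "Mi"/"Gi",
# while GOMEMLIMIT uses "MiB"/"GiB"; other suffixes are deliberately rejected.
_UNITS_MIB = {"Mi": 1, "MiB": 1, "Gi": 1024, "GiB": 1024}


def to_mib(size: str) -> int:
    """Convert a size such as '1Gi' or '768MiB' to whole MiB."""
    # Try longer suffixes first so "MiB" is matched before "Mi".
    for suffix in sorted(_UNITS_MIB, key=len, reverse=True):
        if size.endswith(suffix):
            return int(size[: -len(suffix)]) * _UNITS_MIB[suffix]
    raise ValueError(f"unsupported size: {size!r}")


def headroom_mib(memory_limit: str, gomemlimit: str) -> int:
    """Return the MiB left between the container limit and GOMEMLIMIT."""
    gap = to_mib(memory_limit) - to_mib(gomemlimit)
    if gap <= 0:
        raise ValueError("GOMEMLIMIT must be below the container memory limit")
    return gap


print(headroom_mib("1Gi", "768MiB"))    # new values: prints 256
print(headroom_mib("512Mi", "450MiB"))  # old values: prints 62
```

With the old values the gap was only 62 MiB, which is consistent with native allocations on 8-GPU nodes pushing the pod over its limit.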

@datadog-pytorch-via-lf


📊 dbt CICD (full report)

Impact Lineage

⚠️ Analysis failed. An error occurred while computing the impact lineage for this PR.

Error details

Error fetching impact lineage results.

Drift Detection

⚠️ Drift detection failed. An error occurred while computing prod-vs-CI tests for this PR.

Error details

Error fetching drift detection results.

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 3431dab | Docs | Give us feedback!
