Fix recurring dcgm-exporter OOMKill on dense-GPU nodes by huydhn · Pull Request #799 · pytorch/ci-infra

huydhn · 2026-06-19T01:37:11Z

Impact: GPU monitoring/alerting on multi-GPU nodes (dcgm-exporter DaemonSet)
Risk: low

I noticed persistent smoke test failures on the prod cluster: https://github.com/pytorch/ci-infra/actions/runs/27790998116/job/82253970946

dcgm-exporter pods OOMKill-loop on multi-GPU nodes, blinding GPU monitoring/alerting there. A single pod reads every GPU on its node and the native DCGM (cgo) memory scales with GPU count, so 512Mi is too low on 8-GPU nodes (g5.48xlarge, g6.48xlarge, p4d.24xlarge). On arc-cbr-production, 13 pods have OOMKill history, all on high-GPU-count nodes; the smoke test caught one at 5 restarts (the run linked above). The prior fix (#631) did not cover the densest nodes.

Raises the memory limit 512Mi → 1Gi and GOMEMLIMIT 450MiB → 768MiB. 1Gi is ~0.1% of these hosts' RAM, so node packing and cost are unaffected.

dcgm-exporter pods OOMKill-loop on multi-GPU nodes, blinding GPU monitoring/alerting there. A single pod reads every GPU on its node and the native DCGM (cgo) memory scales with GPU count, so 512Mi is too low on 8-GPU nodes (g5.48xlarge, g6.48xlarge, p4d.24xlarge). 13 pods on arc-cbr-production have OOMKill history, all on high-GPU-count nodes; the smoke test caught one at 5 restarts. The prior fix (pytorch#631) did not cover the densest nodes. Raise the memory limit 512Mi -> 1Gi and GOMEMLIMIT 450MiB -> 768MiB (both must move together). 1Gi is ~0.1% of these hosts' RAM, so node packing and cost are unaffected.
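
For orientation, the two settings might sit together in the container spec roughly as follows. This is a minimal sketch, not the actual manifest: only the two values (1Gi and 768MiB) come from this PR, while the container name and the surrounding field layout are assumptions made for illustration.

```yaml
# Hypothetical excerpt of a dcgm-exporter DaemonSet container spec.
# Only the memory limit and the GOMEMLIMIT value are taken from this PR;
# the container name and layout are assumed, not copied from the repo.
containers:
  - name: dcgm-exporter        # assumed container name
    resources:
      limits:
        memory: 1Gi            # was 512Mi
    env:
      - name: GOMEMLIMIT
        value: "768MiB"        # was 450MiB; must stay below the hard limit
```

If only the hard limit were raised, the Go runtime would keep targeting the old 450MiB ceiling; if only GOMEMLIMIT were raised, it would exceed the 512Mi hard limit. That is presumably why the two values are changed in the same commit.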

huydhn · 2026-06-19T01:41:19Z

+            # allocations sit on top and scale with GPU count.
            - name: GOMEMLIMIT
-              value: "450MiB"
+              value: "768MiB"


I asked Claude about this number. Here is its reasoning:

GOMEMLIMIT should sit below the hard memory limit because it is a soft limit that covers only memory managed by the Go runtime. The native DCGM (cgo) allocations sit on top of it and are not counted against it. So the gap between GOMEMLIMIT and the hard limit is the headroom left for everything GOMEMLIMIT doesn't see.

- Hard limit: 1Gi = 1024 MiB
- GOMEMLIMIT: 768 MiB, which leaves 256 MiB of headroom for native allocations and other memory outside the Go runtime.
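
For comparison with the previous settings, the same subtraction can be done on both configurations. This uses only the numbers in this PR and treats headroom simply as the hard limit minus GOMEMLIMIT, which is a simplification:

$$
\begin{aligned}
\text{old headroom} &= 512 - 450 = 62\ \text{MiB} \quad (450/512 \approx 88\%\ \text{of the limit}) \\
\text{new headroom} &= 1024 - 768 = 256\ \text{MiB} \quad (768/1024 = 75\%\ \text{of the limit})
\end{aligned}
$$

Under that simplification, the change roughly quadruples the room left for native allocations, in addition to doubling the hard limit.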

datadog-pytorch-via-lf · 2026-06-19T02:07:48Z

📊 dbt CICD (full report)

Impact Lineage

⚠️ Analysis failed. An error occurred while computing the impact lineage for this PR.

Error details

Error fetching impact lineage results.

Drift Detection

⚠️ Drift detection failed. An error occurred while computing prod-vs-CI tests for this PR.

Error details

Error fetching drift detection results.

This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 3431dab

huydhn requested a review from jeanschmidt as a code owner June 19, 2026 01:37

huydhn wants to merge 1 commit into pytorch:main from huydhn:fix-dcgm-exporter-oom-dense-gpu-nodes
