Add trainium, inferentia, and efa parameters to @kubernetes decorator #3086

emattia wants to merge 7 commits into Netflix:master
Conversation
Greptile Summary

This PR adds trainium, inferentia, and efa parameters to the @kubernetes decorator.

Confidence Score: 4/5. Two execution paths (Argo Workflows non-parallel and Airflow) will silently fail to schedule on Neuron nodes because the required toleration is not injected; the core Kubernetes job/jobset paths work correctly. These are two confirmed P1 bugs: pods will not schedule on tainted Trainium nodes via the Argo Workflows non-parallel path or via Airflow. The score is 4 rather than lower because the feature works for the primary direct-Kubernetes execution path. `metaflow/plugins/argo/argo_workflows.py` and `metaflow/plugins/airflow/airflow.py` need the automatic Neuron toleration added to their pod specs.
| Filename | Overview |
|---|---|
| metaflow/plugins/kubernetes/kubernetes_decorator.py | Adds trainium as a new decorator attribute with mutual-exclusion check against gpu, integer validation, and CLI forwarding — correct, though the validator allows trainium=0 which would spuriously add a Neuron toleration. |
| metaflow/plugins/kubernetes/kubernetes_job.py | Correctly adds aws.amazon.com/neuron resource limit and automatically injects the aws.amazon.com/neuron:NoSchedule toleration when trainium is set. |
| metaflow/plugins/kubernetes/kubernetes_jobsets.py | Correctly adds aws.amazon.com/neuron resource limit and auto-injects the Neuron toleration for the parallel JobSet path, consistent with kubernetes_job.py. |
| metaflow/plugins/argo/argo_workflows.py | Adds Neuron resource limit to the non-parallel pod spec and threads trainium through to the JobSet path, but the non-parallel path omits the required aws.amazon.com/neuron:NoSchedule toleration — pods will fail to schedule on Neuron nodes. |
| metaflow/plugins/airflow/airflow.py | Adds Neuron resource limit to the resources dict but never adds a matching aws.amazon.com/neuron:NoSchedule toleration to the Airflow operator args, so pods will remain pending on tainted Neuron nodes. |
| metaflow/plugins/kubernetes/kubernetes_cli.py | Adds --trainium CLI option and threads it through to the step command correctly. |
| metaflow/plugins/kubernetes/kubernetes.py | Adds trainium parameter to both create_job and create_jobset methods and forwards it to the job/jobset constructors correctly. |
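The `trainium=0` edge case flagged for `kubernetes_decorator.py` could be closed with a positive-integer check along these lines (a sketch only; the function name and error wording are assumptions, not the PR's code):

```python
def validate_accelerator_count(name, value):
    # Hypothetical validator sketch: accept None (attribute unset) or a
    # positive integer; reject 0 so a zero request cannot spuriously
    # inject a Neuron toleration.
    if value is None:
        return None
    try:
        count = int(value)
    except (TypeError, ValueError):
        raise ValueError(f"*{name}* must be an integer, got {value!r}")
    if count <= 0:
        raise ValueError(f"*{name}* must be a positive integer, got {count}")
    return count
```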
Comments Outside Diff (1)
- metaflow/plugins/argo/argo_workflows.py, line 2796: Missing automatic Neuron toleration in non-parallel Argo Workflows path

  The non-parallel (non-JobSet) Argo Workflows pod spec adds the `aws.amazon.com/neuron` resource limit (line ~2871) but does not inject the corresponding `aws.amazon.com/neuron:NoSchedule` toleration. Trainium/Inferentia nodes carry that taint by default, so any pod that reaches this code path with `trainium=N` will remain in `Pending` state; it will never be scheduled. The JobSet path correctly auto-injects the toleration (via `kubernetes_jobsets.py`), and `kubernetes_job.py` does the same. The fix is to extend the toleration list here analogously:

  ```python
  .tolerations(
      (resources.get("tolerations") or [])
      + (
          [{"key": "aws.amazon.com/neuron", "operator": "Exists", "effect": "NoSchedule"}]
          if resources.get("trainium") is not None
          else []
      )
  )
  ```
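The same merge logic can be expressed as a standalone helper (a hedged sketch of the suggested fix, not code from the PR; the `resources` dict shape is an assumption):

```python
def merged_tolerations(resources):
    # Combine any user-supplied tolerations with the auto-injected
    # Neuron toleration whenever trainium devices are requested.
    tolerations = list(resources.get("tolerations") or [])
    if resources.get("trainium") is not None:
        tolerations.append(
            {"key": "aws.amazon.com/neuron", "operator": "Exists", "effect": "NoSchedule"}
        )
    return tolerations
```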
Reviews (1): Last reviewed commit: "Add trainium parameter to @kubernetes de..."
…efa=N) parameter on `@kubernetes`. When `efa` is set, the pod requests N `vpc.amazonaws.com/efa` resources, advertised by the AWS EFA k8s device plugin on EFA-enabled nodes. Plumbed through to the argo and airflow runtimes consistently with how `trainium` is.
PR Type
Summary
Mirror `@batch`'s AWS-accelerator surface on `@kubernetes`:

- `@kubernetes(trainium=N)` requests N AWS Trainium / Inferentia Neuron devices (`aws.amazon.com/neuron` k8s resource).
- `@kubernetes(inferentia=N)` is an alias for `trainium`, mirroring `@batch(inferentia=N)` for API consistency.
- `@kubernetes(efa=N)` requests N AWS Elastic Fabric Adapter network interfaces (`vpc.amazonaws.com/efa` k8s resource).
- Plumbed through `kubernetes_job`, `kubernetes_jobsets`, `kubernetes_cli`, and the argo / airflow runtimes consistently with how the existing `gpu` parameter is handled.

Issue

No tracking issue. Supersedes the original PR scope of just `trainium`. Brings the `@kubernetes` path to parity with `@batch` for AWS Neuron and EFA workloads, unblocking customers who run their own EKS clusters and want first-class Neuron/EFA support without writing raw pod specs.
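The attribute-to-resource mapping described above can be sketched in plain Python (an illustration under the PR's description, not the actual Metaflow implementation; the function name is an assumption):

```python
def accelerator_limits(trainium=None, inferentia=None, efa=None):
    # Map the new decorator attributes onto Kubernetes resource-limit
    # keys; inferentia is an alias that collapses into trainium.
    limits = {}
    neuron = trainium if trainium is not None else inferentia
    if neuron is not None:
        limits["aws.amazon.com/neuron"] = str(neuron)
    if efa is not None:
        limits["vpc.amazonaws.com/efa"] = str(efa)
    return limits
```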
Reproduction
Runtime: kubernetes (EKS with AWS Neuron and EFA device plugins installed; nodes labeled with the relevant accelerator).

Commands to run:

Where evidence shows up: task pod spec (`kubectl describe pod`) and NCCL debug log inside the running container.
Before (master)

(also for `inferentia`, `efa`)

After (this PR)
Root Cause

Not a bug fix; this is a net-new feature. The underlying Kubernetes resources (`aws.amazon.com/neuron`, `vpc.amazonaws.com/efa`) are advertised by the respective AWS device plugins; `@kubernetes` had no decorator-level surface to request them. `@batch` already exposed `trainium`, `inferentia`, and `efa`. This PR brings `@kubernetes` to parity.

Why This Fix Is Correct
Mirrors `@batch`'s API surface exactly: `inferentia` collapses into `trainium` at `step_init` and is popped before any runtime translation, the same shape as `batch_decorator.py:175-211`, only with `trainium` as canonical (since on K8s the underlying resource name is `aws.amazon.com/neuron` and we surface what users running on Trainium hardware naturally type first). `gpu` and `trainium` are enforced as mutually exclusive (matching `@batch`'s convention) earlier in this branch; `efa` follows the same pattern.

Failure Modes Considered
- Existing `gpu` / `gpu_vendor` flows are unaffected; the new attributes default to `None` and resource-limit emission is gated on non-None values.
- Specifying both `inferentia` and `trainium` raises a clear error in `step_init` (mirrors `@batch`). Specifying both `gpu` and `trainium` was already enforced.
- `inferentia` is popped from `self.attributes` after collapsing into `trainium`, so the runtime CLI / argo / airflow translation only ever sees the canonical key.
- `efa` is plumbed through `kubernetes_job`, `kubernetes_jobsets`, argo, and airflow consistently with how `trainium` was already plumbed.
- `efa` value is validated as a positive integer (mirrors the `trainium` and `tmpfs_size` validation patterns in the same file).

Tests
- Unit tests mirroring existing `kube_utils` tests, with parametrize cases for mutual-exclusion + resource-limit emission. Happy to land tests either in this PR (push another commit) or a follow-up; let me know reviewer preference.
- Tested end-to-end on a live cluster with the AWS Neuron and EFA device plugins. Pod spec contains the right resource limits; NCCL via aws-ofi-nccl selects EFA as the network backend.
- (CI does not truly exercise the runtime path, only static / unit checks.)
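The parametrize cases proposed above could be shaped like this (plain asserts for self-containment; `emit_limits` is a stand-in for the decorator's real translation, not Metaflow code):

```python
def emit_limits(trainium=None, efa=None, gpu=None):
    # Stand-in: translate decorator attributes into resource limits,
    # rejecting the gpu+trainium combination up front.
    if gpu is not None and trainium is not None:
        raise ValueError("gpu and trainium are mutually exclusive")
    limits = {}
    if trainium is not None:
        limits["aws.amazon.com/neuron"] = str(trainium)
    if efa is not None:
        limits["vpc.amazonaws.com/efa"] = str(efa)
    if gpu is not None:
        limits["nvidia.com/gpu"] = str(gpu)
    return limits

cases = [
    ({"trainium": 2}, {"aws.amazon.com/neuron": "2"}),
    ({"efa": 4}, {"vpc.amazonaws.com/efa": "4"}),
    ({"gpu": 1, "trainium": 1}, None),  # None marks an expected ValueError
]
for kwargs, expected in cases:
    if expected is None:
        try:
            emit_limits(**kwargs)
            raise AssertionError(f"expected ValueError for {kwargs}")
        except ValueError:
            pass
    else:
        assert emit_limits(**kwargs) == expected
```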
Non-Goals
- Changes to `@batch` (it already has these parameters).
- A `--inferentia` CLI flag: `inferentia` is purely a decorator-time convenience that resolves to `trainium` before any CLI invocation, mirroring `@batch`'s CLI, which only exposes the canonical name (`--inferentia` for batch since `inferentia` is canonical there; `--trainium` for k8s since `trainium` is canonical here).
- Auto-injecting EFA environment variables (`FI_PROVIDER`, `FI_EFA_USE_DEVICE_RDMA`). Users set those via `@environment` for now; auto-injection is a separate ergonomics PR.
- Which node types pods should target; that's a cluster-side concern (instance allowlist …).
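The decorator-time resolution described above (`inferentia` collapsing into `trainium` before any CLI invocation) might look roughly like this (a sketch; the dict-based attribute shape and function name are assumptions):

```python
def resolve_neuron_alias(attributes):
    # Collapse the inferentia alias into the canonical trainium key and
    # enforce the mutual-exclusion rules described in the PR.
    attrs = dict(attributes)
    inferentia = attrs.pop("inferentia", None)
    if inferentia is not None:
        if attrs.get("trainium") is not None:
            raise ValueError("Specify only one of *trainium* or *inferentia*")
        attrs["trainium"] = inferentia
    if attrs.get("gpu") is not None and attrs.get("trainium") is not None:
        raise ValueError("*gpu* and *trainium* are mutually exclusive")
    return attrs
```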
AI Tool Usage

Used for … selection, Karpenter EFA NIC layout prior art, and drafting this PR description. All generated code was reviewed, understood, and tested end-to-end on a live cluster.