
Enable dynamic nnodes override in flytekit #3

Open
jld-adriano wants to merge 1 commit into master from
cursor/enable-dynamic-nnodes-override-in-flytekit-e3ee


@jld-adriano

Why are the changes needed?

This PR lets users override the number of nodes (nnodes) for PyTorch tasks at runtime. Previously, the worker_replicas count for a PyTorchJob was fixed at task registration, so a single registered task could not switch between single-pod and multi-node execution through runtime overrides.

This change enables the backend to adjust the PyTorchJob's replica count based on environment variables, allowing a task registered for multi-node execution to effectively run as a single-pod job (master only) when overridden.

What changes were proposed in this pull request?

This pull request proposes changes to flyteplugins/go/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go to introduce runtime overrides for PyTorchJob worker replicas and elastic policy.

Specifically:

  • The BuildResource method now inspects environment variables from TaskExecutionMetadata.
  • If PET_NNODES is set, it parses the value (single integer or "min:max" range) to dynamically adjust:
    • For non-elastic jobs: worker_replicas (setting to 0 for single-node overrides, i.e., PET_NNODES=1).
    • For elastic jobs: MinReplicas, MaxReplicas, and Replicas in the ElasticPolicy.
  • If FLYTE_PYTORCH_WORKERS is set, it explicitly overrides the worker_replicas and, for elastic jobs, also sets MaxReplicas (clamping MinReplicas if necessary).

This allows a PyTorch task registered as a pytorch type to dynamically switch its replica configuration at launch time.
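
The override rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names parseNNodes, nonElasticWorkers, and clampElastic are hypothetical, and the mapping from nnodes to workers (one node treated as the master) is an assumption that goes beyond the one case the description confirms, PET_NNODES=1 giving 0 workers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNNodes parses a PET_NNODES-style value: a single positive integer
// ("4") or a "min:max" range ("2:4"). Hypothetical helper, not the PR's code.
func parseNNodes(value string) (minNodes, maxNodes int32, err error) {
	parts := strings.Split(strings.TrimSpace(value), ":")
	if len(parts) > 2 {
		return 0, 0, fmt.Errorf("invalid nnodes %q: expected N or MIN:MAX", value)
	}
	bounds := make([]int32, 0, 2)
	for _, p := range parts {
		n, convErr := strconv.ParseInt(strings.TrimSpace(p), 10, 32)
		if convErr != nil {
			return 0, 0, fmt.Errorf("invalid nnodes %q: %w", value, convErr)
		}
		if n < 1 {
			return 0, 0, fmt.Errorf("invalid nnodes %q: values must be >= 1", value)
		}
		bounds = append(bounds, int32(n))
	}
	// A single integer yields min == max.
	minNodes, maxNodes = bounds[0], bounds[len(bounds)-1]
	if minNodes > maxNodes {
		return 0, 0, fmt.Errorf("invalid nnodes %q: min exceeds max", value)
	}
	return minNodes, maxNodes, nil
}

// nonElasticWorkers maps a node count to a worker replica count.
// Assumption: one node is the master, so workers = nnodes - 1. The PR
// description only confirms the nnodes=1 case (0 workers).
func nonElasticWorkers(nnodes int32) int32 {
	return nnodes - 1
}

// clampElastic applies an explicit worker count (as FLYTE_PYTORCH_WORKERS
// would) to an elastic range: max becomes workers, and min is lowered if it
// would otherwise exceed the new max.
func clampElastic(minReplicas, workers int32) (newMin, newMax int32) {
	if minReplicas > workers {
		minReplicas = workers
	}
	return minReplicas, workers
}
```

Under these assumptions, PET_NNODES=1 yields zero workers for a non-elastic job (master only), while PET_NNODES=2:4 yields an elastic range of 2 to 4 that an explicit worker count can then narrow.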

How was this patch tested?

The patch was tested only by compiling the affected package with go build ./... successfully.
Unit tests covering the new override logic have not been written yet and should be added.

Labels

  • changed
  • added

Setup process

Not applicable for this backend change.

Screenshots

Not applicable.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link


Slack Thread



@cursor cursor Bot force-pushed the cursor/enable-dynamic-nnodes-override-in-flytekit-e3ee branch from cab91fd to 42ec508 Compare August 7, 2025 20:52
