feat(kfpytorch): add use_pytorch_job flag on Elastic for nnodes=1 #53
Open
devin-ai-integration[bot] wants to merge 1 commit into master from
Conversation
Single-node Elastic (`nnodes=1`) uses `task_type=python-task`, which lands the training pod in the flyte launcher's auto-created PodGroup with `minMember=1`. The launcher pod reaches Succeeded almost immediately, the PodGroup hits `phase=Completed` before the training pod is even Pending, and volcano's preempt action skips Completed PodGroups. Gang scheduling and priority-based preemption therefore never evaluate single-node Elastic tasks.

Multi-node already avoids this because `task_type=pytorch` emits a PyTorchJob CRD and the kubeflow training-operator creates a dedicated PodGroup keyed on the PyTorchJob with `minMember=replicas`, independent of the launcher.

Add an opt-in `use_pytorch_job` flag on `Elastic` (default `False`, so existing behavior is unchanged). When `True`, force `task_type=pytorch` and emit a `DistributedPyTorchTrainingTask` with `min=max=nnodes` even for `nnodes=1`, so single-node runs get the same dedicated PodGroup.
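For orientation, a minimal sketch of the branching this summary describes; the flag name, the two task types, and the `_resolve_task_type()` helper come from this PR, but the exact signature below is an assumption, not the actual diff:

```python
# Sketch only: approximates the task-type resolution described above.
# "python-task" = standalone pod path, "pytorch" = PyTorchJob CRD path.
def _resolve_task_type(nnodes, use_pytorch_job: bool) -> str:
    if nnodes == 1 and not use_pytorch_job:
        # Current default: single-node Elastic runs as a plain pod and
        # inherits the launcher's short-lived PodGroup.
        return "python-task"
    # Multi-node, or single-node with use_pytorch_job=True: emit a PyTorchJob
    # so the training-operator creates a dedicated PodGroup for the workers.
    return "pytorch"
```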
Why are the changes needed?
Single-node Elastic (`nnodes=1`) takes `task_type="python-task"` and skips the PyTorchJob CRD entirely. The training pod then ends up in the flyte launcher's auto-created PodGroup with `minMember=1`. The launcher Succeeds almost immediately, the PodGroup hits `phase=Completed` before the training pod is Pending, and volcano's preempt action iterates non-terminal PodGroups only, so gang scheduling and priority-based preemption never evaluate single-node Elastic tasks.

Multi-node already avoids this: `task_type="pytorch"` emits a PyTorchJob CRD, and the kubeflow training-operator creates a dedicated PodGroup keyed on the PyTorchJob with `minMember=replicas`, independent of the launcher. Preempt sees it as a real candidate.

We hit this while wiring volcano preemption for interruptible training runs. With `nnodes=1` the victim's PodGroup was `phase=Completed, succeeded=1, minMember=1` and volcano never evicted. Flipping the same test harness to `nnodes=2` fired the `Evict` event end-to-end.
What changes were proposed in this pull request?

Add an opt-in `use_pytorch_job: bool = False` field on `Elastic` (the default preserves the current single-node-as-standalone-pod behavior). When `True`, `task_type` is forced to `"pytorch"` and `get_custom()` emits a `DistributedPyTorchTrainingTask` with `min_replicas = max_replicas = nnodes` even for `nnodes=1`, so single-node Elastic opts into the PyTorchJob CRD path and gets a dedicated training-operator-managed PodGroup.

The implementation centralizes the branching in a `_resolve_task_type()` staticmethod on `PytorchElasticFunctionTask`, so `__init__`, the `task_type` property, and `get_custom()` all agree, including when `task_config` is replaced via `with_overrides()`.

Callers that need gang scheduling or priority-based preemption for single-node jobs set:
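For illustration, a hedged example of what opting in could look like at the task level, assuming the flag lands as described in this PR (the `nproc_per_node` value is arbitrary):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

@task(
    task_config=Elastic(
        nnodes=1,               # single node
        nproc_per_node=8,       # illustrative value
        use_pytorch_job=True,   # opt in to the PyTorchJob CRD / dedicated PodGroup path
    ),
)
def train() -> None:
    ...
```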
How was this patch tested?
- `test_use_pytorch_job_forces_pytorchjob_for_single_node`: verifies `task_type == "pytorch"` with the flag and `"python-task"` without (a sketch of this check appears below).
- `test_use_pytorch_job_emits_elastic_custom_for_single_node`: verifies `get_custom()` emits `workerReplicas.replicas=1` and `elasticConfig.minReplicas=maxReplicas=1` with the flag, and falls through to the standalone path without.
- Ran the `plugins/flytekit-kf-pytorch/tests/` suite. The existing `test_end_to_end[spawn|fork]` and `test_output_metadata_passing[spawn]` failures reproduce on master and are unrelated (torch distributed runtime env).
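A rough sketch of the first check, assuming the flag and task-type resolution behave as described above (not copied from the PR's test file):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

def test_use_pytorch_job_forces_pytorchjob_for_single_node():
    @task(task_config=Elastic(nnodes=1, use_pytorch_job=True))
    def train_with_flag() -> None:
        ...

    @task(task_config=Elastic(nnodes=1))
    def train_without_flag() -> None:
        ...

    # With the flag, single-node Elastic should take the PyTorchJob path;
    # without it, the existing standalone-pod behavior is preserved.
    assert train_with_flag.task_type == "pytorch"
    assert train_without_flag.task_type == "python-task"
```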
Link to Devin session: https://app.devin.ai/sessions/70e7b3c4299647bead08616cf9ff2a3a