feat(kfpytorch): add use_pytorch_job flag on Elastic for nnodes=1 #53

Open

devin-ai-integration[bot] wants to merge 1 commit into master from devin/1776817211-elastic-force-pytorchjob

Conversation

@devin-ai-integration

Why are the changes needed?

Single-node Elastic (nnodes=1) uses task_type="python-task" and skips the PyTorchJob CRD entirely. The training pod then ends up in the flyte launcher's auto-created PodGroup with minMember=1. The launcher Succeeds almost immediately, the PodGroup hits phase=Completed before the training pod is even Pending, and volcano's preempt action only iterates over non-terminal PodGroups, so gang scheduling and priority-based preemption never evaluate single-node Elastic tasks.

Multi-node already avoids this: task_type="pytorch" emits a PyTorchJob CRD, the kubeflow training-operator creates a dedicated PodGroup keyed on the PyTorchJob with minMember=replicas, independent of the launcher. Preempt sees it as a real candidate.

We hit this while wiring volcano preemption for interruptible training runs. With nnodes=1 the victim's PodGroup was phase=Completed, succeeded=1, minMember=1 and volcano never evicted. Flipping the same test harness to nnodes=2 fired the Evict event end-to-end.
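To make the failure mode concrete, a read-only check against the cluster can list PodGroups and flag which ones a preempt-style pass would still consider. This is a rough diagnostic sketch, not volcano's scheduler code; it assumes the standard scheduling.volcano.sh/v1beta1 PodGroup CRD, a kubeconfig with read access, and a placeholder namespace.

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# List volcano PodGroups in the namespace where the flyte tasks run.
podgroups = api.list_namespaced_custom_object(
    group="scheduling.volcano.sh",
    version="v1beta1",
    namespace="flytesnacks-development",  # placeholder namespace
    plural="podgroups",
)

for pg in podgroups.get("items", []):
    name = pg["metadata"]["name"]
    phase = pg.get("status", {}).get("phase", "Unknown")
    min_member = pg.get("spec", {}).get("minMember")
    # A PodGroup already in phase=Completed is terminal and never considered
    # for preemption; this is where the single-node Elastic victim ends up.
    is_candidate = phase != "Completed"
    print(f"{name}: phase={phase} minMember={min_member} preempt_candidate={is_candidate}")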

What changes were proposed in this pull request?

Add an opt-in use_pytorch_job: bool = False field on Elastic (default preserves current single-node-as-standalone-pod behavior). When True, task_type is forced to "pytorch" and get_custom() emits a DistributedPyTorchTrainingTask with min_replicas = max_replicas = nnodes even for nnodes=1, so single-node Elastic opts into the PyTorchJob CRD path and gets a dedicated training-operator-managed PodGroup.

Implementation centralizes the branching in a _resolve_task_type() staticmethod on PytorchElasticFunctionTask so __init__, the task_type property, and get_custom() all agree — including when task_config is replaced via with_overrides().
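The diff is not reproduced here, but the shape of the change reads roughly like the sketch below. The names follow the PR text (Elastic, PytorchElasticFunctionTask, use_pytorch_job, _resolve_task_type); the bodies and the get_custom() signature are simplified stand-ins, and the real get_custom() emits a serialized DistributedPyTorchTrainingTask rather than a plain dict.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Elastic:
    nnodes: int = 1
    nproc_per_node: int = 1
    use_pytorch_job: bool = False  # opt-in: force the PyTorchJob CRD path for nnodes=1

class PytorchElasticFunctionTask:
    _ELASTIC_TASK_TYPE = "pytorch"
    _ELASTIC_TASK_TYPE_STANDALONE = "python-task"

    @staticmethod
    def _resolve_task_type(task_config: Elastic) -> str:
        # Single source of truth used by __init__, the task_type property and
        # get_custom(), so a task_config swapped in via with_overrides() stays consistent.
        if task_config.nnodes == 1 and not task_config.use_pytorch_job:
            return PytorchElasticFunctionTask._ELASTIC_TASK_TYPE_STANDALONE
        return PytorchElasticFunctionTask._ELASTIC_TASK_TYPE

    def get_custom(self, task_config: Elastic) -> Optional[dict]:
        if self._resolve_task_type(task_config) != self._ELASTIC_TASK_TYPE:
            return None  # standalone pod path, unchanged default behavior
        # Pin min/max replicas to nnodes so nnodes=1 still yields a PyTorchJob
        # with a dedicated training-operator-managed PodGroup.
        return {
            "workerReplicas": {"replicas": task_config.nnodes},
            "elasticConfig": {
                "minReplicas": task_config.nnodes,
                "maxReplicas": task_config.nnodes,
                "nprocPerNode": task_config.nproc_per_node,
            },
        }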

Callers that need gang scheduling or priority-based preemption for single-node jobs set:

from flytekit import task
from flytekitplugins.kfpytorch import Elastic

@task(task_config=Elastic(nnodes=1, nproc_per_node=8, use_pytorch_job=True))
def train():
    ...

How was this patch tested?

  • Added test_use_pytorch_job_forces_pytorchjob_for_single_node: verifies task_type=="pytorch" with the flag and "python-task" without (a rough sketch follows this list).
  • Added test_use_pytorch_job_emits_elastic_custom_for_single_node: verifies get_custom() emits workerReplicas.replicas=1 and elasticConfig.minReplicas=maxReplicas=1 with the flag, and falls through to the standalone path without.
  • Ran the full plugins/flytekit-kf-pytorch/tests/ suite. The existing test_end_to_end[spawn|fork] and test_output_metadata_passing[spawn] failures reproduce on master and are unrelated (torch distributed runtime env).
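As an illustration, the first test above might look roughly like this; it leans on the usage example from the description and assumes the flag behaves exactly as described, so it is a sketch rather than the actual test file.

from flytekit import task
from flytekitplugins.kfpytorch import Elastic

def test_use_pytorch_job_forces_pytorchjob_for_single_node():
    @task(task_config=Elastic(nnodes=1, nproc_per_node=8, use_pytorch_job=True))
    def train_forced():
        ...

    @task(task_config=Elastic(nnodes=1, nproc_per_node=8))
    def train_default():
        ...

    # With the flag, single-node Elastic opts into the PyTorchJob CRD path;
    # without it, the existing standalone-pod behavior is preserved.
    assert train_forced.task_type == "pytorch"
    assert train_default.task_type == "python-task"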

Check all the applicable boxes

  • All new and existing tests passed.

Link to Devin session: https://app.devin.ai/sessions/70e7b3c4299647bead08616cf9ff2a3a

Single-node Elastic (nnodes=1) uses task_type=python-task, which lands
the training pod in the flyte launcher's auto-created PodGroup with
minMember=1. The launcher Succeeds almost immediately, the PodGroup
hits phase=Completed before the training pod is Pending, and volcano's
preempt action skips Completed PodGroups. Gang scheduling and
priority-based preemption therefore never evaluate single-node Elastic
tasks.

Multi-node already avoids this because task_type=pytorch emits a
PyTorchJob CRD and the kubeflow training-operator creates a dedicated
PodGroup keyed on the PyTorchJob with minMember=replicas, independent
of the launcher.

Add an opt-in use_pytorch_job flag on Elastic (default False, so
behavior is unchanged). When True, force task_type=pytorch and emit a
DistributedPyTorchTrainingTask with min=max=nnodes even for nnodes=1,
so single-node runs get the same dedicated PodGroup.
@devin-ai-integration
Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring
