feat(kfpytorch): add use_pytorch_job flag on Elastic for nnodes=1 #53
Open
devin-ai-integration[bot] wants to merge 1 commit into master from
Conversation
Single-node Elastic (`nnodes=1`) uses `task_type=python-task`, which lands the training pod in the flyte launcher's auto-created PodGroup with `minMember=1`. The launcher pod reaches Succeeded almost immediately, the PodGroup hits `phase=Completed` before the training pod is even Pending, and volcano's preempt action skips Completed PodGroups. Gang scheduling and priority-based preemption therefore never evaluate single-node Elastic tasks.

Multi-node already avoids this because `task_type=pytorch` emits a PyTorchJob CRD and the kubeflow training-operator creates a dedicated PodGroup keyed on the PyTorchJob with `minMember=replicas`, independent of the launcher.

Add an opt-in `use_pytorch_job` flag on `Elastic` (default `False`, so existing behavior is unchanged). When `True`, force `task_type=pytorch` and emit a `DistributedPyTorchTrainingTask` with `min=max=nnodes` even for `nnodes=1`, so single-node runs get the same dedicated PodGroup.
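For orientation, a minimal sketch of the branching this summary describes; the flag name, the two task types, and the `_resolve_task_type()` helper come from this PR, but the exact signature below is an assumption, not the actual diff:

```python
# Sketch only: approximates the task-type resolution described above.
# "python-task" = standalone pod path, "pytorch" = PyTorchJob CRD path.
def _resolve_task_type(nnodes, use_pytorch_job: bool) -> str:
    if nnodes == 1 and not use_pytorch_job:
        # Current default: single-node Elastic runs as a plain pod and
        # inherits the launcher's short-lived PodGroup.
        return "python-task"
    # Multi-node, or single-node with use_pytorch_job=True: emit a PyTorchJob
    # so the training-operator creates a dedicated PodGroup for the workers.
    return "pytorch"
```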
Why are the changes needed?
Single-node Elastic (`nnodes=1`) takes `task_type="python-task"` and skips the PyTorchJob CRD entirely. The training pod then ends up in the flyte launcher's auto-created PodGroup with `minMember=1`. The launcher Succeeds almost immediately, the PodGroup hits `phase=Completed` before the training pod is Pending, and volcano's preempt action iterates non-terminal PodGroups only, so gang scheduling and priority-based preemption never evaluate single-node Elastic tasks.

Multi-node already avoids this: `task_type="pytorch"` emits a PyTorchJob CRD, and the kubeflow training-operator creates a dedicated PodGroup keyed on the PyTorchJob with `minMember=replicas`, independent of the launcher. Preempt sees it as a real candidate.

We hit this while wiring volcano preemption for interruptible training runs. With `nnodes=1` the victim's PodGroup was `phase=Completed, succeeded=1, minMember=1` and volcano never evicted. Flipping the same test harness to `nnodes=2` fired the `Evict` event end-to-end.
What changes were proposed in this pull request?

Add an opt-in `use_pytorch_job: bool = False` field on `Elastic` (the default preserves the current single-node-as-standalone-pod behavior). When `True`, `task_type` is forced to `"pytorch"` and `get_custom()` emits a `DistributedPyTorchTrainingTask` with `min_replicas = max_replicas = nnodes` even for `nnodes=1`, so single-node Elastic opts into the PyTorchJob CRD path and gets a dedicated training-operator-managed PodGroup.

The implementation centralizes the branching in a `_resolve_task_type()` staticmethod on `PytorchElasticFunctionTask`, so `__init__`, the `task_type` property, and `get_custom()` all agree, including when `task_config` is replaced via `with_overrides()`.

Callers that need gang scheduling or priority-based preemption for single-node jobs set:
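For illustration, a hedged example of what opting in could look like at the task level, assuming the flag lands as described in this PR (the `nproc_per_node` value is arbitrary):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

@task(
    task_config=Elastic(
        nnodes=1,               # single node
        nproc_per_node=8,       # illustrative value
        use_pytorch_job=True,   # opt in to the PyTorchJob CRD / dedicated PodGroup path
    ),
)
def train() -> None:
    ...
```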
How was this patch tested?
- `test_use_pytorch_job_forces_pytorchjob_for_single_node`: verifies `task_type == "pytorch"` with the flag and `"python-task"` without (a sketch of this check appears below).
- `test_use_pytorch_job_emits_elastic_custom_for_single_node`: verifies `get_custom()` emits `workerReplicas.replicas=1` and `elasticConfig.minReplicas=maxReplicas=1` with the flag, and falls through to the standalone path without.
- Ran the `plugins/flytekit-kf-pytorch/tests/` suite. The existing `test_end_to_end[spawn|fork]` and `test_output_metadata_passing[spawn]` failures reproduce on master and are unrelated (torch distributed runtime env).
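A rough sketch of the first check, assuming the flag and task-type resolution behave as described above (not copied from the PR's test file):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

def test_use_pytorch_job_forces_pytorchjob_for_single_node():
    @task(task_config=Elastic(nnodes=1, use_pytorch_job=True))
    def train_with_flag() -> None:
        ...

    @task(task_config=Elastic(nnodes=1))
    def train_without_flag() -> None:
        ...

    # With the flag, single-node Elastic should take the PyTorchJob path;
    # without it, the existing standalone-pod behavior is preserved.
    assert train_with_flag.task_type == "pytorch"
    assert train_without_flag.task_type == "python-task"
```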
Link to Devin session: https://app.devin.ai/sessions/70e7b3c4299647bead08616cf9ff2a3a