Fallback to python task if worker is zero for pytorch #1629
ByronHsu wants to merge 2 commits into flyteorg:master
Conversation
Signed-off-by: byhsu <byhsu@linkedin.com>
Codecov Report
@@           Coverage Diff           @@
##           master    #1629   +/-   ##
=======================================
  Coverage   71.22%   71.22%
=======================================
  Files         334      334
  Lines       30391    30391
  Branches     5490     5490
=======================================
  Hits        21645    21645
  Misses       8206     8206
  Partials      540      540
        task_type=task_type,
        **kwargs,
    )
Two questions:
- I think we might need to handle `workers=0` separately in `get_custom`, since in this case we don't want to create a `PyTorchJob`. (See how this is done for the elastic task below.)
- Currently, `task_config=Pytorch(workers=0)` is equivalent to no `task_config` at all. However, `torch.distributed.init_process_group()` will not work without the env vars set by the operator. We could solve this by overwriting the `execute` method and simply setting the env vars `WORLD_SIZE=1`, `RANK=0`, and potentially the master address (would have to try whether it is required); a rough sketch is below.
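A rough sketch of that second option (hypothetical, assuming the plugin class inherits `PythonFunctionTask.execute` and keeps a `task_config` with a `workers` field; not the actual plugin code):

```python
import os

from flytekit.core.python_function_task import PythonFunctionTask


class SingleWorkerPyTorchTask(PythonFunctionTask):
    """Hypothetical sketch only -- not the real plugin class."""

    def execute(self, **kwargs):
        # With workers=0 we skip the PyTorchJob and just export the env vars
        # that the kubeflow training operator would normally provide, so that
        # torch.distributed.init_process_group() can still run locally.
        if getattr(self.task_config, "workers", None) == 0:
            os.environ.setdefault("WORLD_SIZE", "1")
            os.environ.setdefault("RANK", "0")
            os.environ.setdefault("MASTER_ADDR", "localhost")
            os.environ.setdefault("MASTER_PORT", "29500")
        return super().execute(**kwargs)
```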
What do y'all think about throwing an error if workers=0 and telling people to use a standard python config if they want to run it on a single machine?
If people really want to set workers to 0 then I understand having a smooth fallback, but otherwise it could confuse people if they make a mistake.
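Something along these lines (a hypothetical sketch of that validation, not the plugin's actual dataclass):

```python
from dataclasses import dataclass


@dataclass
class PyTorch:
    """Hypothetical sketch of the config dataclass, for illustration only."""

    workers: int = 1

    def __post_init__(self):
        # Fail fast instead of silently falling back: single-machine training
        # should just use a plain python task without any task_config.
        if self.workers < 1:
            raise ValueError(
                "workers must be >= 1; drop the task_config to run on a single machine"
            )
```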
@ByronHsu @wild-endeavor the new pytorch elastic task can run locally and in a single k8s pod but also with multiple workers using kubeflow training operator. I'd say its functionality is a superset of the already existing PyTorch task config. What do you think about using this one in order to debug dist training with a single worker @ByronHsu ?
I think falling back to a normal pod (without the kubeflow operator) when doing task_config=PyTorch(num_workers=0) doesn't make much sense, because the env vars like MASTER_ADDR, RANK, ... required by torch.distributed.init_process_group() will be set neither by the kubeflow operator nor by the pytorch task logic, so distributed training cannot be tested.
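To illustrate the failure mode (a minimal example, assuming the default `env://` rendezvous):

```python
import os

import torch.distributed as dist

# In a plain pod (no kubeflow operator) none of these are set:
print(os.environ.get("MASTER_ADDR"), os.environ.get("RANK"))  # -> None None

# The default "env://" rendezvous reads MASTER_ADDR, MASTER_PORT, RANK and
# WORLD_SIZE from the environment, so this call fails without them.
dist.init_process_group(backend="gloo", init_method="env://")
```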
I would propose to either 1) allow num_workers=0 in the PyTorch task but use the kubeflow training operator in this case as well (users who don't want to use the training operator can use Elastic), or 2) not allow num_workers=0, as is the case now.
wild-endeavor left a comment
maybe @peridotml can chime in on this pr?
should this be closed?
TL;DR
Fallback to python task if worker is zero for pytorch
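For reference, the usage this change targets would look roughly like this (hypothetical sketch; the parameter name follows the discussion above and may differ in the actual plugin release):

```python
from flytekit import task
from flytekitplugins.kfpytorch import PyTorch


# With num_workers=0, the task would fall back to executing as a plain
# python task instead of submitting a PyTorchJob to the training operator.
@task(task_config=PyTorch(num_workers=0))
def train() -> None:
    ...
```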
Type
Are all requirements met?
Tracking Issue
flyteorg/flyteplugins#348