Fallback to python task if worker is zero for pytorch #1629
ByronHsu wants to merge 2 commits into flyteorg:master
Conversation
Signed-off-by: byhsu <byhsu@linkedin.com>
Codecov Report
@@           Coverage Diff           @@
##           master    #1629   +/-   ##
=======================================
  Coverage   71.22%   71.22%
=======================================
  Files         334      334
  Lines       30391    30391
  Branches     5490     5490
=======================================
  Hits        21645    21645
  Misses       8206     8206
  Partials      540      540
        task_type=task_type,
        **kwargs,
    )
Two questions:
- I think we might need to handle `workers=0` separately in `get_custom`, since in this case we don't want to create a `PyTorchJob`. (See how this is done for the elastic task below.)
- Currently, `task_config=Pytorch(workers=0)` is equivalent to no `task_config` at all. However, `torch.distributed.init_process_group()` will not work without the env vars set by the operator. We could solve this by overwriting the `execute` method and simply setting the env vars `WORLD_SIZE=1`, `RANK=0`, and potentially the master address (would have to try whether it is required); a rough sketch is below.
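A rough sketch of that second option (hypothetical, assuming the plugin class inherits `PythonFunctionTask.execute` and keeps a `task_config` with a `workers` field; not the actual plugin code):

```python
import os

from flytekit.core.python_function_task import PythonFunctionTask


class SingleWorkerPyTorchTask(PythonFunctionTask):
    """Hypothetical sketch only -- not the real plugin class."""

    def execute(self, **kwargs):
        # With workers=0 we skip the PyTorchJob and just export the env vars
        # that the kubeflow training operator would normally provide, so that
        # torch.distributed.init_process_group() can still run locally.
        if getattr(self.task_config, "workers", None) == 0:
            os.environ.setdefault("WORLD_SIZE", "1")
            os.environ.setdefault("RANK", "0")
            os.environ.setdefault("MASTER_ADDR", "localhost")
            os.environ.setdefault("MASTER_PORT", "29500")
        return super().execute(**kwargs)
```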
What do y'all think about throwing an error if workers=0 and telling people to use a standard python config if they want to run it on a single machine?
If people really want to set workers to 0 then I understand having a smooth fallback, but otherwise it could confuse people if they make a mistake.
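Something along these lines (a hypothetical sketch of that validation, not the plugin's actual dataclass):

```python
from dataclasses import dataclass


@dataclass
class PyTorch:
    """Hypothetical sketch of the config dataclass, for illustration only."""

    workers: int = 1

    def __post_init__(self):
        # Fail fast instead of silently falling back: single-machine training
        # should just use a plain python task without any task_config.
        if self.workers < 1:
            raise ValueError(
                "workers must be >= 1; drop the task_config to run on a single machine"
            )
```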
@ByronHsu @wild-endeavor the new pytorch elastic task can run locally and in a single k8s pod but also with multiple workers using kubeflow training operator. I'd say its functionality is a superset of the already existing PyTorch task config. What do you think about using this one in order to debug dist training with a single worker @ByronHsu ?
I think falling back to a normal pod (without the kubeflow operator) when doing task_config=PyTorch(num_workers=0) doesn't make much sense, because the env vars like MASTER_ADDR, RANK, ... required by torch.distributed.init_process_group() will be set neither by the kubeflow operator nor by the pytorch task logic, so distributed training cannot be tested.
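To illustrate the failure mode (a minimal example, assuming the default `env://` rendezvous):

```python
import os

import torch.distributed as dist

# In a plain pod (no kubeflow operator) none of these are set:
print(os.environ.get("MASTER_ADDR"), os.environ.get("RANK"))  # -> None None

# The default "env://" rendezvous reads MASTER_ADDR, MASTER_PORT, RANK and
# WORLD_SIZE from the environment, so this call fails without them.
dist.init_process_group(backend="gloo", init_method="env://")
```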
I would propose to either 1) allow num_workers=0 in the PyTorch task but use the kubeflow training operator in this case as well (users who don't want to use the training operator can use Elastic), or 2) not allow num_workers=0, as is the case now.
wild-endeavor left a comment
maybe @peridotml can chime in on this pr?
should this be closed?
TL;DR
Fallback to python task if worker is zero for pytorch
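For reference, the usage this change targets would look roughly like this (hypothetical sketch; the parameter name follows the discussion above and may differ in the actual plugin release):

```python
from flytekit import task
from flytekitplugins.kfpytorch import PyTorch


# With num_workers=0, the task would fall back to executing as a plain
# python task instead of submitting a PyTorchJob to the training operator.
@task(task_config=PyTorch(num_workers=0))
def train() -> None:
    ...
```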
Type
Are all requirements met?
Tracking Issue
flyteorg/flyteplugins#348