[feature] Multi-GPU support #17
yoshikisd wants to merge 125 commits into cdtools-developers:master from
Conversation
…ction speed and losses as a function of GPU counts
…ctor_wrapper also handles DDP and takes timeout as an integer
…renamed to distributed_wrapper.
I've gotten both the reconstructor and the multi-GPU implementation into a stable state that's ready for review. It may be best to have a call to go over these changes, but in the meantime I'll leave this documentation here as an overview of the changes made in this PR.
src/cdtools/models/base.py (outdated)

```python
from cdtools.reconstructors import Adam

self.reconstructor = Adam(model=self,
                          dataset=dataset,
                          subset=subset)
```
This needs to be updated so that if the user changes which optimizer they're using, the change doesn't fail silently. I would lean towards just creating a new optimizer for each new function call (less ideal behavior, but it preserves old behavior for old scripts), but I could be swayed for sure. What do other folks think?
We should update the examples to use the new pattern with an explicitly constructed reconstructor
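To make the two patterns concrete, here is a minimal sketch of the old implicit-optimizer call versus the new explicitly constructed reconstructor. The classes below are stand-ins for illustration only, not the real cdtools API; only the names (`Adam`, `model`, `dataset`, `subset`) mirror the diff snippets in this PR.

```python
# Stand-in classes sketching the old vs. new pattern; NOT the real cdtools API.

class MockAdam:
    """Stand-in for the new explicit reconstructor class."""
    def __init__(self, model, dataset, subset=None):
        self.model = model
        self.dataset = dataset
        self.subset = subset

    def optimize(self):
        # Stand-in for the reconstruction loop: just counts frames used.
        data = self.dataset if self.subset is None else \
            [self.dataset[i] for i in self.subset]
        return len(data)


class MockModel:
    """Stand-in for CDIModel."""
    def Adam_optimize(self, dataset, subset=None):
        # Old pattern: the model silently constructs its own reconstructor.
        self.reconstructor = MockAdam(model=self, dataset=dataset, subset=subset)
        return self.reconstructor.optimize()


dataset = list(range(10))

# Old pattern: the optimizer is hidden inside the model.
model = MockModel()
n_old = model.Adam_optimize(dataset, subset=[0, 2, 4])

# New pattern: the user constructs the reconstructor explicitly,
# so switching optimizers means constructing a different reconstructor.
model2 = MockModel()
recon = MockAdam(model=model2, dataset=dataset, subset=[0, 2, 4])
n_new = recon.optimize()

assert n_old == n_new == 3
```

The explicit form makes the optimizer a visible object in the user's script, which is what avoids the silent-failure mode discussed above when the user swaps optimizers.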
```python
def __init__(self,
             model: CDIModel,
             dataset: CDataset,
             subset: Union[int, List[int]] = None):
```
Maybe subset goes in the optimize function call?
```python
        self.model.training_history += self.model.report() + '\n'
        return loss

    def optimize(self,
```
I think we haven't quite hit the perfect code pattern yet for how to update learning rates, minibatch sizes, etc. We should cycle back.
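One candidate pattern worth considering: update hyperparameters in place between optimization stages rather than rebuilding the optimizer. This sketch uses plain `torch.optim.Adam` and its `param_groups` attribute (real PyTorch API), not the cdtools reconstructor; the staging helper is hypothetical. Minibatch-size changes would analogously be handled by rebuilding the DataLoader between stages.

```python
import torch

# Toy parameter and objective: drive param toward 0.
param = torch.nn.Parameter(torch.tensor([5.0]))
opt = torch.optim.Adam([param], lr=0.1)

def run_stage(steps, lr):
    # Hypothetical staging helper: change the learning rate in place via
    # param_groups, so Adam's moment estimates survive across stages
    # (rebuilding the optimizer would reset them).
    for group in opt.param_groups:
        group["lr"] = lr
    for _ in range(steps):
        opt.zero_grad()
        loss = (param ** 2).sum()
        loss.backward()
        opt.step()
    return loss.item()

loss_a = run_stage(50, lr=0.1)    # coarse stage
loss_b = run_stage(50, lr=0.01)   # refinement stage with a smaller step
assert loss_b < 25.0              # well below the initial loss of 5.0**2
```

Whether cdtools should expose this as mutable attributes on the reconstructor or as arguments to each `optimize` call is exactly the open design question above.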
Per discussions with @allevitan, we've decided it would be best to separate this gargantuan PR into two. The first will deal with the creation of the reconstructor. For now, I'll keep this PR open if (1) we can nicely merge the changes made by the first PR into this one and (2) we can squash the commits. Otherwise, I'll just create another feature branch for multi-GPU.
I'll be closing this PR in favor of opening a different one which, IMO, has a somewhat nicer implementation of multi-GPU operation that avoids having to set up processes just to call torchrun. The cost is a few extra lines of code in the reconstruction script, but that implementation may be easier to review. Also, having >100 commits (the majority of them experimental) might make our lives difficult if we ever need to roll back to a previous version for whatever reason.


Summary of added feature (as of 07/23/2025)
Check out this comment for a summary of the features added to enable multi-GPU support for CDTools. This update also separates the optimizer from the CDIModel, which was done to allow directly using PyTorch's DistributedDataParallel.
The outdated first comment
This PR is a starting point to address Issue #8: adding multi-GPU support for CDTools.
This is a work in progress. I'm interested in exploring a couple of different parallelization approaches while trying to preserve the simplicity of the high-level CDTools interface and ensure backwards compatibility. This PR is not in a state that I feel is ready to merge into the master branch; I've submitted it as a draft to see whether you folks have any thoughts about handling multi-GPU support. If you have any recommendations on things to try/test, I'd be happy to discuss them!
Multi-GPU implementation based on DistributedDataParallel
I've gotten one naive implementation of multi-GPU support operational using PyTorch DistributedDataParallel to perform data parallelism (more details here https://pytorch.org/tutorials/beginner/dist_overview.html):
- A multi-GPU version of `examples/fancy_ptycho.py` called `examples/fancy_ptycho_multi_gpu_ddp.py`. The dataset and model inspection methods work even when using multiple GPUs.
- `examples/fancy_ptycho_multi_gpu_ddp_speed_test.py`, which performs a comparative test of the reconstruction speed/loss as a function of the number of GPUs used. This is based on `examples/fancy_ptycho.py` and `examples/fancy_ptycho_multi_gpu_ddp.py`.
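For context, the core DDP mechanics these examples rely on can be sketched in a few lines. This is a minimal single-process illustration using the CPU `gloo` backend, not the cdtools wrapper itself; a real multi-GPU run would use the `nccl` backend with one process per GPU (e.g. launched via `torchrun`), and the address/port values here are placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "world" for illustration; torchrun would normally set these.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Wrap any nn.Module in DDP; with multiple ranks, gradients are
# all-reduced across processes during backward().
model = torch.nn.Linear(4, 1)
ddp_model = DDP(model)
opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

x, y = torch.randn(8, 4), torch.randn(8, 1)
for _ in range(5):
    opt.zero_grad()
    loss = F.mse_loss(ddp_model(x), y)
    loss.backward()   # DDP synchronizes gradients here
    opt.step()

dist.destroy_process_group()
```

With `nccl` and N processes, each rank would additionally pin its own GPU (`torch.cuda.set_device(rank)`) and shard the dataset with a `DistributedSampler`, which is the data-parallel split the speed test below measures.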
Below is an output from `examples/fancy_ptycho_multi_gpu_ddp_speed_test.py`, tested using up to 2 NVIDIA RTX 6000 Ada Generation cards on a Linux server. Both single- and double-GPU tests were run with 2 trials over 100 total epochs. The plots show the mean and standard deviation of the time at which each epoch was reached (the timer started before the dataset was loaded), as well as the associated loss at each epoch. The horizontal shift between the two curves likely reflects the longer time it takes to load the model onto multiple GPUs. The width of the 2-GPU curve (i.e., the total time taken for reconstruction) is roughly half that of the 1-GPU curve.

Items to look into
Diagnosing issues
I've had issues running parallelized PyTorch scripts that may not be caused by PyTorch itself, but rather stem from communication issues between NVIDIA GPUs via the NCCL (NVIDIA Collective Communications Library) backend. These issues seem to depend strongly on the exact details of how the machine is set up. To check whether there's an issue with NCCL, build and run the following tests from https://github.com/NVIDIA/cuda-samples:
- `cuda-samples/Samples/0_Introduction/simpleP2P` and/or
- `cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest` (per the NCCL troubleshooting guide)

If these tests fail, or hang for several minutes, you may have a GPU-GPU communication issue.
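Before building the CUDA samples, a quick sanity check from Python can confirm which `torch.distributed` backends are even compiled into your PyTorch install and how many GPUs are visible (these are real PyTorch functions; the helper name is mine):

```python
import torch
import torch.distributed as dist

def report_distributed_env():
    # Quick environment check before debugging NCCL hangs: reports which
    # torch.distributed backends are available and how many GPUs are visible.
    return {
        "nccl_available": dist.is_nccl_available(),
        "gloo_available": dist.is_gloo_available(),
        "cuda_devices": torch.cuda.device_count(),
    }

print(report_distributed_env())
```

If `nccl_available` is False or `cuda_devices` doesn't match `nvidia-smi`, the problem is upstream of any cdtools code.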
Another symptom of communication-related hanging is when all activated GPUs report 100% usage on `nvidia-smi` while nothing seems to be happening. I've included some websites below which may be helpful for solving your issues.