Nccl Library Unavailable on windows #33

@cdilga

Training StyleGAN on multiple GPUs requires NCCL, which is not available on Windows.
StyleGAN reduces and applies the gradients across devices in its own custom way, which does not match the APIs exposed by TensorFlow.

This causes an error like:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node TrainD/SumAcrossGPUs/NcclAllReduce (defined at D:\data\oliver-train-checkface\fflowhq\00005-sgan-flower-1gpu\src\dnnlib\tflib\optimizer.py:135) with these attrs: [reduction="sum", shared_name="c124", T=DT_FLOAT, num_devices=2]

No drop-in replacement has been found. The API for generic TF operations such as HierarchicalAllReduce, which is used in Keras (see tensorflow/tensorflow#21470), is not compatible with the nccl_ops.py interface: https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/nccl_ops.py

Perhaps even more surprising, other ops such as those in collective_ops.py
https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/collective_ops.py
do not provide drop-in replacements either. These ops seem to have completely different use cases, as their tests make clear:
https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/nccl_ops_test.py
https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/collective_ops_test.py

The line that needs to be updated or removed seems to be the following:

g = nccl_ops.all_sum(g)

This is the point at which the gradients from all devices are summed together before each device is updated. Higher-level APIs like HierarchicalAllReduce would handle this entire process, including updating each device, but they are not well suited to this use case.
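
To make clear what a replacement for that line has to do, here is a minimal sketch of the all_sum contract in plain Python: it takes one gradient per device and returns a list of the same length in which every entry is the total. The name all_sum_fallback is hypothetical and not part of StyleGAN or TensorFlow. It only relies on the `+` operator, so it works on floats or array-like values. It has not been tested as a fix for multi-GPU training on Windows.

```python
from functools import reduce
import operator


def all_sum_fallback(per_device_grads):
    """Mimic the all_sum contract without NCCL (hypothetical helper).

    per_device_grads: list with one gradient per device. Each element must
    support `+` as elementwise addition (floats, arrays, tensors).
    Returns a list of the same length where every entry is the sum of all
    inputs, so each device sees the same reduced gradient.
    """
    if not per_device_grads:
        raise ValueError("expected at least one per-device gradient")
    # Add the per-device gradients together pairwise, left to right.
    total = reduce(operator.add, per_device_grads)
    # Hand the same reduced value back once per device.
    return [total for _ in per_device_grads]


# Example with scalars standing in for the gradients of two devices.
summed = all_sum_fallback([1.5, 2.0])
```

Unlike NCCL, this sketch computes the sum once and shares it, so with real tensors the result would live on a single device and be copied to the others as needed. Building the sum separately under each device scope (for example with tf.add_n) is another option, but that is an assumption and may be slower than a true all-reduce.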

@olivercoad
