I’m using MirroredStrategy to distribute training and related computation over multiple GPUs. My code got stuck at a strategy.gather() call. Upon inspection I noticed that the PerReplica tensors passed to gather, which were returned by an earlier strategy.run() call, all report GPU:0 as their device (while I am using 4 GPUs). However, the backing_device of each PerReplica tensor shows the correct device (GPU:0 through GPU:3).
How could this occur? Why are all tensors output by strategy.run() placed on GPU:0? I’m not using any special cross_device_ops.
Is the fact that all PerReplica tensors are put on the first device the reason that the strategy.gather() call gets stuck?
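
For reference, here is a minimal sketch of the setup I’m describing (simplified; replica_fn and the tensor shapes are placeholders, not my actual training code):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # 4 GPUs in my setup

def replica_fn():
    # Placeholder per-replica computation; each replica returns a small tensor.
    ctx = tf.distribute.get_replica_context()
    return tf.fill([2], tf.cast(ctx.replica_id_in_sync_group, tf.float32))

per_replica = strategy.run(replica_fn)

# Inspect where each per-replica component claims to live.
for t in strategy.experimental_local_results(per_replica):
    print(t.device, t.backing_device)  # in my case: device is GPU:0 for all, backing_device differs

# This is the call that hangs.
gathered = strategy.gather(per_replica, axis=0)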
Update: I checked that the CollectiveAllReduce cross-device ops are used with the MirroredStrategy, using NCCL. Here is a log statement that further backs this up:
INFO : tensorflow::_batch_all_gather : Collective batch_all_gather: 1 all-gathers, num_devices = 4, group_size = 4, implementation = CommunicationImplementation.NCCL,
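
For completeness, a rough sketch of how I surface these INFO-level collective logs and how cross_device_ops could be set explicitly (my actual code relies on the defaults; the explicit NcclAllReduce line is shown only for comparison):

import tensorflow as tf

# Surface TensorFlow's INFO-level messages, including the collective-ops log above.
tf.get_logger().setLevel('INFO')

# My code uses the defaults:
strategy = tf.distribute.MirroredStrategy()

# For comparison only: choosing the cross-device ops explicitly would look like this.
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.NcclAllReduce())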