Hi everyone. As the title says, the model's performance drops when I train on a cluster of GPUs instead of a single one.
The (custom) training job runs on the Vertex AI training service.
This is the image I am using: us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-9:latest
These are the machine types: a2-ultragpu-1g, a2-ultragpu-2g, and a2-ultragpu-4g, with 1, 2, and 4 GPUs respectively.
I’m following this tutorial:
This is my implementation of the strategy:
tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="cpu:0"))
I scaled the batch size linearly with the number of GPUs:
batch_size_2GPUs = batch_size_1GPU * 2
batch_size_4GPUs = batch_size_1GPU * 4
Is there anything else I need to do, at the code level, to get the same performance in each configuration?
Thanks in advance.