Hi everyone. As the title says, the model's performance drops when I train on a cluster of GPUs instead of a single one.
The (custom) training job runs on the Vertex AI training service.
This is the image I am using: us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-9:latest
These are the machine types: a2-ultragpu-1g, a2-ultragpu-2g, and a2-ultragpu-4g, with 1, 2, and 4 GPUs respectively.
I’m following this tutorial:
This is my implementation of the strategy:
tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="cpu:0"))
I scaled the batch size linearly with the number of GPUs:
batch_size_2GPUs = batch_size_1GPU * 2
batch_size_4GPUs = batch_size_1GPU * 4
Is there anything else I need to do, at the code level, to get the same performance in each configuration?
Thanks in advance.