Hi,
I was using 4x Tesla T4 (16 GB of memory each) to train my model with a batch size of 96. I have now migrated the same model and the exact same dataset to a new server with 8x NVIDIA L4 (24 GB each), and I am getting OOM with the same batch size. I experimented with lower batch sizes, and only 48 did not give me OOM. With `nvidia-smi` I see 21 GB out of 24 GB used at batch size 48, whereas on the T4s I was seeing 14 GB out of 16 GB.
I was expecting to be able to run training with an even bigger batch size compared to the T4s. Can somebody explain how I can set up the L4s to run a bigger batch size?
I am using TensorFlow 2.16.1 and CUDA 12.
Also, the model uses a distributed strategy, mixed precision, and a custom train step.
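For context, the relevant parts of my setup look roughly like this (heavily simplified sketch, not my actual code: the architecture, optimizer, loss, and data below are all placeholders):

```python
import numpy as np
import tensorflow as tf

# Mixed precision: float16 compute, float32 variables.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Replicates the model across all visible GPUs (falls back to CPU if none).
strategy = tf.distribute.MirroredStrategy()

class ToyModel(tf.keras.Model):
    # Placeholder architecture; only the custom train_step structure
    # resembles what I actually run.
    def __init__(self):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(8, activation="relu")
        # Final layer kept in float32 for numerical stability under mixed precision.
        self.out = tf.keras.layers.Dense(1, dtype="float32")

    def call(self, inputs):
        return self.out(self.hidden(inputs))

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compute_loss(y=y, y_pred=y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

with strategy.scope():
    model = ToyModel()
    model.compile(optimizer="adam", loss="mse")

# Dummy data just to show the batch size that OOMs on the L4s.
x = np.random.rand(96, 4).astype("float32")
y = np.random.rand(96, 1).astype("float32")
history = model.fit(x, y, batch_size=96, epochs=1, verbose=0)
```

With `MirroredStrategy`, my understanding is that the global batch (96) is split across the replicas, so I expected per-GPU memory use to go down on 8 GPUs, not up.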
Thank you in advance!