Hi,
I was using 4x Tesla T4 (16 GB of memory each) to train my model with a batch size of 96. I have now migrated the same model and the exact same dataset to a new server with 8x NVIDIA L4 (24 GB each), and I am getting OOM with the same batch size. I experimented with lower batch sizes, and only 48 did not give me OOM. With `nvidia-smi` I see 21 GB out of 24 GB used at batch size 48, whereas on the T4s I was seeing 14 GB out of 16 GB.
I was expecting to be able to run training with an even bigger batch size compared to the T4s. Can somebody explain how I can set up the L4s to run a bigger batch size?
I am using TensorFlow 2.16.1 and CUDA 12.
Also, the model uses a distributed strategy, mixed precision, and a custom train step.
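For context, the relevant parts of my setup look roughly like this (heavily simplified sketch, not my actual code: the architecture, optimizer, loss, and data below are all placeholders):

```python
import numpy as np
import tensorflow as tf

# Mixed precision: float16 compute, float32 variables.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Replicates the model across all visible GPUs (falls back to CPU if none).
strategy = tf.distribute.MirroredStrategy()

class ToyModel(tf.keras.Model):
    # Placeholder architecture; only the custom train_step structure
    # resembles what I actually run.
    def __init__(self):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(8, activation="relu")
        # Final layer kept in float32 for numerical stability under mixed precision.
        self.out = tf.keras.layers.Dense(1, dtype="float32")

    def call(self, inputs):
        return self.out(self.hidden(inputs))

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compute_loss(y=y, y_pred=y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

with strategy.scope():
    model = ToyModel()
    model.compile(optimizer="adam", loss="mse")

# Dummy data just to show the batch size that OOMs on the L4s.
x = np.random.rand(96, 4).astype("float32")
y = np.random.rand(96, 1).astype("float32")
history = model.fit(x, y, batch_size=96, epochs=1, verbose=0)
```

With `MirroredStrategy`, my understanding is that the global batch (96) is split across the replicas, so I expected per-GPU memory use to go down on 8 GPUs, not up.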
Thank you in advance!