TF2 Keras OOM Training ImageNet with MobileNet V2 (4-GPU)

Current Behaviour?

As part of the effort of validating the official pretrained Keras models, I tried to measure the accuracy and try training MobileNet V2 with ImageNet2012. The measure accuracy is lower than expected, which is documented separately. The training also seems problematic with OOM errors even when the batch size is reduced to 32.

The experiment was performed on a 4-GPU node with TF2/Keras. The ImageNet 2012 dataset is prepared using tfds. The multi-GPU training used tf.distribute.MirroredStrategy(). Please see the code and log for more details.

Standalone code to reproduce the issue


import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np
import os
import time

from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

tf.keras.backend.set_image_data_format(
    data_format='channels_last'
)

# ## MobileNet V2 Smoke Test

mbv2 = MobileNetV2(weights='imagenet')
img  = image.load_img('./dog.jpeg', target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

preds = mbv2.predict(x)
print('Predicted:', decode_predictions(preds, top=3)[0])


# ## Prepare ImageNet Train & Validation

# Get imagenet labels
labels_path = tf.keras.utils.get_file('ImageNetLabels.txt','https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt')
imagenet_labels = np.array(open(labels_path).read().splitlines())

data_dir_val  = '/home/le_user/imagenet_dataset/'
write_dir_val = '/home/le_user/imagenet_dataset_tfds_full'

# Construct a tf.data.Dataset
download_config_val = tfds.download.DownloadConfig(
    extract_dir=os.path.join(write_dir_val, 'extracted'),
    manual_dir=data_dir_val)

download_and_prepare_kwargs_val = {
    'download_dir': os.path.join(write_dir_val, 'downloaded'),
    'download_config': download_config_val,
}

def resize_with_crop(image, label):
    i = image
    i = tf.cast(i, tf.float32)
    i = tf.image.resize_with_crop_or_pad(i, 224, 224)
    i = tf.keras.applications.mobilenet_v2.preprocess_input(i)
    return (i, label)

def resize_with_crop_v3(image, label):
    i = image
    i = tf.cast(i, tf.float32)
    i = tf.image.resize_with_crop_or_pad(i, 224, 224)
    i = tf.keras.applications.mobilenet_v3.preprocess_input(i)
    return (i, label)


ds = tfds.load('imagenet2012', 
               data_dir=os.path.join(write_dir_val, 'data'),         
               split=['train', 'validation'], 
               shuffle_files=True, 
               download=False, 
               as_supervised=True,
               download_and_prepare_kwargs=download_and_prepare_kwargs_val)

AUTOTUNE = tf.data.AUTOTUNE
BATCH_SIZE_PER_REPLICA = 128
NUM_GPUS = strategy.num_replicas_in_sync


# ## Multi-GPU MB-V2 Validation & Training
ds_val_parallel = ds[1].map(resize_with_crop)
ds_val_parallel = ds_val_parallel.batch(batch_size=BATCH_SIZE_PER_REPLICA * NUM_GPUS)
ds_val_parallel = ds_val_parallel.cache().prefetch(buffer_size=AUTOTUNE)

ds_train_parallel = ds[0].map(resize_with_crop)
ds_train_parallel = ds_train_parallel.batch(batch_size=BATCH_SIZE_PER_REPLICA * NUM_GPUS)
ds_train_parallel = ds_train_parallel.cache().prefetch(buffer_size=AUTOTUNE)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    mbv2_train_parallel = keras.applications.MobileNetV2(include_top=True, 
                                                        weights='imagenet', 
                                                        classifier_activation='softmax')
    mbv2_train_parallel.trainable = True
    mbv2_train_parallel.compile(optimizer='adam',
                                loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                                metrics=['accuracy'])
start_time = time.time()
result_parallel = mbv2_train_parallel.evaluate(ds_val_parallel)
print(f"--- {strategy.num_replicas_in_sync}-GPU eval took {(time.time() - start_time)} seconds ---")

print(dict(zip(mbv2_train_parallel.metrics_names, result_parallel)))

mbv2_train_parallel.fit(
    x=ds_train_parallel,
    validation_data=ds_val_parallel,
    epochs=20
)

Relevant log output


2.9.2
2.9.0
channels_first
channels_last
1/1 [==============================] - 2s 2s/step
Predicted: [('n02109961', 'Eskimo_dog', 0.35159874), ('n02114548', 'white_wolf', 0.13579218), ('n02110063', 'malamute', 0.033763986)]
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

98/98 [==============================] - 56s 394ms/step - loss: 1.7855 - accuracy: 0.6155
--- 4-GPU eval took 56.09924674034119 seconds ---
{'loss': 1.7854773998260498, 'accuracy': 0.6154599785804749}

Epoch 1/20
INFO:tensorflow:batch_all_reduce: 158 all-reduces with algorithm = nccl, num_packs = 1

INFO:tensorflow:batch_all_reduce: 158 all-reduces with algorithm = nccl, num_packs = 1

INFO:tensorflow:batch_all_reduce: 158 all-reduces with algorithm = nccl, num_packs = 1

INFO:tensorflow:batch_all_reduce: 158 all-reduces with algorithm = nccl, num_packs = 1

 120/2503 [>.............................] - ETA: 19:42 - loss: 1.9443 - accuracy: 0.5517

---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
/tmp/ipykernel_103014/1560377782.py in <cell line: 1>()
----> 1 mbv2_train_parallel.fit(
      2     x=ds_train_parallel,
      3     validation_data=ds_val_parallel,
      4     epochs=20
      5 )

~/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     65     except Exception as e:  # pylint: disable=broad-except
     66       filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67       raise e.with_traceback(filtered_tb) from None
     68     finally:
     69       del filtered_tb

~/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     52   try:
     53     ctx.ensure_initialized()
---> 54     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     55                                         inputs, attrs, num_outputs)
     56   except core._NotOkStatusException as e:

ResourceExhaustedError: Graph execution error:

5 root error(s) found.
  (0) RESOURCE_EXHAUSTED:  Failed to allocate memory for the batch of component 0
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[RemoteCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[IteratorGetNextAsOptional]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[group_deps/_681]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (1) RESOURCE_EXHAUSTED:  Failed to allocate memory for the batch of component 0
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[RemoteCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[IteratorGetNextAsOptional]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[div_no_nan_1/_655]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (2) RESOURCE_EXHAUSTED:  Failed to allocate memory for the batch of component 0
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[RemoteCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[IteratorGetNextAsOptional]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[div_no_nan/ReadVariableOp/_612]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (3) RESOURCE_EXHAUSTED:  Failed to allocate memory for the batch of component 0
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[RemoteCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[IteratorGetNextAsOptional]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[cond/output/_14/_116]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (4) RESOURCE_EXHAUSTED:  Failed to allocate memory for the batch of component 0
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[RemoteCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[IteratorGetNextAsOptional]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_77642]

Tensorflow Version

tf 2.9.2

Custom Code

No

OS Platform and Distribution

centos rhel fedora

Python version

3.8.12

CUDA/cuDNN version

11.6

GPU model and memory

4x Tesla V100 16G

Keras

Hi @KG_Bounce

There could be the issue with GPU setup is not done properly to use in the model training as you have not installed the right CUDA version compatible to TensorFlow 2.9. TensorFlow uses its specific version of libraries to be supported.

Please try installing the CUDA 11.2 and cuDNN 8.1 for the TensorFlow 2.9 as mentioned in this tested build configuration and try executing the code again. You can also try by reducing the Batch_size or image_size to fit into memory.

Please let us know if the issue still persists. Thank you.