Current Behaviour?
As part of the effort to validate the official pretrained Keras models, I measured the accuracy of MobileNet V2 on ImageNet2012 and then tried to train it. The measured accuracy is lower than expected, which is documented in a separate issue. Training is also problematic: it runs into OOM errors even when the batch size is reduced to 32.
The experiment was performed on a 4-GPU node with TF2/Keras. The ImageNet 2012 dataset was prepared using tfds, and multi-GPU training uses tf.distribute.MirroredStrategy(). Please see the code and log below for more details.
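For reference, here is a minimal sketch of the reduced-batch-size run mentioned above. It uses synthetic data and a freshly initialized MobileNetV2, so all names and shapes are illustrative only; with MirroredStrategy, Keras splits the dataset's global batch across replicas, so a global batch of 32 on 4 GPUs means 8 examples per GPU. The full repro script follows in the next section.

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 32  # split across strategy.num_replicas_in_sync replicas

# Synthetic stand-in for the ImageNet pipeline (illustrative only).
images = np.random.rand(64, 224, 224, 3).astype("float32")
labels = np.random.randint(0, 1000, size=(64,))
ds_small = tf.data.Dataset.from_tensor_slices((images, labels)).batch(GLOBAL_BATCH_SIZE)

with strategy.scope():
    model = tf.keras.applications.MobileNetV2(weights=None, classes=1000)
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])
model.fit(ds_small, epochs=1)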
Standalone code to reproduce the issue
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np
import os
import time
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
tf.keras.backend.set_image_data_format(data_format='channels_last')
# ## MobileNet V2 Smoke Test
mbv2 = MobileNetV2(weights='imagenet')
img = image.load_img('./dog.jpeg', target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
preds = mbv2.predict(x)
print('Predicted:', decode_predictions(preds, top=3)[0])
# ## Prepare ImageNet Train & Validation
# Get imagenet labels
labels_path = tf.keras.utils.get_file('ImageNetLabels.txt','https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt')
imagenet_labels = np.array(open(labels_path).read().splitlines())
data_dir_val = '/home/le_user/imagenet_dataset/'
write_dir_val = '/home/le_user/imagenet_dataset_tfds_full'
# Construct a tf.data.Dataset
download_config_val = tfds.download.DownloadConfig(
    extract_dir=os.path.join(write_dir_val, 'extracted'),
    manual_dir=data_dir_val)
download_and_prepare_kwargs_val = {
    'download_dir': os.path.join(write_dir_val, 'downloaded'),
    'download_config': download_config_val,
}
def resize_with_crop(image, label):
    # Cast, center-crop/pad to 224x224, and apply MobileNetV2 preprocessing.
    i = tf.cast(image, tf.float32)
    i = tf.image.resize_with_crop_or_pad(i, 224, 224)
    i = tf.keras.applications.mobilenet_v2.preprocess_input(i)
    return (i, label)

def resize_with_crop_v3(image, label):
    # Same as above, but with MobileNetV3 preprocessing (not used in this repro).
    i = tf.cast(image, tf.float32)
    i = tf.image.resize_with_crop_or_pad(i, 224, 224)
    i = tf.keras.applications.mobilenet_v3.preprocess_input(i)
    return (i, label)
ds = tfds.load('imagenet2012',
               data_dir=os.path.join(write_dir_val, 'data'),
               split=['train', 'validation'],
               shuffle_files=True,
               download=False,
               as_supervised=True,
               download_and_prepare_kwargs=download_and_prepare_kwargs_val)
AUTOTUNE = tf.data.AUTOTUNE
BATCH_SIZE_PER_REPLICA = 128
# Create the strategy before using it to size the global batch.
strategy = tf.distribute.MirroredStrategy()
NUM_GPUS = strategy.num_replicas_in_sync
# ## Multi-GPU MB-V2 Validation & Training
ds_val_parallel = ds[1].map(resize_with_crop)
ds_val_parallel = ds_val_parallel.batch(batch_size=BATCH_SIZE_PER_REPLICA * NUM_GPUS)
ds_val_parallel = ds_val_parallel.cache().prefetch(buffer_size=AUTOTUNE)
ds_train_parallel = ds[0].map(resize_with_crop)
ds_train_parallel = ds_train_parallel.batch(batch_size=BATCH_SIZE_PER_REPLICA * NUM_GPUS)
ds_train_parallel = ds_train_parallel.cache().prefetch(buffer_size=AUTOTUNE)
with strategy.scope():
    mbv2_train_parallel = keras.applications.MobileNetV2(include_top=True,
                                                         weights='imagenet',
                                                         classifier_activation='softmax')
    mbv2_train_parallel.trainable = True
    mbv2_train_parallel.compile(optimizer='adam',
                                loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                                metrics=['accuracy'])
start_time = time.time()
result_parallel = mbv2_train_parallel.evaluate(ds_val_parallel)
print(f"--- {strategy.num_replicas_in_sync}-GPU eval took {(time.time() - start_time)} seconds ---")
print(dict(zip(mbv2_train_parallel.metrics_names, result_parallel)))
mbv2_train_parallel.fit(
    x=ds_train_parallel,
    validation_data=ds_val_parallel,
    epochs=20
)
Relevant log output
2.9.2
2.9.0
channels_first
channels_last
1/1 [==============================] - 2s 2s/step
Predicted: [('n02109961', 'Eskimo_dog', 0.35159874), ('n02114548', 'white_wolf', 0.13579218), ('n02110063', 'malamute', 0.033763986)]
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
98/98 [==============================] - 56s 394ms/step - loss: 1.7855 - accuracy: 0.6155
--- 4-GPU eval took 56.09924674034119 seconds ---
{'loss': 1.7854773998260498, 'accuracy': 0.6154599785804749}
Epoch 1/20
INFO:tensorflow:batch_all_reduce: 158 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 158 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 158 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 158 all-reduces with algorithm = nccl, num_packs = 1
120/2503 [>.............................] - ETA: 19:42 - loss: 1.9443 - accuracy: 0.5517
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
/tmp/ipykernel_103014/1560377782.py in <cell line: 1>()
----> 1 mbv2_train_parallel.fit(
2 x=ds_train_parallel,
3 validation_data=ds_val_parallel,
4 epochs=20
5 )
~/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
~/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
52 try:
53 ctx.ensure_initialized()
---> 54 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
55 inputs, attrs, num_outputs)
56 except core._NotOkStatusException as e:
ResourceExhaustedError: Graph execution error:
5 root error(s) found.
(0) RESOURCE_EXHAUSTED: Failed to allocate memory for the batch of component 0
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[RemoteCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[IteratorGetNextAsOptional]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[group_deps/_681]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: Failed to allocate memory for the batch of component 0
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[RemoteCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[IteratorGetNextAsOptional]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[div_no_nan_1/_655]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(2) RESOURCE_EXHAUSTED: Failed to allocate memory for the batch of component 0
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[RemoteCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[IteratorGetNextAsOptional]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[div_no_nan/ReadVariableOp/_612]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(3) RESOURCE_EXHAUSTED: Failed to allocate memory for the batch of component 0
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[RemoteCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[IteratorGetNextAsOptional]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[cond/output/_14/_116]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(4) RESOURCE_EXHAUSTED: Failed to allocate memory for the batch of component 0
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[RemoteCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[IteratorGetNextAsOptional]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_77642]
Tensorflow Version
tf 2.9.2
Custom Code
No
OS Platform and Distribution
CentOS / RHEL / Fedora
Python version
3.8.12
CUDA/cuDNN version
11.6
GPU model and memory
4x Tesla V100, 16 GB each