I am interested in scaling an existing model that uses a "custom data loader" built on tensorflow.keras.utils.Sequence to multi-GPU training. Can anybody share a few thoughts?
The "custom data loader" is built on tensorflow.keras.utils.Sequence rather than tf.data because of the nature of the dataset.
The following code is a minimal example.
The above example uses multiprocessing with the custom data loader on a single node with multiple CPUs. Is there a way to scale it to a multi-GPU mirrored strategy while keeping a custom data loader like the one in the example?
I dug around a bit, but most of the examples in the official documentation use tf.data for multi-GPU training, which makes them a little complicated to adapt.
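For reference, a Sequence-based loader of the kind referred to below might look roughly like this; the MnistSequence internals here are a simplified sketch inferred from how it is called later in the thread, not the original minimal example.

import math
import numpy as np
import tensorflow as tf

class MnistSequence(tf.keras.utils.Sequence):
    """Simplified sketch of a Sequence-based MNIST loader (assumed structure)."""
    def __init__(self, x, y, batch_size, mode):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.mode = mode  # e.g. 'TRAIN' or 'VAL'

    def __len__(self):
        # Number of batches per epoch
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        # Images as float32 in [0, 1], labels one-hot encoded
        batch_x = self.x[lo:hi].astype("float32") / 255.0
        batch_x = np.expand_dims(batch_x, -1)  # -> (batch, 28, 28, 1)
        batch_y = tf.keras.utils.to_categorical(self.y[lo:hi], num_classes=10)
        return batch_x, batch_y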
I don't fully understand what you want, but let me add my 2 cents.
If you want to train the model on multiple GPUs, you might look into distribution strategies rather than the data loader.
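For example, with tf.distribute.MirroredStrategy you create and compile the model inside the strategy scope and then call fit() as usual. A minimal sketch (the model below is just a placeholder, not the model from your example):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model; build and compile your own model here.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(...) then runs on all visible GPUs.

The variables are mirrored on each GPU and gradients are aggregated across replicas, so the training loop itself does not change.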
To use a distribution strategy, the data must be pipelined in a distributed way. Most of the examples shown use the tf.data API together with well-known datasets from TensorFlow Datasets. But if the dataset comes from a custom loader like the one above (using tensorflow.keras.utils.Sequence), things may change when distributing data across multiple GPUs. I just want to know the right way to do this.
One way to do this is tf.data.Dataset.from_generator, but something is not working out:
seq_iter_tr = lambda: (s for s in MnistSequence(x_train, y_train, batch_size, 'TRAIN'))
seq_iter_ts = lambda: (s for s in MnistSequence(x_test, y_test, batch_size, 'VAL'))

# Wrap the Sequence generators in tf.data datasets. The images and one-hot
# labels are float tensors, so tf.float32 is used in the output_signature
# (not tf.string). If the last batch can be smaller than batch_size, the
# batch dimension may need to be None instead.
seq_train = tf.data.Dataset.from_generator(
    seq_iter_tr,
    output_signature=(
        tf.TensorSpec(shape=(batch_size, 28, 28, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(batch_size, num_classes), dtype=tf.float32)))
seq_test = tf.data.Dataset.from_generator(
    seq_iter_ts,
    output_signature=(
        tf.TensorSpec(shape=(batch_size, 28, 28, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(batch_size, num_classes), dtype=tf.float32)))
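The intended usage would then be roughly as follows, relying on Keras distributing a tf.data.Dataset passed to fit() across the replicas (build_model here is a placeholder, not the actual model code from the minimal example):

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()  # placeholder: returns the compiled Keras model

# Each dataset element is already a full batch produced by the Sequence;
# Keras splits it across the replicas when the dataset is passed to fit().
model.fit(seq_train.prefetch(tf.data.AUTOTUNE),
          validation_data=seq_test.prefetch(tf.data.AUTOTUNE),
          epochs=5)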