How to train a model with huge data and limited GPU memory using tf.data.Dataset APIs

Hi, I am using a TensorFlow Keras deep learning model to train on my data.
My GPU instance has 16 GB of memory, and after the train/validation split my data is also 16 GB, so I am not able to train because of the memory limitation.
Here is my Python training script:

import argparse, os
import numpy as np
import json

import tensorflow as tf
import tensorflow.keras
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.optimizers import Adam
# from tensorflow.keras.utils import multi_gpu_model  # removed in TF 2.4+; use tf.distribute.MirroredStrategy instead
from sklearn.metrics import classification_report, accuracy_score

if __name__ == '__main__':
        
    parser = argparse.ArgumentParser()

    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--batch-size', type=int, default=128)
    parser.add_argument('--sequence-length', type=int, default=60)
    parser.add_argument('--class-weight', type=str, default='{0:1,1:1}')
    parser.add_argument('--gpu-count', type=int, default=os.environ['SM_NUM_GPUS'])
    parser.add_argument("--model_dir", type=str)
    parser.add_argument("--sm-model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--val', type=str, default=os.environ['SM_CHANNEL_VAL'])
    parser.add_argument("--current-host", type=str, default=os.environ.get("SM_CURRENT_HOST"))
    
    args, _ = parser.parse_known_args()
    epochs     = args.epochs
    lr         = args.learning_rate
    batch_size = args.batch_size
    class_weight = eval(args.class_weight)  # note: ast.literal_eval would be a safer parser here
    gpu_count  = args.gpu_count
    model_dir  = args.sm_model_dir
    training_dir   = args.train
    validation_dir = args.val
    sequence_length = args.sequence_length
        
    # load data
    # load each .npz file once and pull both arrays from it
    train_data = np.load(os.path.join(training_dir, 'train.npz'))
    X_train, y_train = train_data['X'], train_data['y']
    val_data = np.load(os.path.join(validation_dir, 'val.npz'))
    X_val, y_val = val_data['X'], val_data['y']
        
    #create model
    model = Sequential()
    model.add(LSTM(32, input_shape=(X_train.shape[1:]), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(BatchNormalization()) 

    model.add(LSTM(32))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())

    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.2))

    model.add(Dense(1, activation='sigmoid'))
  
    #if gpu_count > 1:
    #    model = multi_gpu_model(model, gpus=gpu_count)
            
    METRICS = [
      tf.keras.metrics.TruePositives(name='tp'),
      tf.keras.metrics.FalsePositives(name='fp'),
      tf.keras.metrics.TrueNegatives(name='tn'),
      tf.keras.metrics.FalseNegatives(name='fn'), 
      tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall'),
      tf.keras.metrics.AUC(name='auc'),
      tf.keras.metrics.AUC(name='prc', curve='PR'), # precision-recall curve
    ]  
    
    # compile model
    model.compile(loss=tf.keras.losses.binary_crossentropy,
                  optimizer=Adam(learning_rate=lr, decay=1e-6),  # 'lr' is deprecated in TF2; use learning_rate
                  metrics=METRICS)
    
    # Slicing using tensorflow apis
    tf_trainX_dataset = tf.data.Dataset.from_tensor_slices(X_train)
    tf_trainY_dataset = tf.data.Dataset.from_tensor_slices(y_train)
    
    # Train model
    model.fit(tf_trainX_dataset, 
              tf_trainY_dataset, 
              batch_size=batch_size,
              epochs=epochs,
              class_weight=class_weight,
              validation_data=(X_val,y_val), 
              verbose=2)

I am new to the TensorFlow data objects. Could you please help me understand how I can train on my data in chunks, so that it uses the GPU instance's memory efficiently?

Thanks in advance.

Hi @Priyanshi_Jajoo, once you have prepared your dataset you can split it into batches. For example:

dataset = tf.data.Dataset.range(8)
batch_dataset = dataset.batch(3)

The above code splits the data into batches of 3 elements each (the last batch may be smaller), and you can pass this batch_dataset to train your model. Thank you.
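Applied to (features, labels) arrays like the ones in the original script, the same batching step looks like this (the array shapes below are illustrative stand-ins, not the asker's real data):

```python
import numpy as np
import tensorflow as tf

# Small stand-in arrays; in the original script these would be X_train and y_train
X = np.random.rand(10, 60, 4).astype("float32")
y = np.random.randint(0, 2, size=(10,)).astype("float32")

# Pair features with labels in one dataset, then split it into batches of 3
dataset = tf.data.Dataset.from_tensor_slices((X, y))
batched = dataset.batch(3)

# 10 examples with batch size 3 yield batches of sizes 3, 3, 3, 1
for xb, yb in batched:
    print(xb.shape, yb.shape)
```

Because the dataset yields one batch at a time, only a single batch of tensors needs to be resident on the GPU during each training step.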


Hi @Kiran_Sai_Ramineni, thanks for your response. Could you please elaborate on how the range() method would be helpful here instead of from_tensor_slices()?

I believe the use of tf.data.Dataset.range in @Kiran_Sai_Ramineni’s code example is merely for illustration. It’s just a quick way of creating a dataset. You will want to continue to use your own dataset (but batch it).


Thanks for clarifying.

Hi @Priyanshi_Jajoo, as @rcauvin mentioned, I used tf.data.Dataset.range only for illustration. You can make batches in the same way from the dataset created with tf.data.Dataset.from_tensor_slices. Thank you.
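Tying the thread together, here is a minimal self-contained sketch of what the original training call could look like with a batched dataset. The array shapes, the tiny model, and batch_size=4 are illustrative stand-ins, not the asker's real data. One detail worth noting: batch_size must not be passed to model.fit when the input is an already-batched dataset, since the dataset itself defines the batches.

```python
import numpy as np
import tensorflow as tf

batch_size = 4

# Stand-ins for the X_train / y_train / X_val / y_val arrays in the script
X_train = np.random.rand(32, 60, 4).astype("float32")
y_train = np.random.randint(0, 2, size=(32,)).astype("float32")
X_val = np.random.rand(8, 60, 4).astype("float32")
y_val = np.random.randint(0, 2, size=(8,)).astype("float32")

# One dataset holding (features, labels) pairs, shuffled, batched, and
# prefetched so only one batch at a time must fit in GPU memory
train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .shuffle(len(X_train))
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))
val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val)).batch(batch_size)

# Tiny model just to make the sketch runnable end to end
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, input_shape=X_train.shape[1:]),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")

# batch_size is omitted: the dataset already defines the batches
history = model.fit(train_ds, epochs=1, validation_data=val_ds, verbose=0)
```

For truly huge data, the same fit call works unchanged if the datasets are instead built from files on disk (for example with tf.data.TFRecordDataset), so the full arrays never have to be loaded into memory at once.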