Slow training performance with batch_size = 1 compared to another library

Dear TensorFlow Community,

I would like to share my experience after using TensorFlow for quite some time now. I am using TensorFlow to build and train an autoencoder for a regression task. For my specific problem (300 features and 25,000 samples), training the autoencoder with Adadelta and batch_size = 1 yields the desired quality of results. A reproducible example is attached below to illustrate the following problem.

Problem:
The issue I am facing is the long training duration of the model. The training was carried out in an isolated training environment via the terminal to ensure TensorFlow graph execution. During the training, the CPU utilization was only around 20%. As the training took long, I was curious whether other deep-learning libraries would take a similar amount of time. Hence, I used H2O to train an autoencoder with settings identical to my TensorFlow model. Surprisingly, the training took significantly less time than TensorFlow for the same number of epochs, and the CPU utilization stayed above 90% for the whole training duration. This comparison was done in the same isolated training environment on the same machine. Training with the default batch_size = 32 (or even higher) in TensorFlow was faster than H2O, but the quality of results was not satisfactory for my dataset. Hyperparameter optimization was also done for my model, but the results were not as satisfactory as training with Adadelta and batch_size = 1.

Explanation and Expectation:
It is clear that using a small batch size increases the training duration due to the more frequent weight updates during backpropagation (a quick back-of-the-envelope step count follows below). As I read through the H2O Deep Learning documentation for autoencoders, I found that it also uses a mini-batch size of 1. Hence, my expectation was that when two state-of-the-art deep learning libraries use identical model and training parameters, the training duration would also be approximately the same. Of course, under the hood the technical implementation of the training procedures may differ, and this could explain the significant difference in training time. This leads me to look for optimization procedures in TensorFlow that could improve the training duration.
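
To make the update-frequency argument concrete, here is the back-of-the-envelope step count for my dataset (an illustration only, not a measurement of either library):

import math

samples = 25000
steps_bs1 = math.ceil(samples / 1)    # 25000 optimizer steps per epoch
steps_bs32 = math.ceil(samples / 32)  # 782 optimizer steps per epoch
print(f"batch_size=1:  {steps_bs1} steps/epoch")
print(f"batch_size=32: {steps_bs32} steps/epoch")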

Optimization tips that I tried:
I tried various optimization tips from TensorFlow blogs and elsewhere on the internet to speed up the training, such as using the tf.data API, training on the GPU, setting the inter- and intra-op thread counts as suggested in Intel's blogs, and finally training on both Windows and Linux (a sketch of the tf.data and threading variant follows below). None of these experiments improved the training duration; in a few of them, the training duration actually got worse.
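
For reference, the tf.data and threading variant looked roughly like the sketch below; the thread counts and prefetch settings are placeholders for the values I experimented with, not a recommendation:

import tensorflow as tf

# Thread settings must be applied before TensorFlow executes any ops
tf.config.threading.set_intra_op_parallelism_threads(14)  # placeholder: physical cores
tf.config.threading.set_inter_op_parallelism_threads(2)   # placeholder

# tf.data input pipeline feeding (input, target) pairs with batch_size = 1
train_array = tf.random.normal(shape=(25000, 300))
train_ds = (
    tf.data.Dataset.from_tensor_slices((train_array, train_array))
    .batch(1)
    .prefetch(tf.data.AUTOTUNE)
)

# tf_model.fit(train_ds, epochs=10, verbose=0)  # same model as in the example below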

For the minimal reproducible example given below, the training duration on my machine for 10 epochs of the TensorFlow model is around 4 minutes (~25 seconds/epoch), whereas H2O takes only around 1 minute. This difference remains significant even if the number of epochs is increased to 100 or 200.

Hence, I am looking for any suggestions/help to solve this problem. I am curious whether other community members have had similar experiences with TensorFlow and how they handled such situations. I would be very grateful for any hints or suggestions towards improving the training duration in TensorFlow for my use case, i.e. batch_size = 1, as I wish to continue using TensorFlow despite this issue.

PS: I am a machine learning practitioner not affiliated with any of the deep learning libraries mentioned in this post, and I am purely looking for help from the TensorFlow community/developers.

System configuration:
OS: Windows
CPU: Intel i7, 1 Socket, 14 physical cores (20 logical)
RAM: 32GB
GPU: NVIDIA A2000 4GB

Software Versions:
TensorFlow: 2.15.0
H2O: 3.46.0.2
Python: 3.10.14

Code for TensorFlow model:


import time
import tensorflow as tf
from tensorflow.keras.layers import Dense

# Synthetic stand-in for my dataset: 25,000 samples with 300 features
train_array = tf.random.normal(shape=(25000, 300))

# Autoencoder 300 -> 200 -> 100 -> 200 -> 300, all layers L1/L2-regularized
regularizer = tf.keras.regularizers.L1L2(l1=1e-3, l2=1e-3)
tf_model = tf.keras.Sequential()
tf_model.add(Dense(200, activation='tanh', kernel_regularizer=regularizer, bias_regularizer=regularizer))
tf_model.add(Dense(100, activation='tanh', kernel_regularizer=regularizer, bias_regularizer=regularizer))
tf_model.add(Dense(200, activation='tanh', kernel_regularizer=regularizer, bias_regularizer=regularizer))
tf_model.add(Dense(300, activation='tanh', kernel_regularizer=regularizer, bias_regularizer=regularizer))

optim = tf.keras.optimizers.Adadelta(learning_rate=1.0, rho=0.9, epsilon=1e-10)
tf_model.compile(optimizer=optim, loss='mse', metrics=['mse'])

start = time.monotonic()
print(f"Start Time (monotonic) = {start}")

# Reconstruction training (inputs = targets) with batch_size = 1
tf_model.fit(x=train_array, y=train_array, epochs=10, batch_size=1, verbose=0)

duration = time.monotonic() - start
print(f"Training Duration = {duration:.0f} seconds")

Code for H2O model:

import numpy as np
import time
import h2o
from h2o.estimators import H2OAutoEncoderEstimator

h2o.init(verbose=False)

# Synthetic stand-in for my dataset: 25,000 samples with 300 features
train_array = np.random.normal(size=(25000, 300))

train_frame = h2o.H2OFrame(train_array)

start = time.monotonic()
print(f"Start Time (monotonic) = {start}")

# Settings chosen to mirror the TensorFlow model: same hidden layers, activation,
# L1/L2 regularization, and ADADELTA rho/epsilon (H2O's adaptive learning rate)
h2o_ae = H2OAutoEncoderEstimator(
    training_frame=train_frame,
    autoencoder=True,
    activation="tanh",
    epsilon=1e-10,
    rho=0.9,
    input_dropout_ratio=0,
    stopping_metric="AUTO",
    stopping_tolerance=0,
    epochs=10,
    hidden=[200, 100, 200],
    l1=1e-3,
    l2=1e-3,
    standardize=False,
)

h2o_ae.train(training_frame=train_frame, verbose=False)

duration = time.monotonic() - start
print(f"Training Duration = {duration:.0f} seconds")