Single-sample evaluation (prediction/inference) speed for simple MLP slower in Keras compared to skl

Hello,
Can someone help me understand why a simple Keras MLP binary classifier evaluates (predicts) significantly slower for single samples than the skl MLPClassifier, despite training much faster? I already figured out that I should NOT use the Keras model’s predict method, since it has a lot of overhead converting the single sample into a tf dataset etc. But even with a direct call I am seeing much worse performance, and I hope someone can help me speed this up.
My application requires single-sample evaluation, I’m afraid … I cannot do batch evaluation.

Here’s a reproducer:

import numpy as np
import sklearn as skl
import sklearn.neural_network
import tensorflow as tf
# random events, just to test training time
nEvents = 100000
nFeatures = 2
train_X = np.random.uniform(size=(nEvents, nFeatures))
train_y = np.concatenate( (np.ones(nEvents//2), np.zeros(nEvents//2)) )

Compare training times:

%%time
skl_model = skl.neural_network.MLPClassifier(random_state=1,hidden_layer_sizes=(128,128,128,128),alpha=0,batch_size=512,verbose=True).fit(train_X, train_y)

Produces: Wall time: 26.5 s, and the printout says it ran for 13 epochs (iterations)
Compare to:

%%time
tf.keras.utils.set_random_seed(1)
tf_model = tf.keras.Sequential([
    tf.keras.Input((nFeatures,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1,activation='sigmoid')
])
tf_model.compile(loss='binary_crossentropy', optimizer='adam')
history = tf_model.fit(train_X, train_y, epochs=200, batch_size=512, verbose=0, callbacks=[tf.keras.callbacks.EarlyStopping(monitor='loss',patience=10)])
print(f"Ran for {len(history.epoch)} epochs")

Produces: Wall time: 17.9 s, and it ran for 29 epochs … so keras/tf is faster at training (even though it runs more epochs)! Great!

Now the problem … evaluation:

%time for i in range(10000): skl_model.predict_proba(np.array([(0.5,0.5)]))
%time for i in range(10000): tf_model(np.array([(0.5,0.5)]),training=False)

Produces:

CPU times: user 3.22 s, sys: 2.67 s, total: 5.89 s
Wall time: 3.18 s
CPU times: user 56 s, sys: 627 ms, total: 56.6 s
Wall time: 56.7 s

so keras/tf is much much slower here.
How can I close this gap?

Thanks!

Hi @willb, TensorFlow models are good at predicting on batches of data rather than a single sample at a time. If you pass a batch of samples instead of a single sample, you will observe much better performance.

%time for i in range(10000): skl_model.predict_proba(np.array([(0.5,0.5)]))
%time for i in range(1000): tf_model(np.array([[0.5, 0.5]]*10), training=False)
# Output
CPU times: user 3.29 s, sys: 3.03 s, total: 6.32 s
Wall time: 3.82 s
CPU times: user 6.09 s, sys: 59.9 ms, total: 6.15 s
Wall time: 6.2 s

Thank You.

Dear @Kiran_Sai_Ramineni ,

Unfortunately you missed an important point in my question:

I was hoping there might be some ways to optimize the inference once I have finished with the training. I read about things like tflite and “freezing” but I was hoping someone can help me cut through the noise and get a simple example of how to optimize this.
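For what it’s worth, the TFLite route mentioned above can be sketched roughly as follows. This is a minimal, untuned example (the model is rebuilt small here just so the snippet is self-contained; in practice you would convert the trained tf_model from the reproducer):

```python
import numpy as np
import tensorflow as tf

# Stand-in for the trained model; in practice, convert tf_model directly.
model = tf.keras.Sequential([
    tf.keras.Input((2,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Convert the Keras model to a TFLite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# The TFLite Interpreter is designed for low-overhead inference.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.array([[0.5, 0.5]], dtype=np.float32)
interpreter.set_tensor(inp['index'], x)
interpreter.invoke()
pred = interpreter.get_tensor(out['index'])
```

After the one-time conversion and allocate_tensors() cost, set_tensor/invoke/get_tensor avoids most of the Python-side dispatch overhead of calling the Keras model.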

Thanks

Hi @willb, apologies for that. I tried predicting with the TF model on a single sample in Colab and found only a small difference between the TF and sklearn models, about 3 seconds. Could you please let me know in which environment you are executing the code, and also provide the TensorFlow and Keras versions? Thank You.

The code I posted above was executed in colab.

Have you got a link to your notebook? Here is mine: Google Colab

Hi @willb, I tried again to reproduce the issue in Colab and got the difference you are facing. I made the predictions with XLA enabled and can see a decrease in time:

CPU times: user 2.61 s, sys: 2.11 s, total: 4.72 s
Wall time: 2.44 s
#keras model with xla
CPU times: user 8.03 s, sys: 361 ms, total: 8.39 s
Wall time: 8.32 s

please refer to this gist for working code example. Thank You.
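For readers who cannot open the gist, the XLA-enabled call presumably looks something like the following sketch (the model is rebuilt small here for self-containment; jit_compile=True requests XLA compilation, and the fixed input_signature avoids retracing on repeated calls):

```python
import tensorflow as tf

# Stand-in for the trained tf_model from the question.
model = tf.keras.Sequential([
    tf.keras.Input((2,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# jit_compile=True asks TensorFlow to compile the call with XLA;
# the fixed input_signature prevents retracing across calls.
@tf.function(jit_compile=True,
             input_signature=[tf.TensorSpec(shape=(1, 2), dtype=tf.float32)])
def xla_predict(x):
    return model(x, training=False)

x = tf.constant([[0.5, 0.5]], dtype=tf.float32)
p = xla_predict(x)  # first call compiles; later calls reuse the compiled graph
```

The first call pays the compilation cost, so it should be excluded from any timing loop.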

Thanks @Kiran_Sai_Ramineni, I can confirm this gives a big performance boost compared to what I had before. It’s still a shame that it’s about 4x slower than the sklearn MLP classifier, though. If there are any further optimizations possible here, I would be very happy to hear them.
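One candidate further optimization, not from this thread but worth noting: for an MLP this small, the per-call framework overhead can dominate, and sklearn is fast for single samples largely because its forward pass is plain NumPy. You could extract the trained weights (e.g. via tf_model.get_weights(), which returns kernels and biases in layer order) and run the forward pass in NumPy yourself. A sketch, assuming only Dense layers with relu activations and a final sigmoid, with random weights standing in for the trained ones:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def numpy_predict(x, weights):
    """Forward pass through (kernel, bias) pairs: relu on hidden
    layers, sigmoid on the final layer."""
    for W, b in weights[:-1]:
        x = relu(x @ W + b)
    W, b = weights[-1]
    return sigmoid(x @ W + b)

# Hypothetical weights standing in for tf_model.get_weights(),
# paired up as (kernel, bias) per layer.
rng = np.random.default_rng(0)
weights = [
    (rng.normal(size=(2, 128)), np.zeros(128)),
    (rng.normal(size=(128, 128)), np.zeros(128)),
    (rng.normal(size=(128, 1)), np.zeros(1)),
]
p = numpy_predict(np.array([[0.5, 0.5]]), weights)
```

Whether this beats the XLA-compiled call would need measuring, but it removes all TF dispatch overhead from the single-sample path.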