I have a model trained on the GTSRB dataset that I want to prune. I applied the following pruning schedule:
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0,
        final_sparsity=0.99,
        begin_step=0,
        end_step=end_step,
        frequency=1)
}
and ran the pruning fit method for 100 epochs, with the final goal of demonstrating that a highly pruned model has a faster execution time during inference. I converted both the initial trained model and the final pruned model to TensorFlow Lite, then ran each through the Interpreter and recorded the average inference time over 100 executions.
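The pruning and conversion step looks roughly like this (a sketch rather than my exact script; the model, optimizer, and training-set variable names are placeholders):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the trained model so magnitude pruning is applied during fine-tuning.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
pruned_model.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])

# UpdatePruningStep advances the PolynomialDecay schedule at every training step.
pruned_model.fit(x_train_norm, y_train,
                 epochs=100,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers so the exported graph is a plain Keras model.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

# Convert to TensorFlow Lite and write the flatbuffer to disk.
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
with open("simplified.tflite", "wb") as f:
    f.write(converter.convert())

The timing loop for one of the converted models is: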
import datetime
import tensorflow as tf

x_test_norm, y_test = load_data()

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="simplified.tflite")
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a single test sample (already shaped and typed to match the input tensor).
interpreter.set_tensor(input_details[0]['index'], x_test_norm[0])

# Time 100 consecutive invocations and report the mean in seconds.
deltas = []
for _ in range(100):
    begin = datetime.datetime.now()
    interpreter.invoke()
    end = datetime.datetime.now()
    deltas.append(end - begin)

deltas = [x.total_seconds() for x in deltas]
print(sum(deltas) / len(deltas))
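For reference, a quick way to confirm that pruning actually produced sparse weights is to check the fraction of zeros in the stripped model (sketch; final_model refers to the stripped model in the snippet above):

import numpy as np

# Fraction of exactly-zero values across all weight tensors of the stripped model.
weights = final_model.get_weights()
zero_count = sum(int(np.sum(w == 0)) for w in weights)
total_count = sum(w.size for w in weights)
print("zero fraction:", zero_count / total_count)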
Unfortunately the recorded times are almost identical. My questions are: am I doing something wrong during the pruning phase? Shouldn't the pruned model use sparse math operations and thus be much faster than the non-pruned model? Is there a way to force TensorFlow Lite to use sparse operators to decrease the inference time?
Both TensorFlow Lite models are run on a Jetson Nano with TensorFlow 2.4.1, which is not the latest version but is fully supported.