Extract quantization parameters used for convolution layer weights and perform layer-by-layer inference in C++

Hi everyone,

The aim of the project is to extract the weights from a trained and quantized TFLite model and perform the inference layer by layer in a C++ program, instead of using the TensorFlow library for the inference.

I am able to extract the model details and the layer weights using the interpreter functions below. Also, by setting the flag experimental_preserve_all_tensors=True I can keep the intermediate tensor results.

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path=model_path, experimental_preserve_all_tensors=True)
tensor_details = interpreter.get_tensor_details()
```
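Roughly, I inspect the reported quantization values by iterating over the returned details (this is where the (0, 0) pair for the convolution kernel shows up):

```python
for d in tensor_details:
    # d['quantization'] is the (scale, zero_point) pair listed in the tables below
    print(d['index'], d['name'], d['shape'], d['dtype'], d['quantization'])
```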

I first run the inference with the TensorFlow library in a Python script and record the details (input tensor, weight tensor, output tensor, quantization parameters) for each layer.
Then, using the extracted weights and the same input, I run the inference again, this time with my own implementations of the layer operations (convolution, maxpool, dense) instead of the TensorFlow library.
I implemented the convolution, dense, maxpool and flatten functions in a C++ program, and I pass the preprocessed input through these functions to perform the layer-by-layer inference.
For the maxpool and dense operations both runs (TensorFlow library and my custom implementations) produce the same results, but for the convolution they differ. A sketch of the reference run is shown below.
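For reference, the first (TensorFlow) run looks roughly like this; the tensor indices are the ones from the tables at the end of this post, and `input_image` stands for my preprocessed input:

```python
interpreter.allocate_tensors()
interpreter.set_tensor(interpreter.get_input_details()[0]['index'], input_image)
interpreter.invoke()

# experimental_preserve_all_tensors=True keeps the intermediate activations
# readable, so each layer can be compared against my C++ results.
conv_out_ref = interpreter.get_tensor(12)   # conv2d output
conv_kernel  = interpreter.get_tensor(11)   # int8 kernel weights
conv_bias    = interpreter.get_tensor(10)   # int32 bias
```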

My C++ convolution implementation gives a different result from the TensorFlow library. I think the difference arises because the quantization parameters (scale and zero point) are not provided for the convolution layer weights.
In the output of interpreter.get_tensor_details(), the quantization parameters for the convolution weights tensor are (0, 0), which I believe must be wrong, since a scale of 0 turns the entire kernel into 0 once it is multiplied by the quantization parameter.
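For context, the convolution arithmetic I am trying to reproduce in C++ corresponds to the NumPy sketch below (stride 1, VALID padding, and a single per-tensor weight scale `s_w` assumed, which is exactly the value I cannot find): zero points are subtracted, the accumulation is done in int32, and the result is rescaled to the output scale.

```python
import numpy as np

def quantized_conv2d(x_q, w_q, b_q, s_x, z_x, s_w, z_w, s_y, z_y):
    # x_q: int8 input [1, H, W, Cin]; w_q: int8 kernel [Cout, KH, KW, Cin]; b_q: int32 bias [Cout]
    _, H, W, _ = x_q.shape
    c_out, kh, kw, _ = w_q.shape
    out = np.zeros((1, H - kh + 1, W - kw + 1, c_out), dtype=np.int8)
    for oy in range(out.shape[1]):
        for ox in range(out.shape[2]):
            patch = x_q[0, oy:oy + kh, ox:ox + kw, :].astype(np.int32) - z_x
            for oc in range(c_out):
                acc = np.sum(patch * (w_q[oc].astype(np.int32) - z_w)) + b_q[oc]
                # real value = s_x * s_w * acc; requantize to the output scale (s_y, z_y)
                y = int(np.round(acc * (s_x * s_w) / s_y)) + z_y
                out[0, oy, ox, oc] = np.clip(y, -128, 127)
    return out
```

I use the same approach (subtract zero points, accumulate in int32, requantize) for the dense layer, and there the results match the TensorFlow output.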

Can anyone please provide me with some insights on:

  1. how to implement the convolution function the way the TensorFlow library does it, and
  2. whether the scaling factor of the convolution weights tensor has to be calculated, or whether it is already defined in the model (and if so, where to find it)?

The model details are given below. The dense layer kernel weights have quantization parameters (0.003373592859134078, 0), but the convolution layer kernel weights have quantization parameters of (0, 0).

| Layer/Op | Input Tensors | Output Tensors | Input Shape | Output Shape | Quantization (scale, zero_point) | Datatype |
|---|---|---|---|---|---|---|
| Conv2D (conv2d) | T#0 (input), T#11 (kernel), T#10 (bias) | T#12 (conv2d output) | [1, 28, 28, 1] | [1, 26, 26, 32] | input: (0.00392157, -128), output: (0.00292, -128) | int8 (input/output), int8 (kernel), int32 (bias) |
| Dense (dense) | T#17 (flatten output), T#5 (kernel), T#4 (bias) | T#18 (dense output) | [1, 576] | [1, 64] | input: (0.02445, -128), output: (0.04887, -128) | int8 (input/output), int8 (kernel), int32 (bias) |

The tensor details are:

| Layer Name | Tensor Index | Input Shape / Weights Shape | Output Shape | Quantization (scale, zero_point) | Datatype |
|---|---|---|---|---|---|
| tfl.pseudo_qconst8 | 10 | [32] | - | (0.0, 0) | numpy.int32 |
| tfl.pseudo_qconst9 | 11 | [32, 3, 3, 1] | - | (0.0, 0) | numpy.int8 |
| sequential_1/conv2d_1 | 12 | [1, 26, 26, 32] | - | (0.0029200201388448477, -128) | numpy.int8 |
| sequential_1/flatten_1 | 17 | [1, 576] | - | (0.024446312338113785, -128) | numpy.int8 |
| sequential_1/dense_1 | 18 | [1, 64] | - | (0.048872217535972595, -128) | numpy.int8 |
| tfl.pseudo_qconst2 | 4 | [64] | - | (8.247190271504223e-05, 0) | numpy.int32 |
| tfl.pseudo_qconst3 | 5 | [64, 576] | - | (0.003373592859134078, 0) | numpy.int8 |