Hello,
I am trying to figure out how exactly quantized dense layers are calculated. I found a description of the different computations needed and wanted to reimplement them, but with the parameters extracted from a TFLite model. However, I am not seeing the expected results. Here is my code:
import tensorflow as tf
from tensorflow import keras
import numpy as np
i = 16
j = 48
k = 32
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(i, j), batch_size=1))
model.add(tf.keras.layers.Dense(k, activation=None, use_bias=False))
def representative_data_gen():
    dataset = [
        np.array(
            np.random.randint(-10, 10, size=(i, j)),
            dtype=np.float32,
        )
        for s in range(10)
    ]
    for input_value in dataset:
        # Model has only one input, so each data point has one element.
        yield [input_value]
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS, tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.target_spec.supported_types = [tf.int8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model_quant = converter.convert()
# Run the model with TensorFlow Lite
interpreter = tf.lite.Interpreter(model_content=tflite_model_quant)
interpreter.allocate_tensors()
tensor_details = interpreter.get_tensor_details()
out_details = interpreter.get_output_details()
# input_idx, weight_idx and out_idx are the tensor indices of the input,
# the Dense weights and the output (lookup code omitted here)
input_tensor = np.squeeze(interpreter.get_tensor(input_idx))    # shape: [i, j]
weight_tensor = np.squeeze(interpreter.get_tensor(weight_idx))  # shape: [k, j]
out_tensor = np.squeeze(interpreter.get_tensor(out_idx))        # shape: [i, k]
reduction_dim_size = weight_tensor.shape[1]
zp_input = tensor_details[input_idx]["quantization_parameters"]["zero_points"][0]
q_input = tensor_details[input_idx]["quantization_parameters"]["scales"][0]
zp_weight = tensor_details[weight_idx]["quantization_parameters"]["zero_points"][0]
q_weight = tensor_details[weight_idx]["quantization_parameters"]["scales"][0]
zp_out = tensor_details[out_idx]["quantization_parameters"]["zero_points"][0]
q_out = tensor_details[out_idx]["quantization_parameters"]["scales"][0]
result = (zp_out +
          ((q_input*q_weight)/q_out) +
          np.matmul(input_tensor, weight_tensor.T)
          - zp_weight * np.sum(input_tensor.astype(np.int32), axis=1, keepdims=True)
          - zp_input * np.sum(weight_tensor.T.astype(np.int32), axis=0, keepdims=True)
          + reduction_dim_size*zp_input*zp_weight)
result = np.round(result, decimals=0)
result = np.clip(result, a_min=-128, a_max=127)
result = result.astype(np.int8)
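For context, the affine mapping I am assuming between the int8 tensors and the real values is real ≈ scale * (q - zero_point), with a single scale and zero point per tensor. Under that assumption, a pure float reference (only a sketch of my understanding, using plain round-to-nearest rather than whatever rounding TFLite applies internally) would be:

# Sketch only: dequantize the int8 tensors, do the matmul in float,
# then requantize with the output scale and zero point.
deq_input = q_input * (input_tensor.astype(np.float32) - zp_input)      # [i, j]
deq_weight = q_weight * (weight_tensor.astype(np.float32) - zp_weight)  # [k, j]
float_out = np.matmul(deq_input, deq_weight.T)                          # [i, k]
ref = np.clip(np.round(float_out / q_out) + zp_out, -128, 127).astype(np.int8)

My integer-only version above is meant to be this same computation, just with the scaling pulled out of the matmul.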
When I run the code above, the result matrix differs from out_tensor: all values saturate. What am I doing wrong here? Is it maybe related to matmul using the transpose of the weight matrix, and should I handle this differently?
The largest values come from the zp_input * np.sum(weight_tensor.T.astype(np.int32), axis=0, keepdims=True) term, so I suspect that this one is doing something wrong, but I don't see what the problem would be. Does anyone have any ideas?
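One thing I have not verified is whether the weight tensor is quantized per-tensor or per-channel; above I only read scales[0] and zero_points[0]. A quick check (just a sketch) to look at the full parameter arrays would be:

# Sketch: print the full scale / zero-point arrays instead of just index 0,
# to see whether any tensor carries per-channel parameters.
for idx in (input_idx, weight_idx, out_idx):
    qp = tensor_details[idx]["quantization_parameters"]
    print(tensor_details[idx]["name"],
          "scales:", qp["scales"],
          "zero_points:", qp["zero_points"],
          "quantized_dimension:", qp["quantized_dimension"])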