Full integer quantization

Hi all,

I’ve tried to apply full integer quantization optimization (uint8), using this guide:
https://ai.google.dev/edge/litert/models/post_training_integer_quant
But my output log is not encouraging:


As can be seen from the log, “full_quantize” is 0, which probably means that quantization was not successful?
“input_inference_type” and “output_inference_type” are as expected.

What is the usual cause of errors when applying this kind of quantization?

The weird thing is that the optimized model still has good accuracy when evaluated on the PC. So I tried to deploy it on the MCU. I am getting the same scale and zero point, and I am converting the input and output data the same way, but the inference results are much worse, dropping from 97% accuracy to 65%.
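For reference, here is a minimal sketch of the kind of input quantization / output dequantization I mean, using the scale and zero point stored in the model (the model path and the sample input are placeholders):

import numpy as np
import tensorflow as tf

# Placeholders: path to the quantized model and one float32 sample
# shaped like the model input, e.g. (1, sequence_length, nb_features)
interpreter = tf.lite.Interpreter(model_path="model_quant.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize the float input with the scale/zero point stored in the model
scale, zero_point = inp["quantization"]
sample_q = np.round(sample / scale + zero_point).astype(inp["dtype"])

interpreter.set_tensor(inp["index"], sample_q)
interpreter.invoke()

# Dequantize the raw output back to float
out_scale, out_zero_point = out["quantization"]
raw = interpreter.get_tensor(out["index"])
prediction = (raw.astype(np.float32) - out_zero_point) * out_scale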

Hi, @Sreten_Jovicic
I apologize for the delay in my response, and thank you for bringing this issue to our attention. If possible, could you please share your GitHub repo or a minimal Google Colab notebook, along with your model (if you’re using a custom model), so we can reproduce the same behavior on our end?

Convert using integer-only quantization

To quantize the input and output tensors, and to make the converter throw an error if it encounters an operation it cannot quantize, use the script below [Ref-1]. Use tf.int8 for both input and output types when targeting microcontrollers.

def representative_data_gen():
  for input_value in tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100):
    yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Ensure that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set input/output types (use int8 for TFLite-Micro compatibility)
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model_quant = converter.convert()
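
After conversion, you can quickly verify that the input and output tensors really are int8 and inspect their scale and zero point (a small sketch using the Interpreter on the tflite_model_quant produced above):

import tensorflow as tf

# Load the quantized model from memory and inspect its input/output tensors
interpreter = tf.lite.Interpreter(model_content=tflite_model_quant)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

print("input dtype :", input_details["dtype"])           # expect numpy.int8
print("input (scale, zero_point) :", input_details["quantization"])
print("output dtype:", output_details["dtype"])          # expect numpy.int8
print("output (scale, zero_point):", output_details["quantization"])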

Meanwhile, you can use the LiteRT Model Analyzer API, which helps you analyze models in LiteRT format; please refer to the official documentation.
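
For example, a minimal call on the converted model content (assuming the tflite_model_quant variable from the script above):

import tensorflow as tf

# Prints a per-operator summary of the converted model, including tensor types,
# which makes it easy to see which ops were actually quantized
tf.lite.experimental.Analyzer.analyze(model_content=tflite_model_quant)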

Thank you for your cooperation and patience.

Hi, Rahul.
Thank you for your response.

I am using an LSTM model from Keras:

from tensorflow import keras
from tensorflow.keras.layers import LSTM, Dropout, Flatten, Dense

model = tf.keras.Sequential([
    LSTM(units=50, input_shape=(sequence_length, nb_features), return_sequences=True, unroll=True),
    Dropout(0.2),
    LSTM(units=25, return_sequences=False, unroll=True),
    Dropout(0.2),
    Flatten(),
    Dense(units=nb_out, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

history = model.fit(seq_array, label_array, epochs=100, batch_size=200, validation_split=0.05, verbose=2,
          callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=0, mode='min'),
                     keras.callbacks.ModelCheckpoint(model_path, monitor='val_loss', save_best_only=True, mode='min', verbose=0)]
          )
print(history.history.keys())

Then I am using this to convert to TFLite and apply the optimization:

def representative_data_gen():
  for input_value in tf.data.Dataset.from_tensor_slices(seq_array_test_last).batch(1).take(90):
    yield [input_value]

model = tf.keras.models.load_model(tflite_model_path, compile=False)

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflm_opt_model = converter.convert()

with tf.io.gfile.GFile(tflite_opt_model_path, 'wb') as f:  # Output destination
    f.write(tflm_opt_model)

I will try with int8 (instead of uint8) for input/output as well.

I am using TF 2.19 and the latest TFLM.

Sorry for not posting the complete script; I am not sure how much of it I can disclose at the moment.

You can see the output of the conversion in my original post.

In the meantime I’ve tried the Model Analyzer API, as you suggested. I’ve run it alongside the conversion without optimization (for the same model) for comparison.
The Analyzer log is quite long; this is just a small snippet from the very end:

Hi, Rahul.

I’ve tried the quantization optimization using INT8 instead of UINT8 as you suggested.
Now I am sure that the problem is on the TFLM and/or MCU side.

I’ve confirmed that the quantized models (both UINT8 and INT8) give good results on my PC. I’ve also double-checked that my PC script and my firmware code calculate quantization and dequantization of the inputs and outputs the same way.

But, unfortunately, the MCU still calculates predictions differently and produces much worse results.

And one more thing: I’ve noticed that the predictions of the quantized model change on the MCU each time I recompile, which is something I didn’t notice with the models without quantization.