Post-training quantization. Where to learn more?

Hi
I am using post-training quantization to create 8-bit quantized TensorFlow models. The documentation is easy to follow, and I am happy with the results I get.

I want to learn more about what actually happens under the hood in post-training quantization. I have read https://arxiv.org/pdf/1712.05877.pdf, which I found cited as a reference for the quantization scheme. My impression is that it mostly describes quantization-aware training, but I could be wrong.

Do you have any recommendations for where I can learn more about post-training quantization? I assume the source code is complex to read. I need to understand the algorithm for my master's thesis.

My current, limited understanding is that parameters such as weights are fixed, so they can be quantized directly from their min and max values. Activations and model inputs, on the other hand, are dynamic, so their min/max ranges need to be estimated by running inference on a representative dataset.
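To check my own understanding, here is a small numpy sketch of the affine (asymmetric) scheme from the paper above: a scale and zero-point are derived from the tensor's min/max, and real zero is kept exactly representable. The function names are my own, not TFLite's:

```python
import numpy as np

def quantize_affine(x, num_bits=8):
    """Affine quantization: map [min(x), max(x)] onto the int8 range."""
    qmin, qmax = -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1  # -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    # Extend the range to include 0 so that real zero maps to an integer
    # exactly (important e.g. for zero padding).
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Recover an approximation of the original float tensor."""
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(3, 3).astype(np.float32)
q, s, z = quantize_affine(w)
w_hat = dequantize_affine(q, s, z)
# Reconstruction error is at most half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

For fixed weights this is a one-shot transformation; for activations the same formula applies, but min/max must first be estimated from the representative dataset, as described above.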

Hi @aqaaqa, thank you for showing interest in post-training quantization.

Here is some recommended documentation for TFLite post-training quantization:

https://www.tensorflow.org/lite/performance/model_optimization
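In practice, the workflow those docs describe boils down to a few converter settings. A minimal sketch of full-integer post-training quantization with a representative dataset (the saved-model path and calibration data here are placeholders you would replace with your own):

```python
import numpy as np
import tensorflow as tf

# Placeholder: path to your own SavedModel directory.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

# Enable post-training quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Without a representative dataset this yields dynamic range quantization
# (int8 weights, float activations). Providing one lets the converter
# calibrate activation min/max ranges for full-integer quantization.
def representative_dataset():
    for _ in range(100):
        # Placeholder calibration input; use real samples with your
        # model's input shape.
        yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
```

The key point relative to the question above: the representative dataset is only needed to estimate activation ranges; the weights are quantized from their own min/max regardless.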

Hi @battery
Thank you for the helpful links.
There are some vague points for me in these links. Consider a model with a single Convolution layer followed by ReLU activation (or any other activation) and I want to apply post-training Dynamic range quantization on that. As it has been mentioned in those links, in DRQ, weights are going to be saved in int8 and activations will be saved in floating point. Also, it has mentioned that Activations will be quantized and dequantized on the fly for the int8 supported operations. My questions:
1- What does "The activations are always stored in floating point" mean in those links? Does it mean that the output tensor of the activation function is stored in floating point, or is there something else related to the activation that is stored in floating point?
2- Since the input of the model is in floating point, I suppose the Convolution computation is done in floating point after the weights are dequantized (or perhaps there is no need to dequantize the weights)? Can you give me some information about these computations?
3- Is the output of the Convolution computation (before applying the activation) in floating point?
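For what it's worth, here is how I currently understand the dynamic-range path, sketched in numpy on a plain matmul instead of a full convolution. The per-tensor symmetric scales and the int32 accumulator are my assumptions based on the docs above, not the actual TFLite kernels:

```python
import numpy as np

np.random.seed(0)

def quantize_symmetric(x, num_bits=8):
    """Per-tensor symmetric quantization: scale from max |x|, zero-point 0."""
    scale = float(np.max(np.abs(x))) / (2 ** (num_bits - 1) - 1)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Weights are quantized once, at conversion time, and stored as int8.
w = np.random.randn(4, 3).astype(np.float32)
w_q, w_scale = quantize_symmetric(w)

# The model input arrives in float32 (question 2) ...
x = np.random.randn(2, 4).astype(np.float32)

# ... and is quantized "on the fly" just before the int8 op.
x_q, x_scale = quantize_symmetric(x)

# Integer matmul with an int32 accumulator, then dequantized back to
# float, so the conv/matmul output is float again (question 3).
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
y = acc.astype(np.float32) * (x_scale * w_scale)

# ReLU runs in float, and the activation tensor that gets stored is the
# float one (question 1, as I read it).
y = np.maximum(y, 0.0)

# The integer path closely approximates the pure-float path.
y_ref = np.maximum(x @ w, 0.0)
assert np.allclose(y, y_ref, atol=0.25)
```

So under this reading, "activations are stored in floating point" means the tensors flowing between ops are float; the int8 form of an activation only exists transiently inside a supported op. I would still appreciate confirmation from someone who knows the kernels.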

Someone reply please :weary: :sob: