Quantization spec for 16x8 quantization

Hi all,

I am quantizing models to 16x8 bit precision, but I cannot find any information on the actual spec for the quantized ops. The 8-bit version has a nice overview of the spec here: LiteRT 8-bit quantization specification | Google AI Edge | Google AI for Developers

What I would like to know is whether there is some blanket format for all the ops in 16x8, e.g. 8-bit symmetric weights / 16-bit symmetric activations for every op, or 8-bit symmetric weights / 16-bit asymmetric activations, etc.

Would anyone know where to find this information? Thanks!

Hi @Jozef_C,

Welcome to the community, thanks for the post. Sorry for the late reply.

Yes, we do have a blanket format for the 16x8 mode (W8A16), just like the 8-bit industry-standard spec.
Let’s look at it in some detail: the 16x8 blanket specification covers weights, activations and biases, and it is followed exactly by most of the compute-heavy ops, namely Conv2D, DepthwiseConv, FullyConnected, etc.

The blanket specification you are probably looking for covers WEIGHTS, ACTIVATIONS and BIASES (a small inspection sketch follows right after this list).

Weights: int8 (8-bit signed integer), symmetric: the zero point is forced to 0 and the range is restricted to [-127, +127]. This restricted range helps avoid overflow issues in hardware accumulators.

Activations: int16 (inputs/outputs as 16-bit signed integers), symmetric: the zero point is forced to 0 (the 16x8 kernels require it), and values span the int16 range of -32768 to +32767.

Biases: int64, with zero point 0. This is different from the 8-bit spec (which uses int32 biases) and worth paying attention to: the 16-bit-by-8-bit products are accumulated in a wide accumulator, and 32 bits would not give enough headroom, so biases are stored as 64-bit integers to avoid truncation or overflow during accumulation and downscaling.
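If you want to confirm this on your own model, here is a minimal sketch (assuming you already have a model converted with the 16x8 ops set, saved at a hypothetical path model_16x8.tflite) that prints each tensor's dtype, scale and zero point through the standard tf.lite.Interpreter API:

```python
import tensorflow as tf

# Hypothetical path: any model converted with the experimental 16x8 ops set.
interpreter = tf.lite.Interpreter(model_path="model_16x8.tflite")
interpreter.allocate_tensors()

for detail in interpreter.get_tensor_details():
    q = detail["quantization_parameters"]
    # Expect int16 activations, int8 weights and int64 biases,
    # with zero points of 0 for the symmetric tensors.
    print(detail["name"], detail["dtype"].__name__,
          q["scales"][:1], q["zero_points"][:1])
```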

If you want reference documentation for the biases and the blanket 16x8 format discussed above, the links are below, followed by a short conversion sketch.
Note that 16x8 is still technically labeled as experimental in these documents.

~ official 16x8 bit guide - here you can see the int16 activation and int8 weight pairing.
link: https://ai.google.dev/edge/litert/conversion/tensorflow/quantization/post_training_integer_quant_16x8

~ detailed reference document for LiteRT optimisation methods - here you can see the activation and weight quantization mentioned.
link: https://ai.google.dev/edge/litert/conversion/tensorflow/quantization/post_training_quantization
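For completeness, here is a minimal conversion sketch based on the official 16x8 guide above. The saved-model path, input shape and random calibration data are placeholders for illustration; in practice you would feed a representative dataset drawn from your real inputs:

```python
import numpy as np
import tensorflow as tf

saved_model_dir = "path/to/saved_model"  # placeholder path

def representative_dataset():
    # Placeholder calibration data; substitute real samples with your model's input shape.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Select the experimental 16x8 ops set: int16 activations, int8 weights.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]

with open("model_16x8.tflite", "wb") as f:
    f.write(converter.convert())
```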

Since you are interested in this specification, you may also like this blog on how 16x8 is used to unlock very high performance on NPUs such as Qualcomm’s Snapdragon Elite: Unlocking Peak Performance on Qualcomm NPU with LiteRT - Google Developers Blog

That’s it. Feel free to reply with any further questions about 16x8 quantisation and the blanket format.

Keep us posted on Google AI for Developers Forum.
Thanks.