Issue Replicating TF-Lite Conv2D Quantized Inference Output

Issue Replicating TF-Lite Conv2D Quantized Inference Output

I am trying to reproduce the exact layer-wise output of a quantized EfficientNet model (TFLite model, TensorFlow 2.17) by re-implementing Conv2D, DepthwiseConv2D, FullyConnected, Add, Mul, Sub and Mean operations.
All layers match except Conv2D, where I consistently see a mismatch of -1 for about ~1% of values compared to TFLite’s output in many layers (not all).

The layer outputs are obtained from:

  • interpreter.invoke()

  • interpreter._get_ops_details()

  • interpreter.get_tensor(interpreter._get_tensor_details(_input, 0)["index"])

Example of mismatch

Input SRM Output DRM Output TF-Lite Output
5907 12 13 13
-19651 -31 -31 -30
-12264 -46 -46 -46

Quantization parameters

Iz (input zero-point) = -127
Oz (output zero-point) = -3
Is (input scale) = 0.15857075154781342
Os (output scale) = 0.5111709833145142
Fs (filter scale) = 0.010584672912955284
bS (bias scale) = 0.0016784195322543383
M (real multiplier) = 0.00328347971662879
M0 (quantized multiplier) = 1805112064
shift = -8
bias = -885

Quantized multiplication implementation

I implemented TFLite’s core math primitives, including:

  • SaturatingRoundingDoublingHighMul

  • RoundingDivideByPOT

  • Single-rounding (SRM) and double-rounding (DRM) variants

(Full code included below.)

Conv2D implementation

Conv2D uses the standard convolution formula:

acc = sum((x - Iz) * (w - Fz)) + bias
acc = MultiplyByQuantizedMultiplier(acc, M[channel], shift[channel])
acc = acc + Oz
acc = clamp(acc, act_min, act_max)

(Full code included below.)

Problem

DepthwiseConv2D, which uses the same quantized-multiplier code, matches exactly with TF-Lite. But Conv2D does not, and I see a ~1% discrepancy in many values.

Why does Conv2D mismatch TF-Lite outputs while DepthwiseConv2D matches perfectly?

There is a difference in the exact TFLite implementation, but I haven’t found it yet. We need help figuring it out.

Codes

QuantizationFunction is implemented to emulate multiplication by quantized multiplier. The code is:

class QuantizationUtils:
    INT32_MIN = -2**31
    INT32_MAX =  2**31 - 1
    bit = 31


    @staticmethod
    def quantize_multiplier_smaller_than_one(real_multiplier):
        """
        Given a real multiplier in (0, 1) or > 1,
        compute the quantized int32 multiplier and shift
        that approximates real_multiplier ≈ multiplier * 2^shift.
        """


        if real_multiplier == 0:
            return 0, 0


        shift = 0
        significand = real_multiplier


        # Normalize to [0.5, 1)
        while significand < 0.5:
            significand *= 2.0
            shift -= 1
        while significand >= 1.0:
            significand /= 2.0
            shift += 1


        # Convert to fixed-point 32-bit representation (Q31 format)
        q = int(round(significand * (1 << QuantizationUtils.bit)))


        if q == (1 << QuantizationUtils.bit):
            q //= 2
            shift += 1


        return q, shift


    @staticmethod
    def saturating_rounding_doubling_high_mul(a: int, b: int) -> int:
        """Python equivalent of TFLite's SaturatingRoundingDoublingHighMul."""
        # Special overflow case: INT32_MIN * INT32_MIN → INT32_MAX
        if a == QuantizationUtils.INT32_MIN and b == QuantizationUtils.INT32_MIN:
            return QuantizationUtils.INT32_MAX


        # Perform 64-bit multiplication
        a_64 = int(a)
        b_64 = int(b)
        ab_64 = a_64 * b_64


        # Rounding offset (nudge) according to sign
        nudge = (1 << 30) if ab_64 >= 0 else (1 - (1 << 30))


        # Multiply by 2 and extract high 32 bits (simulate doubling_high_mul)
        result = (ab_64 + nudge) >> 31


        # Saturate to int32 range
        if result > QuantizationUtils.INT32_MAX:
            result = QuantizationUtils.INT32_MAX
        elif result < QuantizationUtils.INT32_MIN:
            result = QuantizationUtils.INT32_MIN


        return int(result)


    @staticmethod
    def rounding_divide_by_pot(x: int, exponent: int) -> int:
        """Equivalent to TFLite's RoundingDivideByPOT (round-to-nearest, ties away from zero)."""
        assert 0 <= exponent <= 31


        mask = (1 << exponent) - 1
        remainder = x & mask
        threshold = (mask >> 1)


        if x < 0:
            threshold += 1


        # Round to nearest integer, ties away from zero
        result = (x >> exponent) + (1 if (remainder > threshold) else 0)


        return result


    @staticmethod
    def multiply_by_quantized_multiplier_DRM(x: int, quantized_multiplier: int, shift: int) -> int:
        """
        Simulates the core quantized multiply in TFLite:
        result = x * quantized_multiplier * 2^shift, with rounding and saturation.
        """
        # shift < 0 means right shift (division by power of two)
        # shift > 0 means left shift (multiplication by power of two)
        if shift < 0:
            return QuantizationUtils.rounding_divide_by_pot(
                QuantizationUtils.saturating_rounding_doubling_high_mul(x, quantized_multiplier), -shift
            )
        else:
            return QuantizationUtils.saturating_rounding_doubling_high_mul(x, quantized_multiplier) * (1 << shift)
       
   
    @staticmethod
    def multiply_by_quantized_multiplier_SRM(x: int, quantized_multiplier: int, shift: int) -> int:
        """
        Simulates the core quantized multiply in TFLite:
        result = x * quantized_multiplier * 2^shift, with rounding and saturation.
        """
        # shift < 0 means right shift (division by power of two)
        # shift > 0 means left shift (multiplication by power of two)
        # Perform 64-bit multiplication
        a_64 = int(x)
        b_64 = int(quantized_multiplier)
        ab_64 = a_64 * b_64


        if shift < 0:
            return QuantizationUtils.rounding_divide_by_pot(
                ab_64, 31-shift
            )
        else:
            return QuantizationUtils.rounding_divide_by_pot(ab_64, 31+shift)


    @staticmethod
    def mul_by_quantized_multiplier_smaller_than_one_exp(x, M, rshift, impl="single"):
        """
        Convenience wrapper; set impl="single" to do single rounding,
        or impl="double" for double rounding.
        """
        if impl == "single":
            return QuantizationUtils.multiply_by_quantized_multiplier_SRM(x, M, rshift)
        elif impl == "double":
            return QuantizationUtils.multiply_by_quantized_multiplier_DRM(x, M, rshift)
        else:
            raise ValueError("impl must be 'single' or 'double'")

The code for convolution (Conv2D) is:

def call(self, input):
    """
    Performs the forward pass of the Conv2D layer


    acc = sum((x - Iz) * (w - Fz)) + bias
    acc = MultiplyByQuantizedMultiplier(acc, M[channel], shift[channel])
    acc = acc + Oz
    acc = clamp(acc, act_min, act_max)
    """
    if self.verbose:
        print("Running Conv2D Layer (TFLite compatible)\n--------------------")


    # === 1. Pad and prepare input ===
    input = self.input_padding(input)
    _, self.H_in_pad, self.W_in_pad, _ = input.shape
    self.input_shape_pad = (self.H_in_pad, self.W_in_pad)


    # Output size calculation
    self.output_shape = tuple(
        ((x - y) // z + 1)
        for x, y, z in zip(self.input_shape_pad, self.filter_shape, self.strides)
    )
    self.H_out, self.W_out = self.output_shape


    # === 2. Initialize output ===
    final_output = np.zeros((self.batch, self.H_out, self.W_out, self.C_out_filter), dtype=np.int32)


    # === 3. Perform convolution per output channel ===
    for channel in range(self.C_out_filter):
        # Extract per-channel filter and bias
        filter_c = self.filter[:, :, :, channel]
        bias_c = int(self.bias[channel])


        # Per-channel multiplier & shift
        M_c = self.M[channel]
        shift_c = self.shift[channel]


        for b in range(self.batch):
            for i in range(self.H_out):
                for j in range(self.W_out):
                    # Region slice
                    h_start = i * self.H_stride
                    w_start = j * self.W_stride
                    h_end = h_start + self.H_filter
                    w_end = w_start + self.W_filter


                    region = input[b, h_start:h_end, w_start:w_end, :]


                    # === 4. True TFLite accumulation ===
                    # Subtract input and weight zero-points before multiply
                    acc = np.sum((region - self.Iz) * (filter_c)) + bias_c


                    # === 5. Apply quantized scaling ===
                    acc = QuantizationUtils.mul_by_quantized_multiplier_smaller_than_one_exp(
                        acc, M_c, shift_c, impl="single"
                    )


                    # === 6. Add output zero-point and clamp ===
                    acc += self.Oz


                    # ReLU6 or ReLU (clamp to activation range)
                    if self.relu_flag:
                        acc = np.maximum(acc, self.Oz)
                    acc = np.clip(acc, self.act_min, self.act_max)


                    final_output[b, i, j, channel] = acc


    return final_output.astype(np.int8)
1 Like

Hi @Jom_Kuriakose, Welcome to the Google AI Discussion Forum. I truly appreciate your work, you are a pro. You already done the SaturatingRoundingDoublingHighMul. I checked all values and think that the issue of value mismatch is directing us towards difference between the Conv2D SRM and Depthwise DRM.

Depthwise DRM is following simple SIMD loops which strictly implemented Double Rounding - rounding once during the doubling-high-mul and againg after the bit-shift.
Now, this Conv2D is lowered via gemmlowp to the high optimised gemm path, and for saving the cycles and improving the precisions, the kernels using single rounding technique, only 64-bits product is maintained and that is rounded once only at last end of the shift.

By $dollar example you will easily see the difference - double rounding rounds $12.49 to round to $12.5 in step one, then again to $13 in step two, single rounding will yield $12 only coz the $12.49 is less than mid point thus rounded back to $12.

So suggest you to use the snippet given below which involves small fix for - rounding_divide_by_pot.


@staticmethod
def multiply_by_quantized_multiplier_SRM_CONV(x: int, M0: int, shift: int) -> int:
    # shift is usually negative for a smaller-than-one multipliers (e.g., -8)
    # TFLite Conv2D SRM will merges the 31-bit doubling and the right-shift
    total_right_shift = 31 - shift 
    
    # 1. Firstly full 64-bit multiplication
    full_product = int(x) * int(M0)
    
    # 2. Then adding a single "nudge" for rounding at the new total bit-depth
    nudge = 1 << (total_right_shift - 1)
    
    # 3. Last, one single shift and return
    return (full_product + nudge) >> total_right_shift

This will help you fix that 1% differences. Further, I will suggest you to use Netron, probably you already using it, that will help for inspecting that Conv2D layer.

I hope this snippet + info will help you get move to further steps. Thanks.