Understanding a tflite model running on an Android GPU

I have a CNN model that I have converted to tflite and want to run on an Android phone's GPU. I run it like this:

TfLiteGpu.isGpuDelegateAvailable(context).addOnSuccessListener { isGPUAvailable ->
    try {
        // Use the TFLite runtime from Google Play services and attach
        // the GPU delegate via its factory.
        val options = InterpreterApi.Options()
            .setRuntime(InterpreterApi.Options.TfLiteRuntime.FROM_SYSTEM_ONLY)
            .addDelegateFactory(GpuDelegateFactory())
        val interpreterAPI = InterpreterApi.create(modelFile, options)
        interpreterAPI.run(input, output)
        resultGPUText = output[0].joinToString(separator = "\n")
    } catch (e: Exception) {
        errorText = e.message ?: ""
        Log.e("tflite", e.message, e)
    }
}

It runs about 75% faster on the GPU than on the CPU, so that is good.
However, this is the log I get:

Initialized TensorFlow Lite runtime.
Replacing 52 out of 65 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 6 partitions for the whole graph.
Created interpreter.
Created TensorFlow Lite delegate for GPU.
Replacing 29 out of 65 node(s) with delegate (TfLiteGpuDelegateV2) node, yielding 3 partitions for the whole graph.
Initialized OpenCL-based API.
Created 1 GPU delegate kernels.
Replacing 25 out of 37 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 2 partitions for the whole graph.
Created interpreter.

I don’t fully understand what it means, but I’m reading it as: after some optimization the model is split in two, and 25 out of 37 operations still run on the CPU rather than the GPU.

I did use this function when converting the model to tflite:
tf.lite.experimental.Analyzer.analyze(model_content=tflite_model, gpu_compatibility=True), which said that everything except the first operation, a cast from uint8 to float32, was GPU compatible.

Why is so much still run on the CPU? How can I see which operations run where?
Thanks!

Hi @Johan_Ek,

Thank you for providing such a detailed explanation of your situation with the TensorFlow Lite model on Android. The log shows three main stages of optimization:

a) XNNPack delegate (CPU): 52 out of 65 operations are optimized using XNNPack, a CPU accelerator, yielding 6 partitions.
b) GPU delegate: 29 out of 65 operations are moved to the GPU and fused into a single delegate kernel, yielding 3 partitions.
c) Final XNNPack delegate (CPU): the operations left on the CPU are optimized again. After the GPU delegate fuses its 29 nodes into one kernel, the graph has 65 - 29 + 1 = 37 nodes, of which XNNPack then takes 25.
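
If you want to see the raw CPU/GPU split without XNNPack rewriting the CPU side of the log, one option is the experimental setUseXNNPACK switch. A minimal sketch, assuming the standalone org.tensorflow.lite and GPU delegate artifacts rather than the Play services runtime used in your snippet:

import java.io.File
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate

// Attach the GPU delegate and disable XNNPack, so the
// "Replacing N out of M node(s)" lines reflect only the GPU
// partitioning; leftovers run on the reference CPU kernels.
fun gpuInterpreterWithoutXnnpack(modelFile: File): Interpreter {
    val options = Interpreter.Options()
        .addDelegate(GpuDelegate())  // experimental GPU delegate
        .setUseXNNPACK(false)        // experimental switch
    return Interpreter(modelFile, options)
}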
Several factors contribute to this:

a) Compatibility: the first operation (the cast from uint8 to float32) isn't GPU compatible, which can break the graph into partitions right at the start; see the sketch after this list for one way to remove that cast.
b) Partitioning: TFLite minimizes data transfer between CPU and GPU for efficiency, which can keep borderline operations on the CPU rather than create more partitions.
c) Operation support: some operations might lack efficient GPU implementations in TFLite.
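
Regarding (a): one way to get a larger contiguous GPU partition is to re-convert the model with a float32 input, so the cast op never enters the graph, and do the uint8 to float32 conversion in app code instead. A minimal Kotlin sketch, assuming pixelBytes is a hypothetical ByteArray holding the raw uint8 input you currently feed the model, and that the re-converted model expects float32:

import java.nio.ByteBuffer
import java.nio.ByteOrder

// Convert raw uint8 bytes to a direct float32 buffer that a
// float-input TFLite model can consume.
fun toFloat32Input(pixelBytes: ByteArray): ByteBuffer {
    val input = ByteBuffer.allocateDirect(4 * pixelBytes.size)
        .order(ByteOrder.nativeOrder())
    for (b in pixelBytes) {
        input.putFloat((b.toInt() and 0xFF).toFloat())  // uint8 -> float32
    }
    input.rewind()
    return input
}

You would then call interpreterAPI.run(toFloat32Input(pixelBytes), output) as before.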