Code runs much slower on Google Cloud Platform: is PyCapsule.TFE_Py_Execute the bottleneck?

My code runs fine on my machine, completing signal filtering and inference in about 2 minutes. The same code takes about 8 minutes on GCP. Everything is slower, including, for example, calls to scipy.signal functions. Most of the extra time seems to be spent in PyCapsule.TFE_Py_Execute. Both machines run TensorFlow 2.15.1, and numpy, scipy, scikit-learn, and the nvidia* packages are the same versions. The only difference I can see that might be relevant is that the Python build on GCP comes from conda-forge.
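
One way to confirm that the two environments really match is to dump the interpreter build and package versions on each machine and diff the output. This is a minimal sketch using only the standard library. The package list is an assumption based on the libraries named above, and `environment_report` is a hypothetical helper name, not part of any library.

```python
import json
import os
import platform
import sys
from importlib import metadata

# Packages named in the question; adjust to match the actual environment.
PACKAGES = ["tensorflow", "numpy", "scipy", "scikit-learn"]


def environment_report(packages=PACKAGES):
    """Collect interpreter, platform, and package versions for diffing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        # sys.version carries build and compiler details of the interpreter
        "python_build": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "logical_cpus": os.cpu_count(),
        "packages": versions,
    }


if __name__ == "__main__":
    print(json.dumps(environment_report(), indent=2, sort_keys=True))
```

Saving the printed JSON from both machines and diffing the two files should show whether anything besides the Python build differs, including the number of logical CPUs.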

Any insights greatly appreciated!

My machine (i9-13900K, RTX A4500):

└─ 82.053 RawClassifier.classify  ../../src/module/classifier.py:209
   ├─ 71.303 Model.predictions  ../../src/module/model.py:135
   │  ├─ 43.145 Model.process  ../../src/module/model.py:78
   │  │  ├─ 24.823 load_model  keras/src/saving/saving_api.py:176
   │  │  │     [5 frames hidden]  keras
   │  │  └─ 17.803 error_handler  keras/src/utils/traceback_utils.py:59
   │  │        [22 frames hidden]  keras, tensorflow, <built-in>
   │  ├─ 15.379 Model.process  ../../src/module/model.py:78
   │  │  ├─ 6.440 load_model  keras/src/saving/saving_api.py:176
   │  │  │     [5 frames hidden]  keras
   │  │  └─ 8.411 error_handler  keras/src/utils/traceback_utils.py:59
   │  │        [12 frames hidden]  keras, tensorflow, <built-in>
   │  └─ 12.772 Model.process  ../../src/module/model.py:78
   │     ├─ 6.632 load_model  keras/src/saving/saving_api.py:176
   │     │     [6 frames hidden]  keras
   │     └─ 5.580 error_handler  keras/src/utils/traceback_utils.py:59

Compared to GCP (8 vCPU, T4):

└─ 262.203 RawClassifier.classify  ../../module/classifier.py:212
   ├─ 226.644 Model.predictions  ../../module/model.py:129
   │  ├─ 150.693 Model.process  ../../module/model.py:72
   │  │  ├─ 25.310 load_model  keras/src/saving/saving_api.py:176
   │  │  │     [6 frames hidden]  keras
   │  │  └─ 123.869 error_handler  keras/src/utils/traceback_utils.py:59
   │  │        [22 frames hidden]  keras, tensorflow, <built-in>
   │  ├─ 42.631 Model.process  ../../module/model.py:72
   │  │  ├─ 6.830 load_model  keras/src/saving/saving_api.py:176
   │  │  │     [2 frames hidden]  keras
   │  │  └─ 34.270 error_handler  keras/src/utils/traceback_utils.py:59
   │  │        [16 frames hidden]  keras, tensorflow, <built-in>
   │  └─ 33.308 Model.process  ../../module/model.py:72
   │     ├─ 7.387 load_model  keras/src/saving/saving_api.py:176
   │     │     [2 frames hidden]  keras
   │     └─ 24.427 error_handler  keras/src/utils/traceback_utils.py:59

Here is more detail on the GCP run. Note the second-to-last line, which calls PyCapsule.TFE_Py_Execute:

├─ 262.203 RawClassifier.classify  ../../module/classifier.py:212
│  ├─ 226.644 Model.predictions  ../../module/model.py:129
│  │  ├─ 226.633 Model.process  ../../module/model.py:72
│  │  │  ├─ 182.566 error_handler  keras/src/utils/traceback_utils.py:59
│  │  │  │  ├─ 182.372 Functional.predict  keras/src/engine/training.py:2451
│  │  │  │  │  ├─ 170.326 error_handler  tensorflow/python/util/traceback_utils.py:138
│  │  │  │  │  │  └─ 170.326 Function.__call__  tensorflow/python/eager/polymorphic_function/polymorphic_function.py:803
│  │  │  │  │  │     └─ 170.326 Function._call  tensorflow/python/eager/polymorphic_function/polymorphic_function.py:850
│  │  │  │  │  │        ├─ 141.490 call_function  tensorflow/python/eager/polymorphic_function/tracing_compilation.py:125
│  │  │  │  │  │        │  ├─ 137.241 ConcreteFunction._call_flat  tensorflow/python/eager/polymorphic_function/concrete_function.py:1209
│  │  │  │  │  │        │  │  ├─ 137.240 AtomicFunction.flat_call  tensorflow/python/eager/polymorphic_function/atomic_function.py:215
│  │  │  │  │  │        │  │  │  ├─ 137.239 AtomicFunction.__call__  tensorflow/python/eager/polymorphic_function/atomic_function.py:220
│  │  │  │  │  │        │  │  │  │  ├─ 137.233 Context.call_function  tensorflow/python/eager/context.py:1469
│  │  │  │  │  │        │  │  │  │  │  ├─ 137.230 quick_execute  tensorflow/python/eager/execute.py:28
│  │  │  │  │  │        │  │  │  │  │  │  ├─ 137.190 PyCapsule.TFE_Py_Execute  <built-in>
│  │  │  │  │  │        │  │  │  │  │  │  └─ 0.040 <listcomp>  tensorflow/python/eager/execute.py:54
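
Because nearly all of the time lands in `TFE_Py_Execute`, it may help to time the first `predict` call separately from later ones. The first call can include one-time costs such as graph tracing and device initialization, while later calls are closer to steady-state execution. The sketch below uses a hypothetical `time_calls` helper. `model` and `batch` in the comment stand in for the real objects and are assumptions.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Time the first call to fn separately from the calls that follow."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "first_call_s": durations[0],
        "steady_median_s": statistics.median(durations[1:]),
        "all_s": durations,
    }


# Hypothetical usage with a loaded Keras model and an input batch:
#   stats = time_calls(lambda: model.predict(batch, verbose=0))
#   print(stats["first_call_s"], stats["steady_median_s"])
```

If the first call dominates on GCP but the steady-state median is close to the local one, the gap is more likely one-time setup than raw execution speed. If every call is slower, the hardware itself is the more likely cause.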

Hi @John, this may be because the RTX A4500 delivers more floating-point operations per second than the T4 GPU. As a result, execution is faster on your local machine with the RTX A4500 than on the GCP instance with the T4.
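
A rough way to check this on both machines is to time a large matrix multiplication and convert the result to GFLOP/s. The sketch below is a micro-benchmark under stated assumptions, not a measurement of the original model. The matrix size and repeat count are arbitrary, the FLOP count uses the usual approximation of 2n^3 for an n-by-n product, and `matmul_flops`, `gflops`, and `benchmark_tf_matmul` are hypothetical helper names.

```python
import time


def matmul_flops(n):
    """Approximate floating-point operations in an (n x n) @ (n x n) matmul."""
    return 2 * n ** 3


def gflops(n, seconds):
    """Convert the time for one n x n matmul into GFLOP/s."""
    return matmul_flops(n) / seconds / 1e9


def benchmark_tf_matmul(n=4096, repeats=10):
    """Time float32 matmuls on TensorFlow's default device."""
    import tensorflow as tf  # imported here so the helpers above work without TF

    a = tf.random.normal([n, n])
    b = tf.random.normal([n, n])
    # Warm-up run; fetching a scalar to the host waits for the result.
    tf.reduce_sum(tf.linalg.matmul(a, b)).numpy()

    start = time.perf_counter()
    for _ in range(repeats):
        c = tf.linalg.matmul(a, b)
    # Fetch a scalar so the timer stops only after the last matmul has finished.
    tf.reduce_sum(c).numpy()
    elapsed = (time.perf_counter() - start) / repeats
    return gflops(n, elapsed)
```

Calling `benchmark_tf_matmul()` on each machine gives a comparable throughput figure. A large gap between the two numbers would be consistent with the explanation above.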

Thank you.