On an NVIDIA-equipped machine I can now run the training with TF 2.6, but nvidia-smi does not show the training process. If the ML code were pushing kernels onto the GPU, I would expect it to show up along with its PID (as below).
How do I verify whether training is being done on the GPU or the CPU?
As a comparison, I can run a simple vector algebra kernel on the GPU by explicitly pushing it to the GPU, and nvidia-smi correctly shows the name of the executable (a.out):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 40%   31C    P2    52W / 215W |    298MiB /  7981MiB |     49%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       973      G   /usr/lib/xorg/Xorg                 29MiB |
|    0   N/A  N/A      1207      G   /usr/bin/gnome-shell                7MiB |
|    0   N/A  N/A      3085      C   ./a.out                           257MiB |
+-----------------------------------------------------------------------------+
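One way to answer the GPU-or-CPU question from inside Python, rather than from nvidia-smi, is to ask TensorFlow which GPUs it registered and where a small op actually ran. The following is only a minimal sketch, assuming TensorFlow 2.x; `describe_tf_devices` is a hypothetical helper name, not part of TensorFlow.

```python
def describe_tf_devices():
    """Return the GPUs TensorFlow sees and the device a test matmul ran on.

    Returns None if TensorFlow cannot be imported in this environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    # Physical GPUs that TensorFlow managed to register (empty list = CPU only).
    gpus = tf.config.list_physical_devices("GPU")

    # Run a small op eagerly and read back the device it was placed on.
    x = tf.random.uniform((256, 256))
    y = tf.matmul(x, x)
    return {"gpus": [gpu.name for gpu in gpus], "matmul_device": y.device}


if __name__ == "__main__":
    report = describe_tf_devices()
    if report is None:
        print("TensorFlow is not installed")
    else:
        print("GPUs visible to TensorFlow:", report["gpus"])
        print("Test matmul ran on:", report["matmul_device"])
```

If the GPU list is empty, or the device string ends in CPU:0, the work is running on the CPU. Calling `tf.debugging.set_log_device_placement(True)` at the top of the training script should additionally log the placement of each op.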
OK, I see some problems there; it looks like a lot of path problems. Thanks for pointing me in the right direction.
>>> tf.config.get_visible_devices()
2021-10-23 08:33:36.075703: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-23 08:33:36.076294: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.076383: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.076443: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.110913: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.111172: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.111402: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.111443: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
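The warnings above say the dynamic loader cannot find the CUDA 11 and cuDNN 8 libraries, while LD_LIBRARY_PATH only points at /usr/local/cuda-10.1/lib64. A rough way to test this outside TensorFlow is to try loading the same library names directly. This is a sketch using only the standard library; `can_load` and `check_libs` are hypothetical helper names, and the library list is copied from the log.

```python
import ctypes
import os

# Library names copied from the dso_loader warnings in the log above.
REQUIRED_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcusolver.so.11",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def can_load(lib_name):
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        return False


def check_libs(names=REQUIRED_LIBS):
    """Map each library name to whether it could be loaded."""
    return {name: can_load(name) for name in names}


if __name__ == "__main__":
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "(unset)"))
    for name, ok in check_libs().items():
        print("OK     " if ok else "MISSING", name)
```

Note that LD_LIBRARY_PATH is read when the process starts, so it has to be exported in the shell before launching Python; changing `os.environ` inside the script is not expected to help.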
OK, I have now installed CUDA 11.1 and cuDNN 8.2 respectively, and then I get this:
sudo apt install cuda-11-2
apt install libcudnn8
>>> tf.config.get_visible_devices()
2021-10-23 09:06:29.654745: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2021-10-23 09:06:29.654790: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: nonroot-MS-7B22
2021-10-23 09:06:29.654800: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: nonroot-MS-7B22
2021-10-23 09:06:29.654840: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 495.29.5
2021-10-23 09:06:29.654863: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.57.2
2021-10-23 09:06:29.654869: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 470.57.2 does not match DSO version 495.29.5 -- cannot find working devices in this configuration
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
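The key lines in this log are the two version reports: the user-space libcuda is 495.29.5 while the loaded kernel module is 470.57.2, and TensorFlow refuses to use the GPU when they differ. This probably means a newer driver package was installed alongside the old one, or the machine was not rebooted after the install, but that is an assumption. The sketch below just extracts and compares the two versions from pasted log text; `driver_versions` and `versions_match` are hypothetical helper names.

```python
import re


def driver_versions(log_text):
    """Extract (libcuda_version, kernel_version) from TensorFlow's
    cuda_diagnostics log lines; either value is None if not found."""
    libcuda = re.search(r"libcuda reported version is: ([\d.]+)", log_text)
    kernel = re.search(r"kernel reported version is: ([\d.]+)", log_text)
    return (
        libcuda.group(1) if libcuda else None,
        kernel.group(1) if kernel else None,
    )


def versions_match(log_text):
    """True only if both versions were found and are identical."""
    libcuda, kernel = driver_versions(log_text)
    return libcuda is not None and libcuda == kernel


# The two relevant lines from the log above, timestamps trimmed.
SAMPLE = (
    "cuda_diagnostics.cc:200] libcuda reported version is: 495.29.5\n"
    "cuda_diagnostics.cc:204] kernel reported version is: 470.57.2\n"
)

if __name__ == "__main__":
    print(driver_versions(SAMPLE))
    print("match" if versions_match(SAMPLE) else "mismatch")
```

On a machine with the NVIDIA driver loaded, the kernel module version can also be read from /proc/driver/nvidia/version and compared with the "Driver Version" that nvidia-smi prints.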
I reinstalled everything, including Ubuntu, because otherwise the older NVIDIA packages did not seem to be completely removed. It works now! Thanks.
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       970      G   /usr/lib/xorg/Xorg                100MiB |
|    0   N/A  N/A      1163      G   /usr/bin/gnome-shell               48MiB |
|    0   N/A  N/A      1430      G   ...setup/gnome-initial-setup        2MiB |
|    0   N/A  N/A      4370      C   python3                          6941MiB |
1563/1563 [==============================] - 5s 3ms/step - loss: 0.7987 - accuracy:
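In the process table, a row of type C (compute) for python3 is what indicates that the training process holds a CUDA context on the GPU. For checking this from a script instead of by eye, nvidia-smi can print the compute processes as CSV. The sketch below assumes nvidia-smi is on PATH; `parse_compute_apps` and `gpu_compute_apps` are hypothetical helper names.

```python
import csv
import shutil
import subprocess

# CSV query for compute (type C) processes only.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, used_memory' rows into a list of dicts."""
    apps = []
    for row in csv.reader(csv_text.splitlines(), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        pid, name, mem = (field.strip() for field in row)
        # Memory is kept as a string because it may not always be numeric.
        apps.append({"pid": int(pid), "name": name, "used_mib": mem})
    return apps


def gpu_compute_apps():
    """Return compute processes reported by nvidia-smi, or None if
    nvidia-smi is unavailable or exits with an error."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    print(gpu_compute_apps())
```

Running this in a second terminal while training is in progress should list the python3 process if it is using the GPU.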