Entire computer crashes when running TensorFlow code

Hello,

I have the TensorFlow code below, stripped down from another tool, that basically crashes my entire computer when it runs (the GPU shuts off, I can't SSH in, other hardware cuts out), specifically at the point where model(…) is called. My image dataset has 3 medium-quality images in it, but even if I use just 1, the same thing happens.

I'm using the SmilingWolf/wd-v1-4-convnext-tagger-v2 model, which I manually git cloned from Hugging Face (but I've also tried the Python huggingface_hub module and hit the same issue).

  • I'm running Ubuntu 24.04 with an RTX 3090, plenty of RAM, plenty of CPU.
  • Kernel 6.8.0-48-generic
  • I see literally nothing in the Python console when it crashes, and nothing in syslog, kern.log, dmesg, or any of the GPU logs.
  • I enabled kdump and I'm not getting any crash dumps; I did verify that I could generate a test crash dump, so I'm sure kdump is working.
  • I'm on NVIDIA driver 565.57.01 and CUDA 12.7.
  • I installed the cuDNN packages too.
  • I disabled the GPU via CUDA_VISIBLE_DEVICES and the crash still happens (see the sketch after this list).
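For reference, this is the kind of minimal sanity check I've been using to confirm TensorFlow really is CPU-only when the GPU is hidden (just a sketch, nothing specific to my setup):

# Sketch: verify TensorFlow runs CPU-only when the GPU is hidden.
# Safest to set CUDA_VISIBLE_DEVICES before importing TensorFlow, since the CUDA
# runtime reads it when the GPU context is first initialized.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

print(tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))  # should print an empty list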

Any ideas how I can troubleshoot this further?

from tensorflow.keras.models import load_model
import numpy as np
from PIL import Image
import os

# Uncomment to force CPU-only execution (safest to set before importing TensorFlow).
#os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

model = load_model("./wd-v1-4-convnextv2-tagger-v2")

image_dir = './train/img/'

path_imgs = []
for filename in os.listdir(image_dir):
    if filename.endswith('.jpg') or filename.endswith('.png'):
        path = os.path.join(image_dir, filename)
        # Load, resize to the model's 448x448 input, and normalize with ImageNet mean/std.
        img = Image.open(path).convert('RGB')
        img = img.resize((448, 448))
        img_array = np.array(img).astype(np.float32) / 255.0
        mean = np.array([0.485, 0.456, 0.406]).reshape(1, 1, 3)
        std = np.array([0.229, 0.224, 0.225]).reshape(1, 1, 3)
        img_array = (img_array - mean) / std

        path_imgs.append((path, img_array))

# Stack into one (N, 448, 448, 3) batch; the mean/std arithmetic promotes the arrays
# to float64, so cast back to float32 before calling the model.
imgs = np.array([im for _, im in path_imgs], dtype=np.float32)
probs = model(imgs, training=False)  # this is the call that takes the machine down

print(probs)

Python requirements file I'm using:

--extra-index-url https://download.pytorch.org/whl/cu118
torch==2.1.2+cu118 
torchvision==0.16.2+cu118 
xformers==0.0.23.post1+cu118
bitsandbytes==0.43.0
onnxruntime-gpu==1.17.1
accelerate==0.25.0
transformers==4.36.2
diffusers[torch]==0.25.0
ftfy==6.1.1
opencv-python==4.8.1.78
einops==0.7.0
pytorch-lightning==1.9.0
prodigyopt==1.0
lion-pytorch==0.0.6
safetensors==0.4.2
altair==4.2.2
easygui==0.98.3
toml==0.10.2
voluptuous==0.13.1
huggingface-hub==0.20.1
imagesize==1.4.1
tensorflow==2.10.1
rich==13.7.0
$ nvidia-smi
Wed Nov  6 00:12:28 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:0A:00.0  On |                  N/A |
|  0%   58C    P8             36W /  350W |    1254MiB /  24576MiB |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                     
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

Hi @xxteknolust, in the Python requirements file I can see that you are using TensorFlow 2.10.1 while you have CUDA 12.7 installed, but as per the tested build configurations, TF 2.10 supports CUDA 11.2. I tried to execute the code above in Colab with Keras 3, but I was not able to load the model using this line and hit an error related to the SavedModel format:

model = load_model("./wd-v1-4-convnextv2-tagger-v2")

So I tried to load the model via huggingface_hub instead and was able to load it and call it on one test image. Please refer to this gist for a working code example. Could you please try the working code from the Colab on your machine, with CUDA and TensorFlow installed as per the tested build configuration, and let us know if you are still facing the crash? Thank you.
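The loading path is roughly along the following lines (just a sketch; the repo id and the dummy input here are assumptions, the exact code is in the linked gist):

import numpy as np
from huggingface_hub import snapshot_download
from tensorflow.keras.models import load_model

# Download the SavedModel files from the Hub instead of relying on a manual git clone.
model_dir = snapshot_download("SmilingWolf/wd-v1-4-convnextv2-tagger-v2")
model = load_model(model_dir)

# Run inference on a single dummy 448x448 RGB image just to confirm load + call work.
dummy = np.zeros((1, 448, 448, 3), dtype=np.float32)
print(model(dummy, training=False).shape)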

So I ended up figuring this out.

I noticed the warnings below, figured out the exact CUDA and cuDNN versions that were needed, and once I had those installed, my PC stopped crashing:

W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

Although it would be nice if TensorFlow didn't crash my PC when I'm missing the right combination of supported libraries.
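In case it helps anyone else, this is a quick way (just a sketch) to see which CUDA and cuDNN versions a given TensorFlow wheel was built against, so you can match your system libraries to it:

import tensorflow as tf

# Build info embedded in the installed wheel; GPU builds include the CUDA/cuDNN
# versions the wheel was compiled against (keys may be absent on CPU-only builds).
info = tf.sysconfig.get_build_info()
print("TF version:     ", tf.__version__)
print("Built for CUDA: ", info.get("cuda_version"))   # e.g. 11.2 for TF 2.10.x
print("Built for cuDNN:", info.get("cudnn_version"))  # e.g. 8 for TF 2.10.x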