Hello,
The TensorFlow code below, which I stripped down from another tool, crashes my entire computer when it runs (the GPU shuts off, I can’t SSH in, and other hardware cuts out). The crash happens specifically when model(…) is called. My image dataset contains 3 medium-quality images, but the same thing happens even with just 1.
I’m using the SmilingWolf/wd-v1-4-convnext-tagger-v2 model, which I manually git cloned from Hugging Face (I’ve also tried downloading it with the Python huggingface_hub module, with the same result).
- I’m running Ubuntu 24.04 with an RTX 3090, plenty of RAM, and plenty of CPU.
- Kernel 6.8.0-48-generic
- I see nothing in the Python console when it crashes, and nothing in syslog, kernel.log, dmesg, or any of the GPU logs.
- I enabled kdump but I’m not getting any crash dumps. I tested that I can generate a crash dump, so I’m sure kdump is working.
- I am on NVIDIA driver version 565.57.01 and CUDA version 12.7.
- I installed cuDNN as well.
- I disabled the GPU via CUDA_VISIBLE_DEVICES and the crash still happens.
Any ideas how I can troubleshoot this further?
from tensorflow.keras.models import load_model
import numpy as np
from PIL import Image
import os

# Uncomment to hide the GPU from TensorFlow
#os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

model = load_model("./wd-v1-4-convnextv2-tagger-v2")
image_dir = './train/img/'
path_imgs = []

for filename in os.listdir(image_dir):
    if filename.endswith('.jpg') or filename.endswith('.png'):
        path = os.path.join(image_dir, filename)
        # Load as RGB, resize to 448x448, and scale pixel values to [0, 1]
        img = Image.open(path).convert('RGB')
        img = img.resize((448, 448))
        img_array = np.array(img).astype(np.float32) / 255.0
        # Normalize each channel with fixed mean and standard deviation
        mean = np.array([0.485, 0.456, 0.406]).reshape(1, 1, 3)
        std = np.array([0.229, 0.224, 0.225]).reshape(1, 1, 3)
        img_array = (img_array - mean) / std
        path_imgs.append((path, img_array))

# Stack all images into one batch of shape (N, 448, 448, 3)
imgs = np.array([im for _, im in path_imgs])

# The crash happens on this call
probs = model(imgs, training=False)
print(probs)
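Since nothing reaches the console or the system logs, one way to narrow down which step triggers the shutdown is to write a breadcrumb to disk and force it out with fsync before each step, so the last line survives a hard power-off. This is a minimal sketch, not a confirmed fix: the helper name `breadcrumb` and the file name `crash_breadcrumbs.log` are hypothetical and chosen only for illustration.

```python
import os

LOG_PATH = "crash_breadcrumbs.log"  # hypothetical file name, not from the original script


def breadcrumb(message, path=LOG_PATH):
    """Append one line to the log and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(message + "\n")
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write it to the storage device


# Intended use in the script above (left as comments so this block runs on its own):
#   breadcrumb("before load_model")
#   model = load_model("./wd-v1-4-convnextv2-tagger-v2")
#   breadcrumb("before model call")
#   probs = model(imgs, training=False)
#   breadcrumb("after model call")
```

After a reboot, the last line in the file shows the last step that was reached before the machine went down.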
Python requirements file I’m using:
--extra-index-url https://download.pytorch.org/whl/cu118
torch==2.1.2+cu118
torchvision==0.16.2+cu118
xformers==0.0.23.post1+cu118
bitsandbytes==0.43.0
onnxruntime-gpu==1.17.1
accelerate==0.25.0
transformers==4.36.2
diffusers[torch]==0.25.0
ftfy==6.1.1
opencv-python==4.8.1.78
einops==0.7.0
pytorch-lightning==1.9.0
bitsandbytes==0.43.0
prodigyopt==1.0
lion-pytorch==0.0.6
safetensors==0.4.2
altair==4.2.2
easygui==0.98.3
toml==0.10.2
voluptuous==0.13.1
huggingface-hub==0.20.1
imagesize==1.4.1
tensorflow==2.10.1
rich==13.7.0
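The file pins tensorflow==2.10.1 next to PyTorch wheels built for cu118, so it may be worth confirming which versions are actually installed in the active environment. The sketch below uses only the standard library; the function name `installed_versions` and the choice of packages to check are assumptions for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names taken from the requirements file above
    names = ["tensorflow", "torch", "torchvision", "onnxruntime-gpu"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

The reported versions can then be compared against the pins and against the CUDA and cuDNN versions each framework release was built for.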
$ nvidia-smi
Wed Nov 6 00:12:28 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:0A:00.0  On |                  N/A |
|  0%   58C    P8             36W /  350W |    1254MiB /  24576MiB |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0