SLURM errors: failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error; GPU:0 unknown device

We have a SLURM batch file that fails with TF2 and Keras, and also fails when called directly on a node that has a GPU. Here is the Python script contents:

from datetime import date
import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN
from keras.optimizers import adam
from keras.layers import Dropout
from tensorflow.keras.callbacks import Callback, EarlyStopping
from sklearn.preprocessing import StandardScaler
from datetime import datetime, timedelta
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.recurrent import LSTM
from keras.models import load_model
from keras.callbacks import EarlyStopping, ModelCheckpoint
import warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = "3"
import tensorflow as tf
import logging
delay = 252
window = 60
factor = 15
K = 8.4
sbo = 1.25
sso = 1.25
sbc = 0.75
ssc = 0.5
r = 0.02
tran_cost = 0.0002
leverage = 1.0
start_val = 100
bo = 1
so = -1
X = pd.DataFrame(columns=range(0, window))
Y = []
for tag in X_pd.columns[:1]:
    # i=0 ....len(X_pd.index)-window
    for i in range(0, len(X_pd.index) - window):
        X_example = X_pd.loc[i:i + window - 1][tag].values

        X= X.append(pd.Series(X_example), ignore_index=True)
        Y.append(X_pd.loc[i + window][tag])
    print('done %s stocks' % (tag))
SS = StandardScaler()
features = SS.fit_transform(X.values)
#LSTM model
def trainLSTMModel(layers, neurons, d):
    model = Sequential()

    model.add(LSTM(neurons[0], input_shape=(layers[1], layers[2]), return_sequences=False,activation='relu'))

    #model.add(LSTM(neurons[1], input_shape=(layers[1], layers[2]), return_sequences=False))

    #model.add(Dense(neurons[2], kernel_initializer="uniform", activation='relu'))
    model.add(Dense(neurons[3], kernel_initializer="uniform", activation='relu'))
    #adam = Adam(decay=0.2)
    # predict up and down
    # model.compile(optimizer="adam", loss="binary_crossentropy", metrics=['accuracy'])
    model.compile(loss='mse', optimizer=optimizer)
    return model
time_step = 60
d = 0.3
shape = [length,time_step, output] # feature, window, output
neurons = [64, 64, 32, 1]
epochs = 100
model = trainLSTMModel(shape, neurons, d)
#shape from [samples, timesteps] into [samples, timesteps, features]
n_features = 1
X = X.reshape((X.shape[0], X.shape[1], n_features))
gpu_no = 0
with tf.device('/gpu:' + str(gpu_no)):
#    sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True))
#    keras.backend.set_session(sess)

    print('model_manager: running tensorflow version: ' + tf.__version__)
    print('model_manager: will attempt to run on ' + '/gpu:' + str(gpu_no))
    model.fit(X, Y, epochs=epochs, verbose=2,batch_size=batch_size)

The log shows this:

Loading requirement: cuda10.1/toolkit/10.1.243
Loading cm-ml-python3deps/3.3.0
  Loading requirement: gcc5/5.5.0 python36
Loading tensorflow2-py36-cuda10.1-gcc/2.0.0
  Loading requirement: ml-pythondeps-py36-cuda10.1-gcc/3.3.0
    openblas/dynamic/0.2.20 hdf5_18/1.8.20 keras-py36-cuda10.1-gcc/2.3.1
    protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.7.8
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0
2021-08-18 11:11:43.064175: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library

2021-08-18 11:18:08.026219: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-08-18 11:18:08.031771: E tensorflow/stream_executor/cuda/] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-08-18 11:18:08.031811: I tensorflow/stream_executor/cuda/] retrieving CUDA diagnostic information for host: node001
2021-08-18 11:18:08.031819: I tensorflow/stream_executor/cuda/] hostname: node001
2021-08-18 11:18:08.031921: I tensorflow/stream_executor/cuda/] libcuda reported version is: 460.73.1
2021-08-18 11:18:08.031958: I tensorflow/stream_executor/cuda/] kernel reported version is: 460.73.1
2021-08-18 11:18:08.031966: I tensorflow/stream_executor/cuda/] kernel version seems to match DSO: 460.73.1
2021-08-18 11:18:08.032266: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
Using TensorFlow backend.
done A stocks
Model: "sequential_1"
Layer (type)                 Output Shape              Param #
lstm_1 (LSTM)                (None, 64)                16896
dense_1 (Dense)              (None, 1)                 65
Total params: 16,961
Trainable params: 16,961
Non-trainable params: 0
model_manager: running tensorflow version: 2.0.0
model_manager: will attempt to run on /gpu:0
Traceback (most recent call last):
  File "", line 99, in <module>, Y, epochs=epochs, verbose=2,batch_size=batch_size)
  File "/cm/shared/apps/keras-py36-cuda10.1-gcc/2.3.1/lib/python3.6/site-packages/keras/engine/", line 1213, in fit
  File "/cm/shared/apps/keras-py36-cuda10.1-gcc/2.3.1/lib/python3.6/site-packages/keras/engine/", line 316, in _make_train_function
  File "/cm/shared/apps/keras-py36-cuda10.1-gcc/2.3.1/lib/python3.6/site-packages/keras/legacy/", line 91, in wrapper
    return func(*args, **kwargs)
  File "/cm/shared/apps/keras-py36-cuda10.1-gcc/2.3.1/lib/python3.6/site-packages/keras/backend/", line 75, in symbolic_fn_wrapper
    return func(*args, **kwargs)
  File "/cm/shared/apps/keras-py36-cuda10.1-gcc/2.3.1/lib/python3.6/site-packages/keras/", line 519, in get_updates
    for (i, p) in enumerate(params)]
  File "/cm/shared/apps/keras-py36-cuda10.1-gcc/2.3.1/lib/python3.6/site-packages/keras/", line 519, in <listcomp>
    for (i, p) in enumerate(params)]
  File "/cm/shared/apps/keras-py36-cuda10.1-gcc/2.3.1/lib/python3.6/site-packages/keras/backend/", line 963, in zeros
    v = tf.zeros(shape=shape, dtype=dtype, name=name)
  File "/cm/shared/apps/tensorflow2-py36-cuda10.1-gcc/2.0.0/lib/python3.6/site-packages/tensorflow_core/python/ops/", line 2349, in zeros
    output = _constant_if_small(zero, shape, dtype, name)
  File "/cm/shared/apps/tensorflow2-py36-cuda10.1-gcc/2.0.0/lib/python3.6/site-packages/tensorflow_core/python/ops/", line 2307, in _constant_if_small
    return constant(value, shape=shape, dtype=dtype, name=name)
  File "/cm/shared/apps/tensorflow2-py36-cuda10.1-gcc/2.0.0/lib/python3.6/site-packages/tensorflow_core/python/framework/", line 227, in constant
  File "/cm/shared/apps/tensorflow2-py36-cuda10.1-gcc/2.0.0/lib/python3.6/site-packages/tensorflow_core/python/framework/", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/cm/shared/apps/tensorflow2-py36-cuda10.1-gcc/2.0.0/lib/python3.6/site-packages/tensorflow_core/python/framework/", line 96, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
RuntimeError: /job:localhost/replica:0/task:0/device:GPU:0 unknown device.

Why is the script not seeing the GPU?

Can you try to just list the visible devices?
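For anyone following along, listing the devices could look like the minimal sketch below. It assumes TensorFlow imports cleanly in the job's environment. `list_gpus` is a hypothetical helper name, not something from the script above. `tf.config.list_physical_devices` is available from TF 2.1; on TF 2.0 only the `tf.config.experimental` variant exists, so the sketch falls back to it.

```python
def list_gpus():
    """Return the GPUs TensorFlow can see, or None if TensorFlow cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    config = tf.config
    # TF >= 2.1 exposes list_physical_devices directly;
    # TF 2.0 only has the experimental alias.
    lister = getattr(config, "list_physical_devices", None)
    if lister is None:
        lister = config.experimental.list_physical_devices
    return lister("GPU")


gpus = list_gpus()
if gpus is None:
    print("TensorFlow could not be imported")
else:
    print("GPUs visible to TensorFlow:", gpus)
```

An empty list here would mean TensorFlow loaded but found no usable GPU, which is a different situation from the script failing later inside `with tf.device(...)`.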

Part of the problem was that the code requires a TF version newer than 2.0.

The only difference I see is that the user told me he got it to work by adjusting the comment tags as such:

#sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True))

Now the GPU works.

I also changed the optimizer import from:
from keras.optimizers import adam
to:
from keras.optimizers import adam_v2

Before this the logfile blew up to 6 GB with entries like:

2021-08-19 05:08:41.796216: I tensorflow/core/framework/] No device-specific kernels found for NodeDef '{{node _SOURCE}}'Will fall back to a default kernel.
2021-08-19 05:08:41.796223: I tensorflow/core/framework/] No device-specific kernels found for NodeDef '{{node _SOURCE}}'Will fall back to a default kernel.
2021-08-19 05:08:41.796232: I tensorflow/core/framework/] Instantiating kernel for node: {{node _SINK}} = NoOp[]()
2021-08-19 05:08:41.796238: I tensorflow/core/framework/] No device-specific kernels found for NodeDef '{{node _SINK}}'Will fall back to a default kernel.
2021-08-19 05:08:41.796245: I tensorflow/core/framework/] No device-specific kernels found for NodeDef '{{node _SINK}}'Will fall back to a default kernel.
2021-08-19 05:08:41.796255: I tensorflow/core/framework/] Instantiating kernel for node: {{node training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/mod}} = FloorMod[T=DT_INT32, _class=["loc:@loss/dense_1_loss/mean_squared_error/Mean"]](training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/add, training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/Size)
2021-08-19 05:08:41.796283: I tensorflow/core/framework/] Instantiating kernel for node: {{node training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/add}} = AddV2[T=DT_INT32, _class=["loc:@loss/dense_1_loss/mean_squared_error/Mean"]](loss/dense_1_loss/mean_squared_error/Mean/reduction_indices, training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/Size)
2021-08-19 05:08:41.796303: I tensorflow/core/framework/] Instantiating kernel for node: {{node training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/Size}} = Const[_class=["loc:@loss/dense_1_loss/mean_squared_error/Mean"], dtype=DT_INT32, value=Tensor<type: int32 shape: [] values: 2>]()
2021-08-19 05:08:41.796319: I tensorflow/core/framework/] Instantiating kernel for node: {{node loss/dense_1_loss/mean_squared_error/Mean/reduction_indices}} = Const[dtype=DT_INT32, value=Tensor<type: int32 shape: [] values: -1>]()
2021-08-19 05:08:41.796335: I tensorflow/core/framework/] Instantiating kernel for node: {{node _send_training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/mod_0}} = _Send[T=DT_INT32, client_terminated=true, recv_device="/device:CPU:0", send_device="/device:CPU:0", send_device_incarnation=-6529568560417163830, tensor_name="training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/mod:0"](training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/mod)
2021-08-19 05:08:41.796357: I tensorflow/core/common_runtime/] Process node: 0 step -1 {{node _SOURCE}} = NoOp[]() device: /device:CPU:0
2021-08-19 05:08:41.796368: I tensorflow/core/common_runtime/] Process node: 4 step -1 {{node training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/Size}} = Const[_class=["loc:@loss/dense_1_loss/mean_squared_error/Mean"], dtype=DT_INT32, value=Tensor<type: int32 shape: [] values: 2>]() device: /device:CPU:0
2021-08-19 05:08:41.796378: I tensorflow/core/common_runtime/] Process node: 5 step -1 {{node loss/dense_1_loss/mean_squared_error/Mean/reduction_indices}} = Const[dtype=DT_INT32, value=Tensor<type: int32 shape: [] values: -1>]() device: /device:CPU:0
2021-08-19 05:08:41.796390: I tensorflow/core/common_runtime/] Process node: 3 step -1 {{node training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/add}} = AddV2[T=DT_INT32, _class=["loc:@loss/dense_1_loss/mean_squared_error/Mean"]](loss/dense_1_loss/mean_squared_error/Mean/reduction_indices, training/Adam/gradients/loss/dense_1_loss/mean_squared_error/Mean_grad/Size) device: /device:CPU:0

Anyway, it seems to be good now. Perhaps this will help someone down the line.

Well, in Slurm this still fails:

Loading cudnn7.6-cuda10.1/
  Loading requirement: cuda10.1/toolkit/10.1.243
Loading cm-ml-python3deps/3.3.0
  Loading requirement: gcc5/5.5.0 python36
Loading tensorflow2-py37-cuda10.1-gcc/2.2.0
  Loading requirement: python37 ml-pythondeps-py37-cuda10.1-gcc/4.1.2
    openblas/dynamic/0.2.20 hdf5_18/1.8.20 keras-py37-cuda10.1-gcc/2.3.1
    protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.7.8
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0
2021-08-20 10:36:18.057370: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
Using TensorFlow backend.
Traceback (most recent call last):
  File "", line 9, in <module>
    from keras.models import Sequential
  File "/cm/shared/apps/keras-py37-cuda10.1-gcc/2.3.1/lib/python3.7/site-packages/keras/", line 3, in <module>
    from . import utils
  File "/cm/shared/apps/keras-py37-cuda10.1-gcc/2.3.1/lib/python3.7/site-packages/keras/utils/", line 6, in <module>
    from . import conv_utils
  File "/cm/shared/apps/keras-py37-cuda10.1-gcc/2.3.1/lib/python3.7/site-packages/keras/utils/", line 9, in <module>
    from .. import backend as K
  File "/cm/shared/apps/keras-py37-cuda10.1-gcc/2.3.1/lib/python3.7/site-packages/keras/backend/", line 1, in <module>
    from .load_backend import epsilon
  File "/cm/shared/apps/keras-py37-cuda10.1-gcc/2.3.1/lib/python3.7/site-packages/keras/backend/", line 90, in <module>
    from .tensorflow_backend import *
  File "/cm/shared/apps/keras-py37-cuda10.1-gcc/2.3.1/lib/python3.7/site-packages/keras/backend/", line 5, in <module>
    import tensorflow as tf
  File "/cm/shared/apps/tensorflow2-py37-cuda10.1-gcc/2.2.0/lib/python3.7/site-packages/tensorflow/", line 41, in <module>
    from import module_util as _module_util
  File "/cm/shared/apps/tensorflow2-py37-cuda10.1-gcc/2.2.0/lib/python3.7/site-packages/tensorflow/python/", line 64, in <module>
    from tensorflow.python.framework.framework_lib import *  # pylint: disable=redefined-builtin
  File "/cm/shared/apps/tensorflow2-py37-cuda10.1-gcc/2.2.0/lib/python3.7/site-packages/tensorflow/python/framework/", line 24, in <module>
    from tensorflow.python.framework.device import DeviceSpec
  File "/cm/shared/apps/tensorflow2-py37-cuda10.1-gcc/2.2.0/lib/python3.7/site-packages/tensorflow/python/framework/", line 24, in <module>
    from tensorflow.python.framework import device_spec
  File "/cm/shared/apps/tensorflow2-py37-cuda10.1-gcc/2.2.0/lib/python3.7/site-packages/tensorflow/python/framework/", line 21, in <module>
    from tensorflow.python.util.tf_export import tf_export
  File "/cm/shared/apps/tensorflow2-py37-cuda10.1-gcc/2.2.0/lib/python3.7/site-packages/tensorflow/python/util/", line 48, in <module>
    from tensorflow.python.util import tf_decorator
  File "/cm/shared/apps/tensorflow2-py37-cuda10.1-gcc/2.2.0/lib/python3.7/site-packages/tensorflow/python/util/", line 64, in <module>
    from tensorflow.python.util import tf_stack
  File "/cm/shared/apps/tensorflow2-py37-cuda10.1-gcc/2.2.0/lib/python3.7/site-packages/tensorflow/python/util/", line 28, in <module>
    from tensorflow.python import _tf_stack
ImportError: /cm/shared/apps/tensorflow2-py37-cuda10.1-gcc/2.2.0/lib/python3.7/site-packages/tensorflow/python/ undefined symbol: PyThread_tss_set

Is this a known issue with TF 2.2.0?
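One possibility, not confirmed in this thread: `PyThread_tss_set` belongs to the thread-specific storage C API that was added in Python 3.7, so this error is what you would expect if a TensorFlow build for Python 3.7 were imported by a Python 3.6 interpreter. The module log above loads both `python36` and `python37`. The sketch below only reports which interpreter actually runs the script; `interpreter_matches` is a hypothetical helper name.

```python
import sys


def interpreter_matches(build_tag, version_info=sys.version_info):
    """Return True if a tag such as 'py37' matches the interpreter's major.minor version."""
    running = "py{}{}".format(version_info[0], version_info[1])
    return build_tag == running


print("interpreter:", sys.executable)
print("version:", sys.version.split()[0])
# 'py37' is taken from the module name tensorflow2-py37-cuda10.1-gcc/2.2.0 in the log above.
print("matches py37 build:", interpreter_matches("py37"))
```

Running this at the top of the batch job, before any Keras or TensorFlow import, would show whether the job picks up the 3.6 or the 3.7 interpreter.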

Does it work with TF 2.6.0?

Yes, when I run this directly on a node that has Python 3.6 and TF 2.6, I get the expected results shown below. Is there a way to get the earlier TF/Keras to work with this?

done A stocks
Model: "sequential"
Layer (type)                 Output Shape              Param #
lstm (LSTM)                  (None, 64)                16896
dense (Dense)                (None, 1)                 65
Total params: 16,961
Trainable params: 16,961
Non-trainable params: 0
model_manager: running tensorflow version: 2.6.0
model_manager: will attempt to run on /gpu:0
Epoch 1/100
7/7 - 36s - loss: 38939.2383
Epoch 2/100
7/7 - 17s - loss: 38939.2383
Epoch 3/100

I don’t know, but generally our support policy for older versions covers patch releases only for security bugs.
So I suggest you use an updated version of TF.

Even with 2.6 I see this error:

  Loading requirement: hpcx/2.4.0
2021-08-20 14:23:09.943253: W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'li'; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/
2021-08-20 14:23:09.943288: I tensorflow/stream_executor/cuda/] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-08-20 14:24:41.582692: E tensorflow/stream_executor/cuda/] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-08-20 14:24:41.582920: I tensorflow/stream_executor/cuda/] retrieving CUDA diagnostic information for host: node001
2021-08-20 14:24:41.582935: I tensorflow/stream_executor/cuda/] hostname: node001
2021-08-20 14:24:41.583068: I tensorflow/stream_executor/cuda/] libcuda reported version is: 460.73.1
2021-08-20 14:24:41.583108: I tensorflow/stream_executor/cuda/] kernel reported version is: 460.73.1
2021-08-20 14:24:41.583115: I tensorflow/stream_executor/cuda/] kernel version seems to match DSO: 460.
2021-08-20 14:24:41.583609: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-20 14:24:41.871823: I tensorflow/compiler/mlir/] None of the MLIR Optimization Passes are enabled (registered 2)
WARNING: Logging before flag parsing goes to stderr.
W0820 14:24:42.032056 46912496384256] AutoGraph could not transform <function Model.make_train_function.<locals>.train_function at 0x2aab736d7f28> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

It is a problem with your env setup as TF doesn’t find CUDA libraries in your system paths:

W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'li'; dlerror: cannot open shared object file: No such file or directory
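A rough way to check this from inside the job is to look for the runtime library in the directories listed in `LD_LIBRARY_PATH`. This is a sketch under assumptions: the library name is truncated in the warning, so the pattern `libcudart.so*` is a guess based on the "cudart dlerror" wording of the next log line, and `find_libs` is a hypothetical helper name.

```python
import glob
import os


def find_libs(pattern, search_path=None):
    """Return files matching pattern in each directory of an os.pathsep-separated path."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    matches = []
    for directory in search_path.split(os.pathsep):
        if directory:
            matches.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return matches


# Assumed pattern: the warning above does not show the full library name.
print(find_libs("libcudart.so*"))
```

The version suffix of whatever this prints can then be compared with the CUDA version the installed TensorFlow build expects.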

Sorry I should’ve posted more of the logs. The CUDA diagnostic does appear to find CUDA. Just not the GPU.

2021-08-20 15:21:38.393015: E tensorflow/stream_executor/cuda/] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-08-20 15:21:38.393070: I tensorflow/stream_executor/cuda/] retrieving CUDA diagnostic information for host: node001
2021-08-20 15:21:38.393081: I tensorflow/stream_executor/cuda/] hostname: node001
2021-08-20 15:21:38.393208: I tensorflow/stream_executor/cuda/] libcuda reported version is: 460.73.1
2021-08-20 15:21:38.393248: I tensorflow/stream_executor/cuda/] kernel reported version is: 460.73.1
2021-08-20 15:21:38.393256: I tensorflow/stream_executor/cuda/] kernel version seems to match DSO: 460.
2021-08-20 15:29:06.834136: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-20 15:29:07.343075: I tensorflow/compiler/mlir/] None of the MLIR Optimization Passes are enabled (registered 2)
WARNING: Logging before flag parsing goes to stderr.
W0820 15:29:07.578475 46912496383040] AutoGraph could not transform <function Model.make_train_function.<locals>.train_function at 0x2aab74dc1840> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
done A stocks
Model: "sequential"
Layer (type)                 Output Shape              Param #
lstm (LSTM)                  (None, 64)                16896
dense (Dense)                (None, 1)                 65

Is that error, Cause: 'arguments' object has no attribute 'posonlyargs', just a red herring?

I see that CUDA has failed to initialize. Your environment is not in good shape.
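To separate a TensorFlow problem from a driver or node problem, the same `cuInit` call can be tried without TensorFlow at all. The sketch below loads the driver library with `ctypes` and calls `cuInit(0)` from the CUDA driver API. The library name `libcuda.so.1` and the helper name `cuda_init_status` are assumptions for illustration.

```python
import ctypes


def cuda_init_status(libname="libcuda.so.1"):
    """Call cuInit(0); return its CUresult code, or None if the library cannot be loaded."""
    try:
        libcuda = ctypes.CDLL(libname)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


status = cuda_init_status()
if status is None:
    print("libcuda.so.1 could not be loaded")
elif status == 0:
    print("cuInit succeeded")
else:
    # 999 is CUDA_ERROR_UNKNOWN in the driver API's CUresult enum.
    print("cuInit failed with CUresult", status)
```

If this also fails inside the Slurm job but succeeds in a login shell on the same node, the difference is likely in what the job is allowed to see rather than in TensorFlow.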

We had many CUDA setup issues in the repo like:

Well kind of. We use Bright Cluster with Slurm. So on our head node we use a “SBATCH” file (Slurm batch) that calls modules. TF 2.6 is not yet available in Bright’s packages. I used pip to install TF 2.6 on a node in Python 3. So now I exclude the call to the TF module in the SBATCH file and let Slurm auto-magically find that TF 2.6 I installed. It looks like we also needed CUDA 11 or greater. For now it’s running but without the GPU.
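Since the job behaves differently under Slurm than when run directly on the node, it may help to print what the batch job's environment actually contains. This is only a sketch: it assumes GPUs are handed out through Slurm's generic resource mechanism, which this thread does not confirm, and whether each variable is set depends on the Slurm version and cluster configuration. `job_environment` is a hypothetical helper name.

```python
import os

# Variables set by Slurm or read by CUDA and the dynamic loader.
NAMES = [
    "SLURM_JOB_ID",
    "SLURM_JOB_NODELIST",
    "SLURM_JOB_GPUS",
    "CUDA_VISIBLE_DEVICES",
    "LD_LIBRARY_PATH",
]


def job_environment(environ=os.environ):
    """Return the selected variables, with None for any that are unset."""
    return {name: environ.get(name) for name in NAMES}


for name, value in job_environment().items():
    print(name, "=", value)
```

An unset or empty `CUDA_VISIBLE_DEVICES` inside the job would suggest that no GPU was allocated to it, independent of which TensorFlow version is installed.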

