Hi, I’m trying to run the tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan example that came with my install, on Rocky Linux with Python 3.9.16. A TensorFlow Hello World works fine, but the DCGAN example fails with these errors:
[cht@node001 dcgan]$ python dcgan.py --epochs 5
/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.8) or chardet (5.1.0)/charset_normalizer (2.0.10) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
2023-03-24 19:56:07.994504: I tensorflow/core/platform/cloud/google_auth_provider.cc:180] Attempting an empty bearer token since no token was retrieved from files, and GCE metadata check was skipped.
I0324 19:56:08.062144 23456247932736 dataset_builder.py:400] Generating dataset mnist (/home/cht/tensorflow_datasets/mnist/3.0.1)
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /home/cht/tensorflow_datasets/mnist/3.0.1...
2023-03-24 19:56:08.236676: I tensorflow/core/platform/cloud/google_auth_provider.cc:180] Attempting an empty bearer token since no token was retrieved from files, and GCE metadata check was skipped.
Dl Completed...: 0 url [00:00, ? url/s] I0324 19:56:08.302374 23456247932736 download_manager.py:354] Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz into /home/cht/tensorflow_datasets/downloads/cvdf-datasets_mnist_t10k-images-idx3-ubytedDnaEPiC58ZczHNOp6ks9L4_JLids_rpvUj38kJNGMc.gz.tmp.7ffaeb3008d44174b0e8dd5996132142...
Dl Completed...: 0%| I0324 19:56:08.306265 23456247932736 download_manager.py:354] Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-labels-idx1-ubyte.gz into /home/cht/tensorflow_datasets/downloads/cvdf-datasets_mnist_t10k-labels-idx1-ubyte4Mqf5UL1fRrpd5pIeeAh8c8ZzsY2gbIPBuKwiyfSD_I.gz.tmp.f0ea75ec438e4c3bb1454b1eea56d872...
Dl Completed...: 0%| I0324 19:56:08.309630 23456247932736 download_manager.py:354] Downloading https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz into /home/cht/tensorflow_datasets/downloads/cvdf-datasets_mnist_train-images-idx3-ubyteJAsxAi0QnOBEygBw_XW2X7zp-LBZAIqqYSHN8ru4ZO4.gz.tmp.39472c44a90b4f05bf9f8db9d53c140a...
Dl Completed...: 0%| I0324 19:56:08.312614 23456247932736 download_manager.py:354] Downloading https://storage.googleapis.com/cvdf-datasets/mnist/train-labels-idx1-ubyte.gz into /home/cht/tensorflow_datasets/downloads/cvdf-datasets_mnist_train-labels-idx1-ubytedcDWkl3FO9T-WMEH1f1Xt51eIRmePRIMAk6X147Qw8w.gz.tmp.8c24939cb1994f91b7eb0199ac9cead8...
Extraction completed...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.01 file/s]
Dl Size...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 5.02 MiB/s]
Dl Completed...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.01 url/s]
Generating splits...: 0%| 2023-03-24 19:56:11.052947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1631 MB memory: -> device: 0, name: Quadro P600, pci bus id: 0000:07:00.0, compute capability: 6.1
I0324 19:56:27.018659 23456247932736 tfrecords_writer.py:327] Done writing mnist-train.tfrecord. Number of examples: 60000 (shards: [60000])
Generating splits...: 50%|█████████████████████████████████████████████████████████████████████ | 1/2 [00:16<00:16, 16.73s/ splitsI0324 19:56:29.740445 23456247932736 tfrecords_writer.py:327] Done writing mnist-test.tfrecord. Number of examples: 10000 (shards: [10000])
Dataset mnist downloaded and prepared to /home/cht/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
I0324 19:56:29.743749 23456247932736 logging_logger.py:35] Constructing tf.data.Dataset mnist for split train, from /home/cht/tensorflow_datasets/mnist/3.0.1
Training ...
2023-03-24 19:56:31.056080: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8101
2023-03-24 19:56:31.547750: E tensorflow/stream_executor/dnn.cc:764] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(4706): 'cudnnBatchNormalizationForwardTrainingEx( cudnn.handle(), mode, bn_ops, &one, &zero, x_descriptor.handle(), x.opaque(), x_descriptor.handle(), side_input.opaque(), x_descriptor.handle(), y->opaque(), scale_offset_descriptor.handle(), scale.opaque(), offset.opaque(), exponential_average_factor, batch_mean_opaque, batch_var_opaque, epsilon, saved_mean->opaque(), saved_inv_var->opaque(), activation_desc.handle(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2023-03-24 19:56:31.551308: I tensorflow/stream_executor/stream.cc:4442] [stream=0x16dfedf0,impl=0x5d00110] INTERNAL: stream did not block host until done; was already in an error state
2023-03-24 19:56:31.551325: W tensorflow/core/kernels/gpu_utils.cc:69] Failed to check cudnn convolutions for out-of-bounds reads and writes with an error message: 'stream did not block host until done; was already in an error state'; skipping this check. This only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
2023-03-24 19:56:31.551345: I tensorflow/stream_executor/stream.cc:4442] [stream=0x16dfedf0,impl=0x5d00110] INTERNAL: stream did not block host until done; was already in an error state
2023-03-24 19:56:31.556861: I tensorflow/stream_executor/stream.cc:4442] [stream=0x16dfedf0,impl=0x5d00110] INTERNAL: stream did not block host until done; was already in an error state
2023-03-24 19:56:31.556886: I tensorflow/stream_executor/stream.cc:4442] [stream=0x16dfedf0,impl=0x5d00110] INTERNAL: stream did not block host until done; was already in an error state
2023-03-24 19:56:32.320397: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
Traceback (most recent call last):
File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 225, in <module>
app.run(run_main)
File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 213, in run_main
main(**kwargs)
File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 222, in main
return dcgan_obj.train(train_dataset, checkpoint_pr)
File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 194, in train
gen_loss, disc_loss = self.train_step(image)
File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape ([64,128,7,7])
[[node sequential/batch_normalization_1/FusedBatchNormV3
(defined at /cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/layers/normalization/batch_normalization.py:589)
]] [Op:__inference_train_step_142280]
Errors may have originated from an input operation.
Input Source operations connected to node sequential/batch_normalization_1/FusedBatchNormV3:
In[0] sequential/conv2d_transpose/conv2d_transpose (defined at /cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/backend.py:5530)
In[1] sequential/batch_normalization_1/ReadVariableOp:
In[2] sequential/batch_normalization_1/ReadVariableOp_1:
In[3] sequential/batch_normalization_1/FusedBatchNormV3/ReadVariableOp:
In[4] sequential/batch_normalization_1/FusedBatchNormV3/ReadVariableOp_1:
Operation defined at: (most recent call last)
>>> File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 225, in <module>
>>> app.run(run_main)
>>>
>>> File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/absl/app.py", line 312, in run
>>> _run_main(main, args)
>>>
>>> File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
>>> sys.exit(main(argv))
>>>
>>> File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 213, in run_main
>>> main(**kwargs)
>>>
>>> File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 222, in main
>>> return dcgan_obj.train(train_dataset, checkpoint_pr)
>>>
>>> File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 194, in train
>>> gen_loss, disc_loss = self.train_step(image)
>>>
>>> File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 157, in train_step
>>> generated_images = self.generator(noise, training=True)
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
>>> return fn(*args, **kwargs)
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1083, in __call__
>>> outputs = call_fn(inputs, *args, **kwargs)
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
>>> return fn(*args, **kwargs)
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/engine/sequential.py", line 373, in call
>>> return super(Sequential, self).call(inputs, training=training, mask=mask)
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/engine/functional.py", line 451, in call
>>> return self._run_internal_graph(
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/engine/functional.py", line 589, in _run_internal_graph
>>> outputs = node.layer(*args, **kwargs)
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
>>> return fn(*args, **kwargs)
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1083, in __call__
>>> outputs = call_fn(inputs, *args, **kwargs)
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
>>> return fn(*args, **kwargs)
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/layers/normalization/batch_normalization.py", line 767, in call
>>> outputs = self._fused_batch_norm(inputs, training=training)
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/layers/normalization/batch_normalization.py", line 623, in _fused_batch_norm
>>> output, mean, variance = control_flow_util.smart_cond(
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/utils/control_flow_util.py", line 105, in smart_cond
>>> return tf.__internal__.smart_cond.smart_cond(
>>>
>>> File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/layers/normalization/batch_normalization.py", line 589, in _fused_batch_norm_training
>>> return tf.compat.v1.nn.fused_batch_norm(