Error occurred when finalizing GeneratorDataset iterator

human · June 17, 2021, 8:31am

Hi again

Thanks to all of your help, I was able to build a Faster-RCNN model.
Everything works except the training step, where I have hit a wall.
I debugged the functions and found a suspicious part, but I can’t work out the root cause.

First, the versions are:

tensorflow-gpu==2.5.0
CUDA==11.2.0
cuDNN==8.1.0.77

The whole stack trace is:

2021-06-17 16:46:58.163220: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-06-17 16:47:04.565562: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021-06-17 16:47:04.610494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2021-06-17 16:47:04.618787: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-06-17 16:47:04.634109: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-06-17 16:47:04.638019: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021-06-17 16:47:04.647980: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021-06-17 16:47:04.655237: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021-06-17 16:47:04.670023: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2021-06-17 16:47:04.679152: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2021-06-17 16:47:04.685911: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-06-17 16:47:04.690109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-06-17 16:47:04.693839: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-17 16:47:04.703798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2021-06-17 16:47:04.711944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-06-17 16:47:05.263458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-17 16:47:05.268143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-06-17 16:47:05.270931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-06-17 16:47:05.273788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3983 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
D:\dev\anaconda3\envs\dl_env\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py:3703: UserWarning: Even though the tf.config.experimental_run_functions_eagerly option is set, this option does not apply to tf.data functions. To force eager execution of tf.data functions, please use tf.data.experimental.enable.debug_mode().
warnings.warn(
WARNING:tensorflow:input_shape is undefined or non-square, or rows is not in [96, 128, 160, 192, 224]. Weights for input shape (224, 224) will be loaded as the default.
2021-06-17 16:47:06.528692: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-06-17 16:47:07.040698: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-17 16:47:07.690211: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-06-17 16:47:08.195097: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
WARNING:tensorflow:From D:\dev\anaconda3\envs\dl_env\lib\site-packages\tensorflow\python\ops\array_ops.py:5043: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The validate_indices argument has no effect. Indices are always validated on CPU and never validated on GPU.
2021-06-17 16:47:09.868833: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-06-17 16:47:09.872503: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-06-17 16:47:09.875622: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1611] Profiler found 1 GPUs
2021-06-17 16:47:09.886018: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti64_112.dll'; dlerror: cupti64_112.dll not found
2021-06-17 16:47:09.898400: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti.dll'; dlerror: cupti.dll not found
2021-06-17 16:47:09.902934: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1661] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-06-17 16:47:09.910278: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
2021-06-17 16:47:09.914039: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1752] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.
2021-06-17 16:47:09.953551: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/100
214/214 [==============================] - 60s 275ms/step - loss: 7.0047 - rpn_reg_loss: 0.0160 - rpn_cls_loss: 0.1079 - frcnn_reg_loss: 6.5493 - frcnn_cls_loss: 0.3315 - val_loss: 5.5020 - val_rpn_reg_loss: 0.0161 - val_rpn_cls_loss: 0.1088 - val_frcnn_reg_loss: 5.2288 - val_frcnn_cls_loss: 0.1484

And the error message is:

2021-06-17 16:48:09.849872: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]

Suspicious code is:

@tf.function
def rpn_generator(dataset, anchors):
    while True:
        for data in dataset:
            image, gt_boxes, gt_labels = data
            bbox_deltas, bbox_labels = calculate_rpn_actual_outputs(anchors, gt_boxes, gt_labels)
            yield image, (bbox_deltas, bbox_labels)

Lastly, I based my code on https://github.com/FurkanOM/tf-faster-rcnn
That repository seems to have the same problem, because I still get the error when I run its code.
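
For anyone comparing this with their own pipeline, here is a minimal pure-Python sketch of the generator pattern in the post above, with TensorFlow left out. `fake_rpn_targets` is a hypothetical stand-in for `calculate_rpn_actual_outputs`, the tuples in `dataset` are placeholder data, and the `finalized` list only exists to make the cleanup visible. It shows that a `while True` generator never finishes on its own: the consumer decides how many steps to draw, and the generator is finalized only when it is closed or garbage collected. The warning text says the finalizer ran when the interpreter state was not initialized, which suggests (an assumption, not confirmed in this thread) that the cleanup happened while the process was already shutting down.

```python
from itertools import islice

finalized = []  # records that the generator's cleanup code ran


def fake_rpn_targets(anchors, gt_boxes, gt_labels):
    # Hypothetical stand-in for calculate_rpn_actual_outputs:
    # passes the boxes through and turns labels into 0/1 flags.
    bbox_deltas = list(gt_boxes)
    bbox_labels = [1 if label > 0 else 0 for label in gt_labels]
    return bbox_deltas, bbox_labels


def rpn_generator(dataset, anchors):
    try:
        while True:  # endless: loops over the dataset again and again
            for image, gt_boxes, gt_labels in dataset:
                bbox_deltas, bbox_labels = fake_rpn_targets(anchors, gt_boxes, gt_labels)
                yield image, (bbox_deltas, bbox_labels)
    finally:
        # Runs when the generator is closed or garbage collected.
        finalized.append(True)


# Placeholder data: (image, gt_boxes, gt_labels)
dataset = [("img0", [(0, 0, 1, 1)], [1]), ("img1", [(0, 0, 2, 2)], [0])]

gen = rpn_generator(dataset, anchors=None)
steps_per_epoch = 5
batches = list(islice(gen, steps_per_epoch))  # the consumer bounds the loop
gen.close()  # explicit finalization while the interpreter is still alive
```

Closing the generator explicitly, as in the last line, is only an illustration of when the `finally` block runs. Whether an equivalent explicit cleanup is possible for the TensorFlow iterator is not something this thread establishes.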

human · June 21, 2021, 1:58am

I tried downgrading TensorFlow to 2.4.0, and the error still occurred.
The strange thing is that the epoch at which training stops is not consistent between runs.
For example, on the first run training stopped at epoch 2, and on the next run it stopped at epoch 13.

I suspected my GPU (GTX 1660 Ti) was running out of memory, but the code only uses about 55% of GPU memory.

human · June 21, 2021, 7:04am

I found it.
I passed a wrong parameter to the tf.keras.callbacks.ReduceLROnPlateau callback given to model.fit(), which is why training stopped at the end of an epoch.
Thanks again, everyone.
But I still don’t know why the referenced code caused the error... mysterious...
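
The post does not say which parameter was wrong, so the following is only a simplified pure-Python stand-in for a reduce-on-plateau schedule, not the Keras implementation. The class name `PlateauSketch` and the loss values are made up for illustration; the argument names `factor`, `patience` and `min_lr` mirror those of tf.keras.callbacks.ReduceLROnPlateau. One behaviour worth knowing is that the Keras callback rejects a `factor` of 1.0 or more with a ValueError, and an exception raised from a callback would end training; the sketch mimics that check.

```python
class PlateauSketch:
    """Simplified reduce-on-plateau logic (illustrative, not the Keras code)."""

    def __init__(self, lr, factor=0.1, patience=2, min_lr=0.0):
        if factor >= 1.0:
            # A factor >= 1.0 would never decrease the learning rate.
            raise ValueError("factor must be < 1.0")
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def on_epoch_end(self, val_loss):
        if val_loss < self.best:
            # Improvement: remember it and reset the patience counter.
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # No improvement for `patience` epochs: shrink the rate.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


sched = PlateauSketch(lr=1e-3, factor=0.5, patience=2, min_lr=1e-5)
val_losses = [5.5, 5.4, 5.4, 5.4, 5.4, 5.4]  # made-up values
lrs = [sched.on_epoch_end(loss) for loss in val_losses]

try:
    PlateauSketch(lr=1e-3, factor=1.5)
    rejected = False
except ValueError:
    rejected = True
```

With these made-up losses the rate is halved after the second and again after the fourth epoch without improvement, and the invalid `factor` is rejected at construction time.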

NADIATI_SALSABILLA · July 3, 2021, 3:27am

Where is the code? Could you give more detail?

mateuszusd · November 18, 2022, 7:02am

I have the same problem. The difference is that my code reports this error as soon as it runs. How can I solve it? Could you give more detail?
