Hi again
Thanks to all of your help, I was able to build a Faster-RCNN model.
Everything works except the training step, where I have hit a wall.
I debugged the functions and found a suspicious part, but I can't work out the root cause.
First, the versions I am using are:
tensorflow-gpu==2.5.0
CUDA==11.2.0
cuDNN==8.1.0.77
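In case anyone wants to compare environments, a small sketch like the one below should print similar information from Python. `collect_versions` is just a helper name made up for this post. Note that `tf.sysconfig.get_build_info()` reports the CUDA/cuDNN versions the TensorFlow binary was built against, which are not necessarily the ones installed on the machine, and the exact format of those values may vary by platform.

```python
import importlib


def collect_versions():
    """Return a dict of version info; values stay None when unavailable."""
    info = {"tensorflow": None, "cuda": None, "cudnn": None}
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        # TensorFlow is not installed (or failed to import) in this environment.
        return info
    info["tensorflow"] = tf.__version__
    # get_build_info() describes the build, not the locally installed toolkit.
    if hasattr(tf.sysconfig, "get_build_info"):
        build = tf.sysconfig.get_build_info()
        info["cuda"] = build.get("cuda_version")
        info["cudnn"] = build.get("cudnn_version")
    return info


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```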
The whole stack trace is:
2021-06-17 16:46:58.163220: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-06-17 16:47:04.565562: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021-06-17 16:47:04.610494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2021-06-17 16:47:04.618787: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-06-17 16:47:04.634109: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-06-17 16:47:04.638019: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021-06-17 16:47:04.647980: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021-06-17 16:47:04.655237: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021-06-17 16:47:04.670023: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2021-06-17 16:47:04.679152: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2021-06-17 16:47:04.685911: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-06-17 16:47:04.690109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-06-17 16:47:04.693839: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-17 16:47:04.703798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2021-06-17 16:47:04.711944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-06-17 16:47:05.263458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-17 16:47:05.268143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-06-17 16:47:05.270931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-06-17 16:47:05.273788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3983 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
D:\dev\anaconda3\envs\dl_env\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py:3703: UserWarning: Even though the `tf.config.experimental_run_functions_eagerly` option is set, this option does not apply to tf.data functions. To force eager execution of tf.data functions, please use `tf.data.experimental.enable.debug_mode()`.
warnings.warn(
WARNING:tensorflow:`input_shape` is undefined or non-square, or `rows` is not in [96, 128, 160, 192, 224]. Weights for input shape (224, 224) will be loaded as the default.
2021-06-17 16:47:06.528692: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-06-17 16:47:07.040698: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-17 16:47:07.690211: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-06-17 16:47:08.195097: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
WARNING:tensorflow:From D:\dev\anaconda3\envs\dl_env\lib\site-packages\tensorflow\python\ops\array_ops.py:5043: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
2021-06-17 16:47:09.868833: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-06-17 16:47:09.872503: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-06-17 16:47:09.875622: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1611] Profiler found 1 GPUs
2021-06-17 16:47:09.886018: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti64_112.dll'; dlerror: cupti64_112.dll not found
2021-06-17 16:47:09.898400: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti.dll'; dlerror: cupti.dll not found
2021-06-17 16:47:09.902934: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1661] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-06-17 16:47:09.910278: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
2021-06-17 16:47:09.914039: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1752] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.
2021-06-17 16:47:09.953551: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/100
214/214 [==============================] - 60s 275ms/step - loss: 7.0047 - rpn_reg_loss: 0.0160 - rpn_cls_loss: 0.1079 - frcnn_reg_loss: 6.5493 - frcnn_cls_loss: 0.3315 - val_loss: 5.5020 - val_rpn_reg_loss: 0.0161 - val_rpn_cls_loss: 0.1088 - val_frcnn_reg_loss: 5.2288 - val_frcnn_cls_loss: 0.1484
And the error message is:
2021-06-17 16:48:09.849872: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
The suspicious code is:
@tf.function
def rpn_generator(dataset, anchors):
    while True:
        for data in dataset:
            image, gt_boxes, gt_labels = data
            bbox_deltas, bbox_labels = calculate_rpn_actual_outputs(anchors, gt_boxes, gt_labels)
            yield image, (bbox_deltas, bbox_labels)
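To show the pattern in isolation, here is a stripped-down sketch of the same generator in plain Python, without TensorFlow and without the decorator. `fake_rpn_outputs` is a hypothetical stand-in for `calculate_rpn_actual_outputs` (it just returns dummy values), and the toy `dataset` and `anchors` are made up for illustration. The point is only that the generator wraps the dataset in an endless loop, so it stops only when the consumer stops pulling from it.

```python
import itertools


def fake_rpn_outputs(anchors, gt_boxes, gt_labels):
    """Hypothetical stand-in for calculate_rpn_actual_outputs (dummy values)."""
    bbox_deltas = [0.0] * len(anchors)
    bbox_labels = [0] * len(anchors)
    return bbox_deltas, bbox_labels


def rpn_generator(dataset, anchors):
    # Loops over the dataset forever; the generator never finishes on its own.
    while True:
        for image, gt_boxes, gt_labels in dataset:
            bbox_deltas, bbox_labels = fake_rpn_outputs(anchors, gt_boxes, gt_labels)
            yield image, (bbox_deltas, bbox_labels)


# Toy data: two "images" and three identical anchors.
dataset = [("img0", [], []), ("img1", [], [])]
anchors = [(0, 0, 1, 1)] * 3

# The consumer decides when to stop, here after 5 items.
batches = list(itertools.islice(rpn_generator(dataset, anchors), 5))
print([image for image, _ in batches])
# prints ['img0', 'img1', 'img0', 'img1', 'img0']
```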
Lastly, I based my code on https://github.com/FurkanOM/tf-faster-rcnn
That repository seems to have the same problem, because when I run its code I still get the same error.