First, some details on my software/hardware:
TF 2.5.0
TFLite Model Maker 0.3.2
GPU RTX 3080 16GB
32GB of DRAM
Ubuntu 18.04 LTS
NVIDIA driver version: 471.68, CUDA version: 11.4
I have been trying to train a custom model using the “efficientdet_lite1” spec, but training always hangs at a random point partway through the
model = object_detector.create() call,
and not always in the same epoch:
e.g.
109/176 [=================>…] - ETA: 46s - det_loss: 0.4381 - cls_loss: 0.2731…
I tried to debug by adding
tf.get_logger().setLevel('DEBUG')
but nothing is printed when the hang happens.
Any hints would be welcome.
I have also tried TF 2.6.0, nightly builds, etc.
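Since the TF logger prints nothing at the hang, one thing I could still try is Python's standard faulthandler module, which can dump the Python stack of every thread periodically or on a signal. Below is a minimal sketch of that idea, not a confirmed fix. The helper names install_stack_dumper and cancel_stack_dumper and the 300-second interval are my own placeholders, not part of TF or Model Maker.

```python
import faulthandler
import signal
import sys


def install_stack_dumper(interval_s=300.0, log_file=sys.stderr):
    """Dump the Python stack of every thread to log_file every interval_s seconds.

    log_file must be a real file object (with a file descriptor) and must stay
    open for as long as the dumper is armed.
    """
    # On Linux, `kill -USR1 <pid>` additionally dumps the stacks on demand.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
    # repeat=True re-arms the timer after each dump, so a long hang keeps
    # producing output instead of a single trace.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=log_file)


def cancel_stack_dumper():
    """Stop the periodic dumps and remove the SIGUSR1 handler."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
```

The idea would be to call install_stack_dumper() just before model = object_detector.create() and cancel_stack_dumper() after it returns. The dumps written after the progress bar stops moving should show which Python call each thread is sitting in. Only Python-level frames are shown, so if the hang is inside native TF or CUDA code, the trace will just point at the Python call that entered it.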