TFLite training silently stops at epoch 17/50 and disconnects from Google Colab

Hello everybody!

I am trying to train the object detection model “efficientdet_lite0” on my custom data, and when I run the

“model = object_detector.create(train_data, model_spec=spec, epochs=50, batch_size=8, train_whole_model=True, validation_data=validation_data)”

command, taken essentially unchanged from the tutorial “Google Colab”,
Google Colaboratory silently stops execution and disconnects at around epoch 17/50. When I run the same script on a small amount of data, training finishes successfully! I use a GPU for training, and the full training set is around 8000 images.

  1. Maybe the session is running out of resources? I can’t see resource usage during training.
  2. Is there any way to split training into smaller runs of a few epochs each and train one run after another?
  3. Is there any way to export and re-import an intermediate trained model so that training can continue across such runs?

Any help would be appreciated!

@Sergey_Davydov,

Welcome to the TensorFlow Forum!

There is a high chance that the session is running out of memory or other resources. You can use the nvidia-smi command to check this.
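Since a Colab cell that is busy training cannot run a second command, one option is to poll nvidia-smi from a background thread that is started before training. The following is only a minimal sketch under some assumptions: the helper names (parse_memory_line, log_gpu_memory, start_gpu_logger) are made up for illustration, nvidia-smi is assumed to be on the PATH, and output printed from a background thread is assumed to show up in the notebook cell.

```python
import subprocess
import threading

# Standard nvidia-smi query flags; prints one "used, total" line per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Turn a line such as '1234, 15109' into (used_mib, total_mib)."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def log_gpu_memory(stop_event, interval_s=30.0):
    """Print GPU memory usage every interval_s seconds until stop_event is set."""
    while not stop_event.is_set():
        try:
            result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        except (OSError, subprocess.CalledProcessError) as exc:
            print(f"nvidia-smi is not available: {exc}", flush=True)
            return
        for index, line in enumerate(result.stdout.strip().splitlines()):
            try:
                used, total = parse_memory_line(line)
            except ValueError:
                # Some GPUs report "[N/A]"; show the raw line instead of failing.
                print(f"GPU {index}: {line}", flush=True)
                continue
            print(f"GPU {index}: {used} / {total} MiB used", flush=True)
        # Sleeps for interval_s, but wakes up immediately when stop_event is set.
        stop_event.wait(interval_s)


def start_gpu_logger(interval_s=30.0):
    """Start the logger in a daemon thread and return the event that stops it."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=log_gpu_memory, args=(stop_event, interval_s), daemon=True
    )
    thread.start()
    return stop_event


# Hypothetical usage around the training call from the question:
#   stop = start_gpu_logger(interval_s=30.0)
#   model = object_detector.create(train_data, model_spec=spec, ...)
#   stop.set()
```

If the "used" figure keeps climbing towards the "total" figure before the disconnect, that would point to the GPU running out of memory.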

You can try reducing the image size, if that works for your case.
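One thing to keep in mind when shrinking images for object detection: if the annotations store bounding boxes in absolute pixel coordinates (as Pascal VOC style XML does), the boxes have to be scaled by the same factor as the images. Below is a rough sketch of that arithmetic; the function names are hypothetical, and it assumes boxes are given as (xmin, ymin, xmax, ymax) in pixels. Boxes stored as normalized 0..1 values would not need this step.

```python
def scale_box(box, old_size, new_size):
    """Scale a pixel-coordinate box from an old image size to a new one.

    box is (xmin, ymin, xmax, ymax); old_size and new_size are (width, height).
    """
    old_w, old_h = old_size
    new_w, new_h = new_size
    sx = new_w / old_w  # horizontal scale factor
    sy = new_h / old_h  # vertical scale factor
    xmin, ymin, xmax, ymax = box
    return (round(xmin * sx), round(ymin * sy), round(xmax * sx), round(ymax * sy))


def pixel_ratio(old_size, new_size):
    """Return how many pixels the resized image has relative to the original."""
    return (new_size[0] * new_size[1]) / (old_size[0] * old_size[1])


# Halving both sides of a 1920x1080 image leaves a quarter of the pixels.
print(pixel_ratio((1920, 1080), (960, 540)))
print(scale_box((100, 200, 300, 400), (1920, 1080), (960, 540)))
```

Halving both sides cuts the pixel data per image to a quarter; how much memory that actually saves depends on the rest of the input pipeline.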

Thank you!

Thank you very much, Chunduriv! I reduced the image size and it worked for me! The “nvidia-smi” command was not very helpful for checking video memory usage, because I couldn’t see its output during training. Instead, I clicked “view resources”, which showed resource usage in real time. I was out of video memory.