While training a model with TF 2.11, training stops periodically without any stack trace; only the word ‘Killed’ is written to the console. On a Tesla T4 GPU with 16 GB of GPU memory, training stops at epoch 8, while on an A10G GPU with 24 GB of GPU memory it stops at epoch 11. If training is then resumed from epoch 11, it stops again at epoch 21, and again at epoch 31. Has anyone else observed similar behavior, or any other memory-leak-related issues, with this version of TF or Keras?
Following are the package versions:
tensorflow 2.11.0 cuda112py39h01bd6f0_0 conda-forge
tensorflow-base 2.11.0 cuda112py39haa5674d_0 conda-forge
tensorflow-estimator 2.11.0 cuda112py39h11d7a3b_0 conda-forge
tensorflow-gpu 2.11.0 cuda112py39h0bbbad9_0 conda-forge
tensorflow-io 0.31.0 pypi_0 pypi
keras 2.11.0 pyhd8ed1ab_0 conda-forge
keras-preprocessing 1.1.2 pyhd8ed1ab_0 conda-forge
Any other ideas to root cause this issue are also welcome.
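For context, a bare ‘Killed’ message with no stack trace is usually the Linux OOM killer terminating the process because host RAM (not GPU memory) ran out, so logging process memory per epoch can help confirm a leak. Below is a minimal sketch, assuming a standard Keras Model.fit loop; the MemoryLogger callback and the psutil dependency are illustrative additions, not part of the original setup:

```python
# Hypothetical sketch: log the training process's resident memory after each epoch.
import os
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    """Prints the resident set size (RSS) of this process at the end of every epoch."""
    def on_epoch_end(self, epoch, logs=None):
        rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
        print(f"epoch {epoch}: host RSS = {rss_gb:.2f} GB")

# Usage (assumes `model` and `train_ds` already exist):
# model.fit(train_ds, epochs=30, callbacks=[MemoryLogger()])
```

If the logged RSS grows steadily from epoch to epoch, that points at a host-side leak (e.g. in the input pipeline or callbacks) rather than GPU memory exhaustion.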
Hi @shankar_B
Welcome to the TensorFlow Forum!
How did you install TensorFlow on your system? Please refer to the official TF install guide for your OS. Also, please verify that the CUDA, cuDNN, and Python versions installed on your system are compatible with the installed TensorFlow version by checking the TF tested build configurations.
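For example, you can print the CUDA/cuDNN versions your TensorFlow build was compiled against and confirm the GPU is visible. This is a small sketch using standard TF 2.x APIs; the exact values printed depend on your environment:

```python
import tensorflow as tf

print("TF version:", tf.__version__)
build = tf.sysconfig.get_build_info()           # dict of build metadata
print("Built against CUDA:", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```

Comparing these values with the versions installed on the machine (and with the tested build configuration table) will quickly show any mismatch.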
Let us know if the issue still persists. Thank you.