Model.fit with use_multiprocessing=True: program does not terminate

I am using model.fit with use_multiprocessing=True and workers > 1.

I am finding that even after the program appears to have completed, the parent process is still holding GPU memory (as shown by nvidia-smi). This causes the job submitted through SLURM to never finish, since SLURM sees the process as still running.

I do not see any errors in the logs.

I have tried calling the clear_session() function at the end of the script, but it didn't help.
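For context, here is a rough, self-contained sketch of the kind of setup described above (a TF 2.x-style fit with a Sequence loader). The toy model, data shapes, and batch counts are invented for illustration; only the workers / use_multiprocessing arguments and the final clear_session() call mirror what happens in the real script:

```python
import numpy as np
import tensorflow as tf

# Hypothetical Sequence-based loader; the real fine_tuning module's
# data pipeline is not shown in this issue.
class RandomPatches(tf.keras.utils.Sequence):
    def __init__(self, n_batches=32, batch_size=8):
        self.n_batches = n_batches
        self.batch_size = batch_size

    def __len__(self):
        return self.n_batches

    def __getitem__(self, idx):
        x = np.random.rand(self.batch_size, 64, 64, 1).astype("float32")
        return x, x  # denoising-style target: reconstruct the input

# Toy stand-in for the actual denoising model.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu",
                           input_shape=(64, 64, 1)),
    tf.keras.layers.Conv2D(1, 3, padding="same"),
])
model.compile(optimizer="adam", loss="mse")

# fit() spawns worker processes for the Sequence; after it returns,
# the defunct python3 children and the lingering GPU allocation appear.
model.fit(
    RandomPatches(),
    epochs=2,
    workers=8,
    use_multiprocessing=True,
)

tf.keras.backend.clear_session()  # tried at the end of the script; did not help
```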

Output of ps aufx:

root     27676  0.0  0.0 401644  4012 ?        Sl   Aug28   0:05 slurmstepd: [10938928.batch]
svc_oph+ 27687  0.0  0.0 126248  1932 ?        S    Aug28   0:00  \_ /bin/bash /var/spool/slurmd/job10938928/slurm_script
svc_oph+ 27725  0.4  0.0 825060 12192 ?        Sl   Aug28   2:34      \_ Apptainer runtime parent
svc_oph+ 29573 73.8  8.2 53064832 21684124 ?   Sl   Aug28 447:35          \_ /usr/bin/python3 -m ophys_etl.modules.denoising.fine_tuning --input_json /allen/programs/mindscope/production/informatics/ophys_processing/specimen_1281521
svc_oph+   597  1.4  0.0      0     0 ?        Z    04:26   0:30              \_ [python3] <defunct>
svc_oph+   600  1.0  0.0      0     0 ?        Z    04:26   0:21              \_ [python3] <defunct>
svc_oph+   605  1.1  0.0      0     0 ?        Z    04:26   0:23              \_ [python3] <defunct>
svc_oph+   608  1.0  0.0      0     0 ?        Z    04:26   0:22              \_ [python3] <defunct>
svc_oph+   613  1.0  0.0      0     0 ?        Z    04:26   0:22              \_ [python3] <defunct>
svc_oph+   637  1.1  0.0      0     0 ?        Z    04:26   0:22              \_ [python3] <defunct>
svc_oph+   639  1.4  0.0      0     0 ?        Z    04:26   0:28              \_ [python3] <defunct>
svc_oph+   645  1.3  0.0      0     0 ?        Z    04:26   0:26              \_ [python3] <defunct>
svc_oph+   647  1.0  0.0      0     0 ?        Z    04:26   0:21              \_ [python3] <defunct>
svc_oph+   921  1.0  0.0      0     0 ?        Z    04:26   0:21              \_ [python3] <defunct>
svc_oph+   923  1.1  0.0      0     0 ?        Z    04:26   0:23              \_ [python3] <defunct>
svc_oph+  1022  1.1  0.0      0     0 ?        Z    04:26   0:23              \_ [python3] <defunct>
svc_oph+  1026  1.1  0.0      0     0 ?        Z    04:26   0:22              \_ [python3] <defunct>
svc_oph+  1060  1.4  0.0      0     0 ?        Z    04:26   0:28              \_ [python3] <defunct>
svc_oph+  1062  1.1  0.0      0     0 ?        Z    04:26   0:23              \_ [python3] <defunct>
Output of nvidia-smi:
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     29573      C   /usr/bin/python3                          10910MiB |

Hi @cosmosaa,

Can you try tf.keras.backend.clear_session() instead of clear_session()? This is a more recent function that is specifically designed to release GPU memory. If the above call does not solve the issue, you can also refer to this thread on GitHub for a workaround.
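For example, at the end of the training script (a minimal sketch; the gc.collect() call is an extra suggestion of mine, not something taken from the linked thread):

```python
import gc
import tensorflow as tf

# ...train the model as before, e.g.:
# history = model.fit(data, workers=8, use_multiprocessing=True)

# Suggested cleanup at the end of the script: clear the global Keras
# state and run a garbage-collection pass before the process exits.
tf.keras.backend.clear_session()
gc.collect()
```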

Please let me know if it helps.

Thanks.