I am using model.fit with use_multiprocessing=True and workers > 1.
Even after the program appears to have completed, the parent process still holds GPU memory (as shown by nvidia-smi). As a result, the job submitted through SLURM never terminates, since SLURM considers the process to still be running.
I do not see any errors in the logs.
I have tried calling clear_session() at the end of the script, but it didn't help.
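One way to narrow this down is to log what is still attached to the parent just before the script ends. The following is a minimal diagnostic sketch, not a fix. The helper name report_leftovers is hypothetical, and it assumes the workers were started through Python's multiprocessing module; children started any other way will not show up. It uses only the standard library.

```python
import multiprocessing
import threading


def report_leftovers():
    """Print and return child processes and threads still tied to this process.

    Hypothetical helper for illustration; call it as the last statement
    of the training script.
    """
    # active_children() lists live children created via multiprocessing.
    # As a side effect it joins children that have already finished.
    children = multiprocessing.active_children()

    # Non-daemon threads other than the main thread keep the interpreter
    # from exiting until they finish.
    threads = [
        t for t in threading.enumerate()
        if t is not threading.main_thread() and not t.daemon
    ]

    for child in children:
        print(f"child pid={child.pid} name={child.name} alive={child.is_alive()}")
    for thread in threads:
        print(f"thread name={thread.name} alive={thread.is_alive()}")

    return children, threads
```

If this prints any non-daemon threads or live children after model.fit has returned, they are candidates for what is keeping the parent process (and its GPU allocation) alive.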
Output of ps aufx:
root 27676 0.0 0.0 401644 4012 ? Sl Aug28 0:05 slurmstepd: [10938928.batch]
svc_oph+ 27687 0.0 0.0 126248 1932 ? S Aug28 0:00 \_ /bin/bash /var/spool/slurmd/job10938928/slurm_script
svc_oph+ 27725 0.4 0.0 825060 12192 ? Sl Aug28 2:34 \_ Apptainer runtime parent
svc_oph+ 29573 73.8 8.2 53064832 21684124 ? Sl Aug28 447:35 \_ /usr/bin/python3 -m ophys_etl.modules.denoising.fine_tuning --input_json /allen/programs/mindscope/production/informatics/ophys_processing/specimen_1281521
svc_oph+ 597 1.4 0.0 0 0 ? Z 04:26 0:30 \_ [python3] <defunct>
svc_oph+ 600 1.0 0.0 0 0 ? Z 04:26 0:21 \_ [python3] <defunct>
svc_oph+ 605 1.1 0.0 0 0 ? Z 04:26 0:23 \_ [python3] <defunct>
svc_oph+ 608 1.0 0.0 0 0 ? Z 04:26 0:22 \_ [python3] <defunct>
svc_oph+ 613 1.0 0.0 0 0 ? Z 04:26 0:22 \_ [python3] <defunct>
svc_oph+ 637 1.1 0.0 0 0 ? Z 04:26 0:22 \_ [python3] <defunct>
svc_oph+ 639 1.4 0.0 0 0 ? Z 04:26 0:28 \_ [python3] <defunct>
svc_oph+ 645 1.3 0.0 0 0 ? Z 04:26 0:26 \_ [python3] <defunct>
svc_oph+ 647 1.0 0.0 0 0 ? Z 04:26 0:21 \_ [python3] <defunct>
svc_oph+ 921 1.0 0.0 0 0 ? Z 04:26 0:21 \_ [python3] <defunct>
svc_oph+ 923 1.1 0.0 0 0 ? Z 04:26 0:23 \_ [python3] <defunct>
svc_oph+ 1022 1.1 0.0 0 0 ? Z 04:26 0:23 \_ [python3] <defunct>
svc_oph+ 1026 1.1 0.0 0 0 ? Z 04:26 0:22 \_ [python3] <defunct>
svc_oph+ 1060 1.4 0.0 0 0 ? Z 04:26 0:28 \_ [python3] <defunct>
svc_oph+ 1062 1.1 0.0 0 0 ? Z 04:26 0:23 \_ [python3] <defunct>
Output of nvidia-smi:
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     29573      C   /usr/bin/python3                          10910MiB |