Hi everyone,
I am investigating a problem that seems to be caused by a mismatch between TensorFlow and the cudatoolkit package installed via conda. I was trying to run the code from the official SimCLR repository. I installed TensorFlow in a conda environment created following the official documentation. When I tried to run training as described here in the pretraining section, using a single-GPU configuration, I got a JIT compilation error and a failure about libdevice not being found, as shown below:
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
Traceback (most recent call last):
File "/home/matheus/development/simclr/tf2/run.py", line 671, in <module>
app.run(main)
File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/matheus/development/simclr/tf2/run.py", line 647, in main
train_multiple_steps(iterator)
File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
Detected at node 'mod' defined at (most recent call last):
File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/threading.py", line 930, in _bootstrap
self._bootstrap_inner()
File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/home/matheus/development/simclr/tf2/run.py", line 572, in single_step
should_record = tf.equal((optimizer.iterations + 1) % steps_per_loop, 0)
Node: 'mod'
Detected at node 'mod' defined at (most recent call last):
File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/threading.py", line 930, in _bootstrap
self._bootstrap_inner()
File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/home/matheus/development/simclr/tf2/run.py", line 572, in single_step
should_record = tf.equal((optimizer.iterations + 1) % steps_per_loop, 0)
Node: 'mod'
2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node mod}}]]
[[Func/while/body/_1/image/write_summary/summary_cond/then/_894/input/_907/_26]]
(1) UNKNOWN: JIT compilation failed.
[[{{node mod}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_multiple_steps_17859]
After a lot of back and forth, it looks like the problem is with the conda package of the CUDA toolkit. TensorFlow looks for a library called libdevice in the directory ${CUDA_DIR}/nvvm/libdevice, as we can see here. The main problem is that TensorFlow seems to look for CUDA at /usr/local/cuda according to this file.
If that is true, how is TensorFlow able to find a cudatoolkit installed using conda, since its binaries are stored in a different path, such as ~/miniconda/envs/{my_env}/lib?
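One possible workaround to test, written as a sketch only: XLA reads the XLA_FLAGS environment variable, and its --xla_gpu_cuda_data_dir flag is meant to point XLA at a CUDA directory other than the default. Assuming the TensorFlow build in use honors that flag, and assuming the chosen directory really contains nvvm/libdevice (the conda layout may not), the variable could be set before TensorFlow is imported. The helper name add_cuda_data_dir and the choice of $CONDA_PREFIX/lib are hypothetical, not taken from the TensorFlow or SimCLR code.

```python
import os


def add_cuda_data_dir(existing_flags: str, cuda_dir: str) -> str:
    """Return an XLA_FLAGS value whose --xla_gpu_cuda_data_dir points at cuda_dir.

    Any previous --xla_gpu_cuda_data_dir entry is dropped; other flags are kept.
    """
    prefix = "--xla_gpu_cuda_data_dir="
    kept = [flag for flag in existing_flags.split() if not flag.startswith(prefix)]
    kept.append(prefix + cuda_dir)
    return " ".join(kept)


if __name__ == "__main__":
    # Assumption: the active conda environment is exposed through CONDA_PREFIX.
    conda_prefix = os.environ.get("CONDA_PREFIX", "")
    candidate_dir = os.path.join(conda_prefix, "lib")  # hypothetical location
    os.environ["XLA_FLAGS"] = add_cuda_data_dir(
        os.environ.get("XLA_FLAGS", ""), candidate_dir
    )
    print(os.environ["XLA_FLAGS"])
    # TensorFlow would have to be imported only after this point, so that
    # XLA sees the variable when it initializes.
```

If the error persists with the flag set, that would suggest the directory passed in still lacks the nvvm/libdevice subfolder rather than that the search path is wrong.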
Also, I was taking a look at the conda-forge cudatoolkit repository and found something interesting. It looks like the package copies the file from /nvvm/libdevice directly into the lib folder of cudatoolkit, so TensorFlow cannot find it later because the folder structure is not preserved. Does that make sense?
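To check this hypothesis on a given machine, a small script along these lines could help. It is only a sketch: it assumes the libdevice files are named like libdevice*.bc, and the function name check_libdevice_layout is made up for illustration. It lists every such file under a candidate CUDA directory and reports which of them sit in the nvvm/libdevice subfolder that the error message refers to.

```python
from pathlib import Path


def check_libdevice_layout(cuda_dir):
    """Return (all_found, in_expected_dir) for libdevice bitcode files.

    all_found: every libdevice*.bc file anywhere under cuda_dir.
    in_expected_dir: the subset located directly in cuda_dir/nvvm/libdevice,
    which is the layout named in the error message.
    """
    root = Path(cuda_dir)
    if not root.is_dir():
        return [], []
    expected_dir = root / "nvvm" / "libdevice"
    all_found = sorted(root.rglob("libdevice*.bc"))
    in_expected_dir = [p for p in all_found if p.parent == expected_dir]
    return all_found, in_expected_dir


if __name__ == "__main__":
    import os

    # Assumption: the conda environment's lib folder is the candidate CUDA dir.
    candidate = os.path.join(os.environ.get("CONDA_PREFIX", ""), "lib")
    found, expected = check_libdevice_layout(candidate)
    print("libdevice files found:", [str(p) for p in found])
    print("in nvvm/libdevice:", [str(p) for p in expected])
```

If the first list is non-empty and the second is empty, the file is present but flattened into another folder, which would match the behavior described above.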
I am interested in contributing a solution for this issue with a PR, if that is the case.
Appreciate any help here.