Build TensorFlow with GPU support from source

I’m still trying to build TensorFlow with GPU support from source and not making any progress. I am running Mint 21 (same base as Ubuntu 22.04). Can anyone who has actually built it explain how to do it? The online docs are either incomplete or so full of “if”s that I don’t know what I am actually supposed to do. I am trying to set up the build environment in a conda environment to avoid hosing something important.

I have installed the 535 drivers and nvidia-cuda-toolkit using apt. The following works:

(tftest)$ nvidia-smi
Sat Oct 21 21:33:30 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05      CUDA Version: 12.2    |

(tftest) $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

I ran an nvidia docker image that runs a benchmark, to verify the GPU is working:

docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
1 Devices used for simulation

GPU Device 0: “Pascal” with compute capability 6.1

Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 6GB]
10240 bodies, total time for 10 iterations: 7.831 ms
= 133.909 billion interactions per second
= 2678.175 single-precision GFLOP/s at 20 flops per interaction

I have a supported GPU and it works.

I pulled the TensorFlow source from the git repository, and after much back and forth with nothing working, I decided to go with version 2.11, since I have been able to get that to build with CPU-only support. I tried to install CUDA and cuDNN in my conda environment as follows:

conda install -c nvidia cuda-python=11.5
conda install -c nvidia cudnn=8.1
conda install -c nvidia cudatoolkit=11.5

However, ~/anaconda3/envs/tftest/include does NOT have cuda.h. What do I install to get it?
I do have a cuda.h (version 11.5) in /usr/include; I think it must have come from running apt install nvidia-cuda-toolkit. But I don’t want to hose my system by installing a bunch of incompatible or conflicting junk in /usr.
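
As far as I can tell, none of those three packages ships the SDK headers (cuda-python is just the Python bindings, and cudatoolkit carries the runtime libraries). Here is how I checked, plus one option I am considering; cudatoolkit-dev from conda-forge is a guess on my part, so check that a build exists for the CUDA version you want before installing:

# with the tftest env active, $CONDA_PREFIX points at ~/anaconda3/envs/tftest
find "$CONDA_PREFIX" -name cuda.h
# list which cudatoolkit-dev builds conda-forge actually provides
conda search -c conda-forge cudatoolkit-dev
# if a matching 11.x build exists, it puts nvcc and the headers into the env
conda install -c conda-forge cudatoolkit-dev=11.5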

When I run configure in the tensorflow source directory, it asks the following questions:

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 11]:

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 2]:

Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]:

Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: /usr,/home/myname/anaconda3/envs/tftest

What are the correct answers to these questions? No matter what I enter, it just responds that something is missing, inconsistent, or conflicting, and then repeats the whole sequence again. For example:

Inconsistent CUDA toolkit path: /usr vs /usr/lib
Asking for detailed CUDA configuration…
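
For anyone else stuck in the same loop: configure can also be fed its answers through environment variables, which skips most of the interactive questions. A rough sketch for a CUDA 11.x / cuDNN 8.x setup; the versions and paths below are placeholders and have to match wherever the toolkit and cuDNN actually live:

export TF_NEED_CUDA=1
export TF_NEED_ROCM=0
export TF_CUDA_VERSION=11.2               # assumed: must match the toolkit you point it at
export TF_CUDNN_VERSION=8
export TF_CUDA_COMPUTE_CAPABILITIES=6.1   # GTX 1060 is compute capability 6.1
export TF_CUDA_PATHS=/usr,/home/myname/anaconda3/envs/tftest   # comma-separated base paths
export PYTHON_BIN_PATH=$(which python)
export PYTHON_LIB_PATH=$(python -c 'import site; print(site.getsitepackages()[0])')
./configure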

Hi @lazarus_long

Please try installing TensorFlow again by following the Build from source guide step by step and let us know if the issue persists. Please check the tested build configurations for the CUDA and cuDNN versions compatible with TensorFlow 2.11, which are cuDNN 8.1 and CUDA 11.2. Thank you.
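
A quick way to confirm what is actually installed before re-running configure (the header path is an assumption; cudnn_version.h lives wherever your cuDNN package put it, commonly /usr/include or under the CUDA install directory):

nvcc --version                                                  # CUDA toolkit (compiler) version
grep -A 2 '#define CUDNN_MAJOR' /usr/include/cudnn_version.h   # cuDNN 8+ records its version here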

Here is a snippet from my Dockerfile showing how I actually got mine to work.

FROM nvcr.io/nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive

#install the required packages using apt install

...

#build and install clang, llvm and bazel

...

RUN python_version=$(python3 --version 2>&1 | awk '{print $2}' | cut -d'.' -f1,2) \
    && echo "export TF_PYTHON_VERSION=$python_version" >> /root/.bashrc
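
# Note: a shell variable set inside a RUN step does not carry over to later ENV
# instructions, so the ENV TF_PYTHON_VERSION=$python_version line further down will
# expand to an empty value at image build time. One alternative (3.11 is only an
# example value) is to pass the version in explicitly:
# ARG TF_PYTHON_VERSION=3.11
# ENV TF_PYTHON_VERSION=${TF_PYTHON_VERSION}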


#install tensorrt

...

RUN wget https://github.com/tensorflow/tensorflow/archive/refs/tags/v2.15.0.tar.gz && \
    tar -xf v2.15.0.tar.gz && \
    mv tensorflow-2.15.0 tensorflow && \
    rm v2.15.0.tar.gz

#Set the environment variables for building tensorflow
ENV TF_PYTHON_VERSION=$python_version
ENV TF_CUDNN_VERSION='8'
ENV TF_CUDA_VERSION='12.1'
ENV TF_NEED_TENSORRT=1
ENV PYTHON_BIN_PATH=/usr/bin/python3
ENV PYTHON_LIB_PATH=/usr/lib/python3/dist-packages
ENV TF_NEED_ROCM=0
ENV TF_NEED_CUDA=1
ENV TF_CUDA_COMPUTE_CAPABILITIES='3.5,7.0'
ENV TF_CUDA_CLANG=1
ENV CLANG_CUDA_COMPILER_PATH=/usr/bin/clang
ENV CC_OPT_FLAGS="-Wno-sign-compare"
ENV TF_SET_ANDROID_WORKSPACE=0

WORKDIR /tensorflow
RUN ./configure
RUN bazel build --jobs=$(nproc) --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=1" -c opt \
    --copt=-Wno-gnu-offsetof-extensions \
    --copt=-Wno-error=unused-command-line-argument \
    --config=noaws --config=nogcp --config=nohdfs \
    --verbose_failures //tensorflow:libtensorflow.so \
    //tensorflow:libtensorflow_cc.so  \
    //tensorflow:libtensorflow_framework.so \
    //tensorflow:install_headers
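
# If you want the Python wheel rather than just the C++ libraries, the same
# configuration can drive the pip-package target (a sketch for the 2.15 tree, where
# the target is still build_pip_package; the output directory is arbitrary):
# RUN bazel build --jobs=$(nproc) -c opt --config=cuda \
#     //tensorflow/tools/pip_package:build_pip_package && \
#     ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg && \
#     pip install /tmp/tensorflow_pkg/tensorflow-*.whl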

# build and install protobuf 21.x or lower, trying to link and build your tensorflow code 
# with protobuf version higher than 21.x will drive you to tears.
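
# A sketch of that protobuf step (assumptions: v21.12, the CMake build, installing to
# /usr/local; adjust shared/static options and the prefix to match how you link):
# RUN wget https://github.com/protocolbuffers/protobuf/archive/refs/tags/v21.12.tar.gz && \
#     tar -xf v21.12.tar.gz && cd protobuf-21.12 && \
#     cmake -S . -B build -Dprotobuf_BUILD_TESTS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON && \
#     cmake --build build -j$(nproc) && cmake --install build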

# move the appropriate header files and library to the appropriate directories. 

svaraikyam

Hello Renu
I am following this link Build from source | TensorFlow, but I see this error right at the beginning.
What am I doing wrong here?

Python version is 3.9/3.11 and CUDA is 12.2, with the NVIDIA 535 driver on an NVIDIA RTX 3060.

(tf-build) avaish@desktop:/media/avaish/labdisk/tensorflow$ bazel build //tensorflow/tools/pip_package:wheel --repo_env=WHEEL_NAME=tensorflow --config=cuda
Starting local Bazel server and connecting to it…
ERROR: Unrecognized option: --repo_env=WHEEL_NAME=tensorflow
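
(That error usually means the bazel binary on PATH is older than the version the checked-out TensorFlow source expects; the repo pins the tested version in .bazelversion, so comparing the two is a quick check, and installing bazelisk as bazel makes the pinned version get picked up automatically.)

cat .bazelversion    # version the TensorFlow source tree expects
bazel --version      # version actually being run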

regards
Svar