Train a model on another machine using data from the main machine

Setup

  • A server (server A) with 150 TB of storage
  • An NVIDIA DGX with 4 GPUs, connected to the same Wi-Fi network as server A but a separate machine

Problem we are facing:

  • The entire training data is on server A, and the DGX has the computing power.
  • We are trying to find a way to train a model on the DGX while using the data from server A.
  • We tried TensorFlow distributed training and specified the DGX IP as the cluster worker (see the sketch after this list). The code ran, server A reported a gRPC server running (tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://ip_address:port), and the model started training.
  • But there is no GPU utilization on the DGX, nor is there any process running on it.
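
For reference, here is a minimal sketch of roughly what we ran; the address is a placeholder, and the real script of course also builds and trains the model:

import tensorflow as tf

# Placeholder address; in our setup this would be the DGX's IP and an open port.
cluster = tf.train.ClusterSpec({"worker": ["dgx_ip:2222"]})

# Starting the server prints the "Started server with target: grpc://..." line.
server = tf.distribute.Server(cluster, job_name="worker", task_index=0)
server.join()  # keep the worker process alive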

Is there a good resource that explicitly covers training on remote GPU machines from a storage host server, locally, without using any cloud pipelines like GCP? Or are there changes needed to our config above?

Hello @Saish

Thanks for using TensorFlow,

Could you please make sure TensorFlow is able to access the GPUs by using the command

tf.config.list_physical_devices('GPU')
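
For example, a quick check you can run on the DGX (the machine with the GPUs), which should list all four of them:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print("Num GPUs available:", len(gpus))
for gpu in gpus:
    print(gpu)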

Ping from server A to the DGX and make sure it responds, and also check that the DGX can ping server A. Set up an SSH file system (SSHFS) on the DGX so the data on server A is accessible to both machines.
Please also update your TF version to the latest and follow the updated tutorial.
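
For example, assuming the SSHFS mount exposes server A's data on the DGX at a hypothetical path like /mnt/server_a_data, and assuming the data is in TFRecord files, the input pipeline on the DGX can then read it as if it were local:

import tensorflow as tf

# Hypothetical SSHFS mount point exposing server A's data on the DGX.
DATA_DIR = "/mnt/server_a_data"

# Build a standard tf.data pipeline that reads the files over the mount.
files = tf.data.Dataset.list_files(DATA_DIR + "/*.tfrecord")
dataset = (
    tf.data.TFRecordDataset(files)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)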
Thank you