Setup
- A server (server A) with 150 TB of storage.
- An NVIDIA DGX with 4 GPUs, connected to the same Wi-Fi network as server A, but not the same machine.
Problem we are facing:
- The entire training dataset resides on server A, while the DGX has the computing power.
- We are trying to find a way to train a model on the DGX while using the data from server A.
- We tried TensorFlow distributed training, specifying the DGX IP as the cluster worker (roughly along the lines of the sketch after this list). The code ran on server A, reported a gRPC server running:
  tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://ip_address:port
  and the model started training.
- However, there is no GPU utilization on the DGX, nor is there any process running on it.
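For reference, here is a minimal sketch of the kind of cluster spec and gRPC server start that produces the log line quoted above. The IPs, ports, and job names are placeholders, not our exact configuration:

```python
import tensorflow as tf

# Placeholder addresses for illustration only -- not our actual IPs/ports.
cluster = tf.train.ClusterSpec({
    "worker": ["DGX_IP:2222"],       # the DGX with the 4 GPUs
    "chief":  ["SERVER_A_IP:2223"],  # server A, where the 150 TB of data lives
})

# Starting a gRPC server is what emits the log line quoted above:
#   Started server with target: grpc://<ip>:<port>
# Our understanding is that a server process like this has to be launched
# on every machine listed in the cluster spec (each with its own job_name
# and task_index), not only on server A.
server = tf.distribute.Server(cluster, job_name="chief", task_index=0,
                              protocol="grpc")
server.join()  # keep the process alive so it can serve remote ops
```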
Is there a good resource that explicitly covers training on a remote GPU machine using data from a storage host server, locally, without using any cloud pipeline such as GCP, or any change we should make to the configuration above?