Hello,
I managed to run a simple distributed training job using the code below. I followed the documentation and several blog posts, and I documented the setup here.
But I couldn't find any instructions on using separate VMs to set up truly distributed training. I understand some cost is involved, but my goal is just to experiment with a simple setup. Am I right in assuming that the subprocess launches in the code below are not a full-fledged distributed setup?
Others have set this up before. Can you help?
```python
import json
import os
import subprocess

import tensorflow as tf

# Let each process allocate GPU memory as needed instead of grabbing it all up front
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

# The cluster spec is a dictionary with one key per job,
# and the values are lists of task addresses (IP:port)
cluster_spec = {"worker": ["127.0.0.1:9901",
                           "127.0.0.1:9902"]}

# Set the TF_CONFIG environment variable before starting TensorFlow:
# a JSON-encoded dictionary containing the cluster specification (under the "cluster" key)
# and the type and index of the current task (under the "task" key)
for index, worker_address in enumerate(cluster_spec["worker"]):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)   # pin each worker to its own GPU
    os.environ["TF_CONFIG"] = json.dumps({"cluster": cluster_spec,
                                          "task": {"type": "worker",
                                                   "index": index}})
    subprocess.Popen("python /home/jupyter/task.py", shell=True)
```
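
For context, task.py runs the actual training. A minimal sketch of the kind of worker script I mean, assuming `tf.distribute.MultiWorkerMirroredStrategy` (which reads TF_CONFIG on its own) and a placeholder model and data:

```python
# task.py -- minimal multi-worker training sketch (placeholder model and data)
import numpy as np
import tensorflow as tf

# MultiWorkerMirroredStrategy picks up the cluster layout from TF_CONFIG automatically
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Every worker builds the same model inside the strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data so the sketch is self-contained
x = np.random.random((256, 10)).astype("float32")
y = np.random.random((256, 1)).astype("float32")

# The global batch is split across the workers listed in TF_CONFIG
model.fit(x, y, batch_size=32, epochs=2)
```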
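
What I'm unsure about is the separate-VM case. My assumption (please correct me if this is wrong) is that instead of looping and spawning subprocesses on one machine, I would run the same task.py once on each VM, with the VMs' addresses in the cluster spec and a TF_CONFIG that only differs in the task index. Something like the sketch below, where the 10.128.0.x addresses and port 9901 are made-up placeholders:

```python
# Run on each VM before launching task.py; only the task "index" changes per VM.
import json
import os

cluster_spec = {"worker": ["10.128.0.2:9901",    # VM 1 (placeholder internal IP)
                           "10.128.0.3:9901"]}   # VM 2 (placeholder internal IP)

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster_spec,
    "task": {"type": "worker", "index": 0},      # use 0 on VM 1 and 1 on VM 2
})

# then launch the worker on this VM: python task.py
```

Is that the right direction for a truly distributed setup across VMs?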