We are trying to run distributed training on a cluster with CPUs only. After reading the tutorials we chose tf.distribute.MultiWorkerMirroredStrategy, but some things are still confusing us. The documentation says we need to prepare the same code on every worker and that this strategy will distribute the model, checkpoint, and dataset to every worker. But does it send the actual sample data, or just the index of every sample? Do we need to prepare the model, the checkpoint, and the whole dataset on every worker? We were hoping the chief worker could load all the data itself and send each worker whatever it needs, so the other workers would not have to prepare the training data. It is not easy for us to put all the data on every worker because of our business rules for cluster usage.
We tried loading the checkpoint only on the chief worker and the program did not work. We also tried loading the whole dataset on the chief worker and only a part of the dataset on the other workers, and that did not work either.
Hello @wzz
Thank you for using TensorFlow.
As per the TensorFlow documentation on custom training loops and the definition of MultiWorkerMirroredStrategy, the recommended approach is to use a common storage path that all workers can access, rather than storing a copy of the data on every worker, which would raise storage space issues. In your case you may use cloud storage or a shared location, and preferably build the input pipeline with tf.data.Dataset, since it can process the data efficiently and shard it across workers if needed.
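For illustration, here is a minimal sketch of how such a pipeline might look. The path `/shared/data/train-*.tfrecord` is just a placeholder for whatever shared or cloud location you use, the tiny model only keeps the example self-contained, and TFRecord parsing is omitted:

```python
import tensorflow as tf

# Placeholder for a location every worker can read (NFS mount, GCS bucket, etc.).
DATA_PATTERN = "/shared/data/train-*.tfrecord"
GLOBAL_BATCH_SIZE = 64

# TF_CONFIG must already be set on each worker before creating the strategy.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

def make_dataset():
    files = tf.data.Dataset.list_files(DATA_PATTERN, shuffle=True, seed=1)
    ds = tf.data.TFRecordDataset(files)
    # Parsing of the serialized examples is omitted in this sketch.
    ds = ds.shuffle(10_000).batch(GLOBAL_BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
    # With AUTO sharding, each worker reads only its own subset of the
    # files/elements, so no worker needs a full local copy of the dataset.
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.AUTO
    )
    return ds.with_options(options)

with strategy.scope():
    # Dummy model just to keep the sketch self-contained; replace with your own.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")

# Every worker runs this same script; passing the dataset to model.fit()
# lets Keras shard the input automatically across the workers.
# model.fit(make_dataset(), epochs=10)
```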
Model checkpoints would also be stored in common storage; in the tutorial the chief worker is assigned this task.
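A common pattern from the multi-worker tutorials looks roughly like the sketch below (`/shared/ckpt` is again a placeholder path): every worker participates in saving, but only the chief writes into the shared directory, while the other workers write to a throwaway temporary directory.

```python
import tempfile
import tensorflow as tf

def _is_chief(task_type, task_id):
    # With MultiWorkerMirroredStrategy, worker 0 acts as the chief when
    # no dedicated "chief" task is configured in TF_CONFIG.
    return task_type is None or task_type == "chief" or (
        task_type == "worker" and task_id == 0)

strategy = tf.distribute.MultiWorkerMirroredStrategy()
resolver = strategy.cluster_resolver
task_type, task_id = resolver.task_type, resolver.task_id

# Placeholder shared directory; only the chief writes the real checkpoint here.
checkpoint_dir = "/shared/ckpt"
if not _is_chief(task_type, task_id):
    checkpoint_dir = tempfile.mkdtemp()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # your model here

checkpoint = tf.train.Checkpoint(model=model)
manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir, max_to_keep=3)

# Saving is still called on every worker so they stay in sync, but only the
# chief's files end up in the shared directory.
manager.save()
```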
Thank you.