We are trying to run distributed training on a cluster with CPUs only. After reading the tutorials we chose tf.distribute.MultiWorkerMirroredStrategy, but some things are still confusing us. The documentation says we need to prepare the same code on every worker and that this strategy will distribute the model, checkpoint, and dataset to every worker. But does it send the actual sample data, or just the index of every sample? Do we need to prepare the model, the checkpoint, and the whole dataset on every worker? We were hoping the chief worker could load all the data itself and send each worker whatever it needs, so the other workers would not have to prepare the training data. It is not easy for us to put all the data on every worker because of our business rules for cluster usage.
We tried loading the checkpoint only on the chief worker and the program did not work. We also tried loading the whole dataset on the chief worker and only a part of the dataset on the other workers, and that did not work either.
Hello @wzz
Thank you for using TensorFlow.
As per the TensorFlow documentation on custom training loops and the definition of MultiWorkerMirroredStrategy, the recommended approach is to use a common storage path that all workers can access, rather than storing a copy of the data on every worker, which would raise storage space issues. In your case you may use cloud storage or a shared location, and preferably build the input pipeline with tf.data.Dataset, since it can process the data efficiently and shard it across workers if needed.
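For illustration, here is a minimal sketch of how such a pipeline might look. The path `/shared/data/train-*.tfrecord` is just a placeholder for whatever shared or cloud location you use, the tiny model only keeps the example self-contained, and TFRecord parsing is omitted:

```python
import tensorflow as tf

# Placeholder for a location every worker can read (NFS mount, GCS bucket, etc.).
DATA_PATTERN = "/shared/data/train-*.tfrecord"
GLOBAL_BATCH_SIZE = 64

# TF_CONFIG must already be set on each worker before creating the strategy.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

def make_dataset():
    files = tf.data.Dataset.list_files(DATA_PATTERN, shuffle=True, seed=1)
    ds = tf.data.TFRecordDataset(files)
    # Parsing of the serialized examples is omitted in this sketch.
    ds = ds.shuffle(10_000).batch(GLOBAL_BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
    # With AUTO sharding, each worker reads only its own subset of the
    # files/elements, so no worker needs a full local copy of the dataset.
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.AUTO
    )
    return ds.with_options(options)

with strategy.scope():
    # Dummy model just to keep the sketch self-contained; replace with your own.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")

# Every worker runs this same script; passing the dataset to model.fit()
# lets Keras shard the input automatically across the workers.
# model.fit(make_dataset(), epochs=10)
```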
Model checkpoints would also be stored in common storage; in the tutorial the chief worker is assigned this task.
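A common pattern from the multi-worker tutorials looks roughly like the sketch below (`/shared/ckpt` is again a placeholder path): every worker participates in saving, but only the chief writes into the shared directory, while the other workers write to a throwaway temporary directory.

```python
import tempfile
import tensorflow as tf

def _is_chief(task_type, task_id):
    # With MultiWorkerMirroredStrategy, worker 0 acts as the chief when
    # no dedicated "chief" task is configured in TF_CONFIG.
    return task_type is None or task_type == "chief" or (
        task_type == "worker" and task_id == 0)

strategy = tf.distribute.MultiWorkerMirroredStrategy()
resolver = strategy.cluster_resolver
task_type, task_id = resolver.task_type, resolver.task_id

# Placeholder shared directory; only the chief writes the real checkpoint here.
checkpoint_dir = "/shared/ckpt"
if not _is_chief(task_type, task_id):
    checkpoint_dir = tempfile.mkdtemp()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # your model here

checkpoint = tf.train.Checkpoint(model=model)
manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir, max_to_keep=3)

# Saving is still called on every worker so they stay in sync, but only the
# chief's files end up in the shared directory.
manager.save()
```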
Thank you.