Is there a proper way to perform distributed training on TF Probability with mirrored strategy?
When we sample the weights from the Reparameterization layers, they differ a lot between devices, and because of that the loss and gradients get messed up, since my model replicas go out of sync across the devices.
Is there a proper way to set the random seed across all GPUs and sync the models across all nodes?
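For reference, here is a minimal sketch of roughly what I'm doing (the layer sizes, loss, and the use of tfp.layers.DenseReparameterization are just placeholders for my actual model):

import tensorflow as tf
import tensorflow_probability as tfp

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Bayesian layers whose weights are sampled via the
    # reparameterization trick on every forward pass.
    model = tf.keras.Sequential([
        tfp.layers.DenseReparameterization(64, activation="relu"),
        tfp.layers.DenseReparameterization(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Each replica draws its own weight samples during the forward pass,
# so the per-replica losses and gradients end up different.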
Thanks.
Hello @piEsposito
Thank you for using TensorFlow.
As the tf.distribute.MirroredStrategy documentation states: "Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called MirroredVariable. These variables are kept in sync with each other by applying identical updates."
import tensorflow as tf

seed = 42  # any fixed integer
tf.random.set_seed(seed)
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    tf.random.set_seed(seed)  # seed again inside the scope
The tf.random.set_seed function is used to set the global random seed. In a distributed setup, we have to set the seed manually to make sure all devices are initialized with the same seed.
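Putting this together with a TFP layer, a minimal sketch might look like the following (tfp.layers.DenseReparameterization and the layer sizes are only illustrative; please verify on your own setup that the per-replica weight samples actually stay in sync):

import tensorflow as tf
import tensorflow_probability as tfp

seed = 42  # any fixed integer
tf.random.set_seed(seed)

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    tf.random.set_seed(seed)  # re-seed inside the scope so every replica starts identically
    model = tf.keras.Sequential([
        tfp.layers.DenseReparameterization(64, activation="relu"),
        tfp.layers.DenseReparameterization(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")

# model.fit(...) then runs one replica per GPU; the model variables are
# mirrored and kept in sync by applying identical updates on each replica.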
Thank you.