Retracing with Distributed Training

Hi guys,

I am training a custom model with distributed training across multiple GPUs, with the training step wrapped in tf.function(). The graph retraces on every call. To fix this I passed the input_signature argument, with explicit tf.TensorSpec() entries, to tf.function(). This works fine on 1 GPU, but as soon as I use multiple GPUs it fails with the error 'PerReplica does not have dtype'.
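
Roughly, my setup looks like the sketch below (the model, shapes and loss are simplified placeholders; my real model is more involved):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Placeholder model/optimizer/loss for illustration only.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()
    loss_fn = tf.keras.losses.MeanSquaredError()

def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Fixing the signature stops the retracing on 1 GPU, where the distributed
# dataset yields plain tensors ...
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, 10], dtype=tf.float32),  # features (shape simplified)
    tf.TensorSpec(shape=[None, 1], dtype=tf.float32),   # labels
])
def distributed_train_step(x, y):
    return strategy.run(train_step, args=(x, y))

# ... but on multiple GPUs the dataset yields PerReplica objects, which a
# tf.TensorSpec cannot describe, and this loop raises the dtype error:
# for x, y in strategy.experimental_distribute_dataset(dataset):
#     distributed_train_step(x, y)
```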

Does anyone have an idea how I can solve this problem?

Hello @King_Gee

Thank you for using TensorFlow.
In the training step, add the following line to avoid the PerReplica dtype issue, as mentioned in the documentation:
per_replica_inputs = strategy.experimental_local_results(inputs)
and pass per_replica_inputs to the model as a list.
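
Below is a minimal sketch of one way this can fit into a custom training loop with MirroredStrategy. The model, shapes, batch size and loss are placeholder assumptions, not your actual code: the fixed input_signature stays on the per-replica step, which only ever receives plain tensors inside strategy.run(), and strategy.experimental_local_results() is used to unpack a PerReplica value (the per-replica losses here; the same call works on the distributed inputs) into an ordinary tuple you can pass on as a list.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH = 64  # assumed global batch size

with strategy.scope():
    # Placeholder model/optimizer/loss for illustration only.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()
    loss_fn = tf.keras.losses.MeanSquaredError(
        reduction=tf.keras.losses.Reduction.NONE)

# The fixed signature lives on the per-replica step: inside strategy.run
# each replica receives plain tensors, so the TensorSpec matches and the
# function is traced only once.
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, 10], dtype=tf.float32),  # features (assumed shape)
    tf.TensorSpec(shape=[None, 1], dtype=tf.float32),   # labels
])
def train_step(x, y):
    with tf.GradientTape() as tape:
        per_example_loss = loss_fn(y, model(x, training=True))
        loss = tf.nn.compute_average_loss(per_example_loss,
                                          global_batch_size=GLOBAL_BATCH)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# No input_signature here, because this step receives PerReplica values.
@tf.function
def distributed_train_step(x, y):
    per_replica_loss = strategy.run(train_step, args=(x, y))
    # experimental_local_results unpacks a PerReplica value into a tuple of
    # ordinary tensors (one per GPU); the same call works on the distributed
    # inputs x / y if you need to hand them somewhere as a plain list.
    local_losses = strategy.experimental_local_results(per_replica_loss)
    return tf.add_n(local_losses)

# Toy data, just to make the sketch runnable.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 10]), tf.random.normal([256, 1]))
).batch(GLOBAL_BATCH)

for x, y in strategy.experimental_distribute_dataset(dataset):
    loss = distributed_train_step(x, y)
```

The essential point is that a tf.TensorSpec signature should only ever meet plain tensors, while experimental_local_results() gives you an ordinary tuple to work with whenever a PerReplica value shows up.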