Hi, I’m currently trying to understand the internal workings of `MirroredStrategy`, and in particular how gradients of `MirroredVariable`s are recorded. I understand the concept of a `MirroredVariable`, but it’s unclear to me how a correct gradient tape is recorded over these variables in `_call_for_each_replica` in `mirrored_strategy`. As this implementation seems mostly covered by `mirrored_run`, I mainly focused on that file instead. Say we have one `MirroredVariable` with the following signature:
```
MirroredVariable {
    0: <tf.Variable 'w:0' shape=() dtype=float32>,
    1: <tf.Variable 'w/replica_1:0' shape=() dtype=float32>
}
```
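For reference, this is roughly how I end up with such a variable; the device list and the optimizer below are just illustrative stand-ins for my actual setup:

```python
import tensorflow as tf

# Illustrative setup: two devices, so the strategy keeps one copy of `w` per replica.
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

with strategy.scope():
    w = tf.Variable(1.0, name="w")
    optimizer = tf.keras.optimizers.SGD(0.1)

# The per-replica components of the resulting MirroredVariable can be inspected
# through its `values` attribute (not part of the stable public API):
print(w.values)
# (<tf.Variable 'w:0' shape=() dtype=float32>, <tf.Variable 'w/replica_1:0' shape=() dtype=float32>)
```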
I’ve tried to understand this behavior by altering the `_call_for_each_replica` implementation so that it runs every function sequentially on the defined devices (essentially just removing the replica threads; a rough sketch of that change is at the end of this post). This works for variable creation, computation and reduction, but it breaks when recording gradients. Say I have the following function:
```python
from tensorflow.python.eager import backprop, def_function

@def_function.function
def step(x):
    with backprop.GradientTape() as tape:
        loss = w * x
    optimizer.minimize(loss, var_list=[w], tape=tape)

strategy.run(step, args=(2.0,))
```
This yields:
```
No gradients provided for any variable: ['w:0']
```
Adding `tape.watch(w)` doesn’t change anything, and my guess is that this is due to the function wrapping that happens in `call_for_each_replica` in `mirrored_run`. Could anyone shed some light on how these gradients are recorded here?
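For completeness, my sequential variant of `_call_for_each_replica` looks roughly like the following. This is a heavily simplified sketch of the idea rather than my exact code; the real implementation in `mirrored_run.py` spawns one `_MirroredReplicaThread` per replica and sets up a per-replica context, both of which this version skips:

```python
from tensorflow.python.distribute import distribute_utils
from tensorflow.python.framework import ops


def _call_for_each_replica(distribution, fn, args, kwargs):
    """Sequential stand-in for mirrored_run._call_for_each_replica (sketch only)."""
    results = []
    for device in distribution.extended.worker_devices:
        with ops.device(device):
            # No replica threads and no replica context here: the function is
            # simply called once per device, one device after the other.
            results.append(fn(*args, **kwargs))
    # Pack the per-device results back into a per-replica structure.
    return distribute_utils.regroup(tuple(results))
```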