I am trying to reimplement inverting gradients, as described in this paper: https://arxiv.org/pdf/1511.04143.pdf, with GradientTape in TensorFlow 2.7. In this example I use the Pendulum domain, which has an observation size of 3, an action size of 1, and no discrete actions.
Someone solved it for TensorFlow 1.0 here: How to implement inverting gradient in Tensorflow? (Stack Overflow)
But I am struggling to reimplement it for TensorFlow 2.
As far as I understand, we need the derivative of Q(s, a(w, s)) with respect to w, i.e. (dq/da) * (da/dw), with w being the weights of the policy network.
This is needed to update the weights of the policy network.
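Without the inversion step, this chain rule is what a single gradient call already computes end to end. A minimal sketch of that plain update, using the same networks as in my code below:

import tensorflow as tf

with tf.GradientTape() as tape:
    actions = self.policy_Net(states)
    q, _, _ = self.value_Net(states, actions)
    loss = -tf.math.reduce_mean(q)
# dloss/dw = (dloss/da) * (da/dw), fused into one backward pass
grads = tape.gradient(loss, self.policy_Net.trainable_variables)

But for inverting gradients I need to modify dq/da before it is multiplied with da/dw, so I split the computation up.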
So we can access the derivative dq/da via:
dq_das = tf.Variable(tape.gradient(loss, actions))
Now we can calculate the inverting gradients. The shape matches the actions, and this part computes without problems:
upper = 1
lower = -1
for i in range(dq_das.shape[0]):
    dq_da = dq_das[i]
    action = actions[i]
    if dq_da >= 0.0:
        dq_das[i].assign(dq_da * (upper - action) / (upper - lower))
    else:
        dq_das[i].assign(dq_da * (action - lower) / (upper - lower))
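For what it's worth, I think the same rule can also be written without the Python loop (a sketch, assuming dq_das and actions both have shape (batch_size, action_dim)):

# scale positive gradients by the distance to the upper bound,
# negative gradients by the distance to the lower bound
inverted_dq_das = tf.where(
    dq_das >= 0.0,
    dq_das * (upper - actions) / (upper - lower),
    dq_das * (actions - lower) / (upper - lower))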
We can access the derivative da/dw via:
da_dw = tape.gradient(actions, self.policy_Net.trainable_variables)
The problem now is that the shapes don't fit when I want to calculate dq_da * da_dw. For dq_da I get:
<tf.Variable 'Variable:0' shape=(124, 1) dtype=float32>
which makes sense, since the batch size is 124 and there is one action. And for da_dw I get:
[<tf.Tensor 'gradient_tape/policy__network/dense/MatMul_1:0' shape=(3, 400) dtype=float32>, <tf.Tensor 'gradient_tape/policy__network/dense/BiasAdd/BiasAddGrad_1:0' shape=(400,) dtype=float32>, <tf.Tensor 'gradient_tape/policy__network/dense_1/MatMul_3:0' shape=(403, 300) dtype=float32>, <tf.Tensor 'gradient_tape/policy__network/dense_1/BiasAdd/BiasAddGrad_1:0' shape=(300,) dtype=float32>, <tf.Tensor 'gradient_tape/policy__network/dense_2/MatMul_3:0' shape=(703, 1) dtype=float32>, <tf.Tensor 'gradient_tape/policy__network/dense_2/BiasAdd/BiasAddGrad_1:0' shape=(1,) dtype=float32>]
Where is my mistake? Thanks a lot!
My code looks like this so far:
@tf.function
def __inv_Grads__(self, states):
    #states = tf.Variable(states)
    with tf.GradientTape(persistent=True) as tape:
        actions = self.policy_Net(states)
        q, _, _ = self.value_Net(states, actions)
        loss = -tf.reduce_sum(q, axis=1, keepdims=True)
        loss = tf.math.reduce_mean(loss)
    # dq/da, wrapped in a Variable so the slices can be reassigned below
    dq_das = tf.Variable(tape.gradient(loss, actions))
    da_dw = tape.gradient(actions, self.policy_Net.trainable_variables)
    inverting_gradients = []
    upper = 1
    lower = -1
    # rescale each gradient depending on its sign (inverting gradients rule)
    for i in range(dq_das.shape[0]):
        dq_da = dq_das[i]
        action = actions[i]
        if dq_da >= 0.0:
            dq_das[i].assign(dq_da * (upper - action) / (upper - lower))
        else:
            dq_das[i].assign(dq_da * (action - lower) / (upper - lower))
    print(dq_das)
    print(da_dw)
    print(dq_das * da_dw)  # this is where the shapes don't fit
    exit()
    return 0
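One idea I had, but have not verified: tf.GradientTape.gradient accepts an output_gradients argument, which is multiplied into the backward pass at the target. If that is the right mechanism, the product (dq/da) * (da/dw) could maybe be formed like this instead of an explicit multiplication (a sketch, with inverted_dq_das being the modified dq/da of shape (124, 1)):

# backpropagate the modified dq/da through the policy network;
# the result is summed over the batch, one tensor per trainable variable
policy_grads = tape.gradient(
    actions,
    self.policy_Net.trainable_variables,
    output_gradients=inverted_dq_das)

Is that the intended way, or is there a better one?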