How to implement inverting gradients [PDQN, MPDQN] in TensorFlow 2.7

I am trying to reimplement inverting gradients with tf.GradientTape in TensorFlow 2.7. In this example I use the Pendulum domain, which has an observation size of 3, an action size of 1, and no discrete actions.
The technique is shown in this paper:

Someone solved it for TensorFlow 1.0 here: "How to implement inverting gradient in Tensorflow?" (Stack Overflow)

But I am struggling to reimplement it for TensorFlow 2.

As far as I understand, we need the derivative of Q(s, a(w, s)) with respect to w, that is (dq/da)*(da/dw), with w being the weights of the policy network.
This is needed to update the weights of the policy network.
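
Written out for a batch of N states, the chain rule is a sum of per-sample products, so the batch dimension disappears in the final weight gradient. This is only a sketch in the notation used here; the index i and the symbol N are introduced for illustration:

$$
\nabla_w \sum_{i=1}^{N} Q\bigl(s_i, a(w, s_i)\bigr) \;=\; \sum_{i=1}^{N} \frac{\partial Q(s_i, a_i)}{\partial a_i}\,\frac{\partial a(w, s_i)}{\partial w}, \qquad a_i = a(w, s_i)
$$

With inverting gradients, each factor dq/da for sample i would be rescaled before it is multiplied with da/dw for that same sample.
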
So we can access the derivative dq/da via:

dq_das = tf.Variable(tape.gradient(loss, actions))

Now we can calculate the inverting gradients. The shape matches that of actions, so there is no problem calculating this:


    for i in range(dq_das.shape[0]):
        dq_da = dq_das[i]
        action = actions[i]
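        if dq_da >= 0.0:
            # Non-negative gradient: scale by the remaining room below the upper bound.
            dq_das[i].assign(dq_da * (upper - action) / (upper - lower))
        else:
            # Negative gradient: scale by the remaining room above the lower bound.
            dq_das[i].assign(dq_da * (action - lower) / (upper - lower))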
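
The same rule can probably be written without a Python loop or a `tf.Variable`. The following is a minimal sketch using `tf.where`; the helper name `invert_gradients` and the bounds of -2 and 2 are assumptions for illustration, not part of the original code. It follows the loop's convention that a non-negative entry is scaled by `(upper - action)`; whether that corresponds to "increase the action" depends on whether the gradient was taken of Q or of a loss such as -Q.

```python
import tensorflow as tf

def invert_gradients(dq_da, actions, lower, upper):
    """Scale each gradient entry by the remaining headroom of its action.

    Non-negative entries are scaled by (upper - action) / (upper - lower),
    negative entries by (action - lower) / (upper - lower).
    """
    width = upper - lower
    scale_up = (upper - actions) / width
    scale_down = (actions - lower) / width
    # tf.where picks the scale element-wise, so no per-sample loop is needed.
    return dq_da * tf.where(dq_da >= 0.0, scale_up, scale_down)

# Assumed action bounds of [-2, 2] and made-up gradient values.
grads = tf.constant([[1.0], [-1.0], [0.5]])
acts = tf.constant([[1.0], [1.0], [-2.0]])
inverted = invert_gradients(grads, acts, lower=-2.0, upper=2.0)
```

The result keeps the shape of `actions`, here `(3, 1)`, and stays an ordinary tensor.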

The derivative da_dw we can access via:

da_dw = tape.gradient(actions, self.policy_Net.trainable_variables)

The problem now is that the shapes don't fit if I want to calculate dq_da * da_dw.

For dq_das I get:
<tf.Variable 'Variable:0' shape=(124, 1) dtype=float32>

which makes sense, since the batch size is 124 and there is one action. And for da_dw I get:

[<tf.Tensor 'gradient_tape/policy__network/dense/MatMul_1:0' shape=(3, 400) dtype=float32>, <tf.Tensor 'gradient_tape/policy__network/dense/BiasAdd/BiasAddGrad_1:0' shape=(400,) dtype=float32>, <tf.Tensor 'gradient_tape/policy__network/dense_1/MatMul_3:0' shape=(403, 300) dtype=float32>, <tf.Tensor 'gradient_tape/policy__network/dense_1/BiasAdd/BiasAddGrad_1:0' shape=(300,) dtype=float32>, <tf.Tensor 'gradient_tape/policy__network/dense_2/MatMul_3:0' shape=(703, 1) dtype=float32>, <tf.Tensor 'gradient_tape/policy__network/dense_2/BiasAdd/BiasAddGrad_1:0' shape=(1,) dtype=float32>]

Where is my mistake? Thanks a lot!
My code looks like this so far:

def __inv_Grads__(self,states):
    #states = tf.Variable(states)
    with tf.GradientTape(persistent=True) as tape:
        actions = self.policy_Net(states)
        q,_,_ = self.value_Net(states,actions)
        loss = -tf.reduce_sum(q,axis=1,keepdims=True)
        loss = tf.math.reduce_mean(loss)
    dq_das = tf.Variable(tape.gradient(loss, actions))
    da_dw = tape.gradient(actions, self.policy_Net.trainable_variables)
    inverting_gradients = []

    for i in range(dq_das.shape[0]):
        dq_da = dq_das[i]
        action = actions[i]
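        if dq_da >= 0.0:
            # Non-negative gradient: scale by the remaining room below the upper bound.
            dq_das[i].assign(dq_da * (upper - action) / (upper - lower))
        else:
            # Negative gradient: scale by the remaining room above the lower bound.
            dq_das[i].assign(dq_da * (action - lower) / (upper - lower))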
    return 0
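
One way the shapes might be reconciled, offered as a sketch rather than a verified fix: `tape.gradient(actions, weights)` with a non-scalar target returns the gradient of the sum over all entries of `actions`, so the batch dimension is already summed out and there is nothing left to multiply per sample. Passing the inverted per-sample gradients through the `output_gradients` argument of `tape.gradient` lets the tape form the weighted sum (dq/da)*(da/dw) itself. The tiny `policy_net` and `value_net` functions, the variables `w_pi`, `b_pi`, `w_q`, and the bounds `lower`/`upper` are hypothetical stand-ins for the networks in the question. The sketch takes the gradient of Q itself rather than of the loss -Q, so that a non-negative entry means "increase the action", and flips the sign afterwards for a minimizing optimizer.

```python
import tensorflow as tf

tf.random.set_seed(0)
lower, upper = -2.0, 2.0  # assumed Pendulum action bounds

# Hypothetical stand-ins for the policy and value networks.
w_pi = tf.Variable(tf.random.normal((3, 1)))
b_pi = tf.Variable(tf.zeros((1,)))
w_q = tf.Variable(tf.random.normal((4, 1)))
policy_vars = [w_pi, b_pi]

def policy_net(states):
    # tanh keeps the actions inside [lower, upper].
    return upper * tf.tanh(states @ w_pi + b_pi)

def value_net(states, actions):
    return tf.concat([states, actions], axis=1) @ w_q

states = tf.random.normal((124, 3))

with tf.GradientTape(persistent=True) as tape:
    actions = policy_net(states)
    q = value_net(states, actions)

# Per-sample dQ/da, shape (124, 1): each q[i] depends only on actions[i].
dq_da = tape.gradient(q, actions)

# Inverting step, vectorised over the batch.
width = upper - lower
scale = tf.where(dq_da >= 0.0,
                 (upper - actions) / width,
                 (actions - lower) / width)
inverted = dq_da * scale

# Vector-Jacobian product: sum_i inverted[i] * d actions[i] / d w.
# The minus sign turns ascent on Q into descent; dividing by the
# batch size averages over the batch.
batch_size = tf.cast(tf.shape(states)[0], tf.float32)
grads = tape.gradient(actions, policy_vars,
                      output_gradients=-inverted / batch_size)
del tape  # release the persistent tape
```

Each entry of `grads` then has the shape of the matching variable in `policy_vars`, so the list could be handed to an optimizer's `apply_gradients` together with those variables.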