Unexpected behavior when using batch_jacobian with multiple inputs/outputs in quantum-classical neural network

I’m implementing a neural network that includes quantum layers (using PennyLane’s qml.qnn.KerasLayer) to solve ODEs. I want to encode several points at once and get several results of the ODE (one result per corresponding input) in the same run.
Currently I’m running a toy model where I try to recover u(x) = sin(x) using the loss residual du_dx - cos(x).
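
For concreteness, a minimal sketch of the toy-problem data (the names x_train, u_exact, and du_dx_exact are just illustrative, not from my actual code):

toy problem setup (illustrative)
import numpy as np
import tensorflow as tf

# collocation points at which the residual du/dx - cos(x) is evaluated
x_train = tf.convert_to_tensor(
    np.linspace(0.0, 2 * np.pi, 64, dtype=np.float32).reshape(-1, 1))
u_exact = tf.sin(x_train)       # the analytic solution sin(x), used only for checking
du_dx_exact = tf.cos(x_train)   # right-hand side of the residual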

The network structure is:

network structure
NN = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(in_out_size,)),
    tf.keras.layers.Dense(n_qubits, activation="tanh"),
    qml.qnn.KerasLayer(qnode, weight_shapes, output_dim=n_qubits),
    tf.keras.layers.Dense(in_out_size)
])

The quantum circuit uses StronglyEntanglingLayers:

quantum circuit
@qml.qnode(dev, diff_method='best')
def qnode(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.templates.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(wires=i)) for i in range(n_qubits)]
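
For completeness, the names the snippets above rely on but that I haven’t shown (n_qubits, in_out_size, dev, weight_shapes) look roughly like this; the specific values are placeholders:

circuit setup (placeholder values)
import pennylane as qml

n_qubits = 4                       # width of the angle embedding
in_out_size = n_qubits             # number of points encoded per forward pass
n_layers = 2                       # depth of StronglyEntanglingLayers
dev = qml.device("default.qubit", wires=n_qubits)
# StronglyEntanglingLayers expects weights of shape (n_layers, n_qubits, 3)
weight_shapes = {"weights": (n_layers, n_qubits, 3)}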

The gradient method:

gradient method
    def compute_gradients_1st_der_jac(self, inputs):
        with tf.GradientTape(persistent=True) as tape1:
            tape1.watch(inputs)
            outputs = self.model(inputs)
            batch_size = tf.shape(inputs)[0]
            n_features = tf.shape(inputs)[1]
            diagonal_mask = tf.eye(n_features)
        jacobian = tape1.batch_jacobian(outputs, inputs)
        # Keep only the diagonal of each per-sample Jacobian: d(output_i)/d(input_i)
        first_derivatives = tf.reduce_sum(jacobian * diagonal_mask, axis=[2])
        del tape1
        return outputs, first_derivatives, first_derivatives
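
Side note: since the Jacobian returned by batch_jacobian has shape (batch, n_features, n_features) here, the mask-and-sum above should be equivalent to taking the diagonal of each per-sample Jacobian directly:

diagonal extraction (equivalent alternative)
first_derivatives = tf.linalg.diag_part(jacobian)   # shape (batch, n_features)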

When computing derivatives using batch_jacobian:

  • With in_out_size=1: derivatives correctly correspond to spatial derivatives
  • With in_out_size=n_qubits: derivatives don’t match expected spatial derivatives

Question: Why does increasing the input/output dimension affect the derivative computation, even when each batch of the Jacobian appears diagonal (as in TF’s documentation example: Advanced automatic differentiation | TensorFlow Core)?

The same behavior is observed when I replace the quantum layer with tf.keras.layers.Dense(n_qubits, activation="tanh") or when I use tape.gradient instead, so I don’t think it is caused by the quantum circuit or by the kind of derivative computation.

results for the two cases:

Thanks in advance!

I’m implementing a neural network to solve ODEs. I want to encode several points at once and get several results of the ODE (one result per corresponding input) in the same run.
Currently I’m running a toy model where I try to recover u(x) = sin(x) using the loss residual du_dx - cos(x).

The network structure is:

network structure
NN = tf.keras.models.Sequential([
            tf.keras.layers.Input(shape=(self.in_out_size,), name="input"),
            tf.keras.layers.Dense(self.n_nodes, activation="tanh", name="fc_in"),
            tf.keras.layers.Dense(self.n_nodes, activation="tanh", name="fc_mid_1_layer"),
            tf.keras.layers.Dense(self.n_nodes, activation="tanh", name="fc_mid_2_layer"),
            tf.keras.layers.Dense(self.in_out_size, activation=None, name="fc_out")
])

The gradient method:

gradient method
def compute_gradients_1st_der_jac(self, inputs):
    with tf.GradientTape(persistent=True) as tape1:
        tape1.watch(inputs)
        outputs = self.model(inputs)
        batch_size = tf.shape(inputs)[0]
        n_features = tf.shape(inputs)[1]
        diagonal_mask = tf.eye(n_features)
    jacobian = tape1.batch_jacobian(outputs, inputs)
    # I need only the diagonal, because I need the derivative of the i'th output
    # with respect to the i'th input only.
    first_derivatives = tf.reduce_sum(jacobian * diagonal_mask, axis=[2])
    del tape1
    return outputs, first_derivatives

When computing derivatives using batch_jacobian:

  • With in_out_size=1: derivatives correctly correspond to spatial derivatives
  • With in_out_size=n_nodes: derivatives don’t match expected spatial derivatives

Question: Why does increasing the input/output dimension affect the derivative computation, even when each batch of the Jacobian appears diagonal (as in TF’s documentation example: Advanced automatic differentiation | TensorFlow Core)?

I also checked whether the problem is with batch_jacobian itself by using tape.gradient for each input/output pair separately (with an outer loop), and those results are exactly the same as the batch_jacobian results:

gradient using tf.gradient
def calculate_single_gradient(self, input_tensors, input_index, node_index):
    grad_arr = []
    print("input_index: ", input_index, "node_index: ", node_index)
    for i in range(self.n_batches):
        with tf.GradientTape(persistent=True) as tape:
            # Create a copy of the input tensor for this batch
            # Shape: (1, n_nodes) to maintain the required dimensions
            batch_input = tf.convert_to_tensor(
                input_tensors[i:i + 1],
                dtype=tf.float32
            )
            tape.watch(batch_input)

            # Create a "stop gradient" version of the same tensor
            # This tells TensorFlow not to compute gradients through this tensor
            frozen_input = tf.stop_gradient(batch_input)

            # Now create a mixed tensor where only one input is tracked
            # We use the watched tensor for the node we care about
            # and the frozen tensor for all other nodes
            indices = tf.constant([[0, input_index]])  # Shape: (1, 2) for batch and node index
            updates = tf.gather_nd(batch_input, indices)

            mixed_input = tf.tensor_scatter_nd_update(
                frozen_input,  # Base tensor (gradients stopped)
                indices,  # Where to put our watched value
                updates  # The value we want to track
            )

            # Pass this through the model
            # The mixed_input has the same shape as the original
            # but only one value will produce gradients
            outputs = self.model(mixed_input)

            # Get the specific output we want
            target_output = outputs[0, node_index]

        # Calculate gradient
        gradient = tape.gradient(target_output, batch_input)
        grad_arr.append(gradient)

    return grad_arr
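
To make the comparison concrete, here is roughly how the two approaches can be checked against each other (a sketch assuming it runs inside the same class; x is an arbitrary input batch and k an arbitrary input/output index, both just for illustration):

comparison of the two approaches (sketch)
x = tf.random.uniform((self.n_batches, self.n_nodes), dtype=tf.float32)
_, diag = self.compute_gradients_1st_der_jac(x)       # shape (n_batches, n_nodes)
k = 0                                                  # which input/output pair to check
per_pair = self.calculate_single_gradient(x, k, k)     # list of (1, n_nodes) gradients
loop_diag = tf.stack([g[0, k] for g in per_pair])      # k-th component per batch element
tf.debugging.assert_near(diag[:, k], loop_diag)        # the two methods give the same values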

Also, I’ve found that when I add the original_function_loss term to the loss function in the following way, where beta weights the loss between the original function only (beta = 1), the derivative only (beta = 0), or somewhere in between, the network reaches accurate results (which is not surprising) even for beta as small as 1e-7 (which is very surprising!):

ODE_loss_1 = u_t - self.target_fn_dict["d_u_dx"](t)
original_function_loss = u - self.target_fn_dict["u"](t)
square_loss = (self.beta * tf.square(original_function_loss) + (1 - self.beta) * tf.square(ODE_loss_1))
total_loss = tf.reduce_mean(square_loss)
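
For context, this loss sits inside a training step roughly like the sketch below (t, optimizer, and the exact contents of target_fn_dict are assumptions; for this toy problem target_fn_dict would map "u" to tf.sin and "d_u_dx" to tf.cos):

training step (sketch)
with tf.GradientTape() as outer_tape:
    u, u_t = self.compute_gradients_1st_der_jac(t)   # network value and first derivative at t
    ODE_loss_1 = u_t - self.target_fn_dict["d_u_dx"](t)
    original_function_loss = u - self.target_fn_dict["u"](t)
    square_loss = (self.beta * tf.square(original_function_loss)
                   + (1 - self.beta) * tf.square(ODE_loss_1))
    total_loss = tf.reduce_mean(square_loss)
grads = outer_tape.gradient(total_loss, self.model.trainable_variables)
optimizer.apply_gradients(zip(grads, self.model.trainable_variables))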

results for the two cases (without the beta):