Gradients being computed incorrectly with custom loss function

I am attempting to train an LSTM with a custom loss function. The model’s goal is to read in some sequential noisy points on a curve, and generate coefficients for an nth degree polynomial that fits the curve. The curve has multiple variables, x, y and z, that all depend on time. The model will generate one set of coefficients for each variable such that when plotted over time, the polynomial is a best fit for its respective set of points.

The input to the model is shape (batch_size, sequence_length, num_variables) and it outputs a tensor with shape (batch_size, num_variables, polynomial_degree). My custom loss function compares that output to a set of points known to lie on the polynomial being estimated, and returns a loss tensor with shape (batch_size).
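For concreteness, the shapes look roughly like this (the batch size of 32 and the 3 variables with 10 coefficients each match the code below; the sequence length here is just an illustrative value):

import tensorflow as tf

batch_size, sequence_length = 32, 50   # sequence_length is illustrative only
num_variables, num_coeffs = 3, 10      # matches the Reshape((3, 10)) layer below

example_X = tf.zeros((batch_size, sequence_length, num_variables))  # model input
example_pred = tf.zeros((batch_size, num_variables, num_coeffs))    # model output
# batched_polynomial_loss(y_true, example_pred) returns a tensor of shape (batch_size,)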

Here is my model definition:

model = tf.keras.Sequential()
model.add(tf.keras.layers.Masking(mask_value=float(-10), 
    input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(tf.keras.layers.LSTM(32))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(30,
    kernel_initializer=tf.keras.initializers.zeros()))
model.add(tf.keras.layers.Reshape((3, 10)))

model.compile(loss=batched_polynomial_loss, optimizer=tf.keras.optimizers.Adam())

model.fit(x=train_X, y=train_y, validation_data=val, epochs=5, batch_size=32)

Here is my loss function (tf_power_outer is a replication of numpy.power.outer):

def tf_power_outer(elem):
    # Exponents 0..degree-1, plus vectors of ones used to tile the time values
    # and the exponents into matching (polynomial_model_degree, 100) matrices.
    powers = tf.convert_to_tensor(np.arange(polynomial_model_degree), dtype=tf.float32)
    expand_elem = tf.convert_to_tensor(np.full((polynomial_model_degree), 1.0), dtype=tf.float32)
    expanded_elem = tf.tensordot(expand_elem, elem, axes=0)
    expand_powers = tf.convert_to_tensor(np.full((100), 1.0), dtype=tf.float32)
    expanded_powers = tf.tensordot(powers, expand_powers, axes=0)
    # Entry (i, j) of the result is elem[j] ** i.
    return tf.pow(expanded_elem, expanded_powers)


def batched_polynomial_loss(y_true, y_pred):
    # After the transpose, the last row of each track holds the time values
    # and the remaining rows hold the variable values at those times.
    true_tracks = tf.transpose(y_true, [0, 2, 1])
    tt_shape = tf.shape(true_tracks)
    times = tf.slice(true_tracks, [0, tt_shape[1] - 1, 0], [tt_shape[0], 1, tt_shape[2]])
    times = tf.squeeze(times, axis=1)
    variables = tf.slice(true_tracks, [0, 0, 0], [tt_shape[0], tt_shape[1] - 1, tt_shape[2]])
    # Build a power matrix per track, then evaluate each predicted polynomial
    # at that track's time values.
    time_mats = tf.map_fn(tf_power_outer, times)
    guess_data = tf.matmul(y_pred, time_mats)
    # Per-sample MSE over the points of each variable, then averaged over the
    # variables to give one loss value per track.
    mse = tf.keras.losses.MeanSquaredError(reduction=tf.keras.losses.Reduction.NONE)
    full_error = mse(variables, guess_data)
    reduced_error = tf.math.reduce_mean(full_error, axis=1)
    return reduced_error
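
As a sanity check on tf_power_outer, it does appear to match numpy.power.outer up to a transpose; here is a quick comparison, assuming polynomial_model_degree = 10 (to match the Reshape((3, 10)) output) and 100 points per track:

import numpy as np
import tensorflow as tf

polynomial_model_degree = 10  # assumed, to match the Reshape((3, 10)) output

t = np.linspace(0.1, 1.0, 100).astype(np.float32)
tf_mat = tf_power_outer(tf.constant(t))                           # shape (10, 100)
np_mat = np.power.outer(t, np.arange(polynomial_model_degree)).T  # shape (10, 100)
print(np.allclose(tf_mat.numpy(), np_mat))                        # True; entry (i, j) is t[j] ** i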

It was not working with model.fit, so I wrote this training loop to inspect it more closely:

opt = tf.keras.optimizers.Adam()
def step(X, y):
    with tf.GradientTape() as tape:
        pred = model(X)
        loss = batched_polynomial_loss(y, pred)
    print(loss)
    grads = tape.gradient(loss, model.trainable_variables)
    print(grads)
    opt.apply_gradients(zip(grads, model.trainable_variables))


step(train_X[0:32], train_y[0:32])
step(train_X[32:64], train_y[32:64])
step(train_X[64:96], train_y[64:96])

On the first iteration, loss looks like:

tf.Tensor(
[0.43734643 0.4373459 0.43734697 0.4373469 0.43734667 0.4373454
0.4373459 0.4373479 0.43734646 0.43734646 0.4373469 0.43734574
0.43734762 0.43734512 0.43734625 0.4373463 0.4373455 0.4373475
0.43734655 0.43734694 0.43734714 0.43734702 0.43734705 0.4373466
0.43734702 0.43734536 0.43734682 0.43734637 0.43734694 0.4373466
0.43734622 0.4373466 ], shape=(32,), dtype=float32)

and grads looks like:

[<tf.Tensor: shape=(11, 128), dtype=float32, numpy=
array([[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
…,
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.]], dtype=float32)>, <tf.Tensor: shape=(32, 128), dtype=float32, numpy=
array([[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
…,
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.]], dtype=float32)>, <tf.Tensor: shape=(128,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>, <tf.Tensor: shape=(32, 64), dtype=float32, numpy=
array([[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
…,
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.]], dtype=float32)>, <tf.Tensor: shape=(64,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>, <tf.Tensor: shape=(64, 30), dtype=float32, numpy=
array([[ 8.4006101e-02, -5.3240155e+02, -1.1512082e+06, …,
-4.7750685e+22, -8.7685334e+25, -1.6336383e+29],
[ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, …,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
[ 1.1525370e-02, -7.3044151e+01, -1.5794281e+05, …,
-6.5512637e+21, -1.2030189e+25, -2.2413071e+28],
…,
[ 3.0950422e-04, -1.9615617e+00, -4.2414673e+03, …,
-1.7593081e+20, -3.2306457e+23, -6.0189166e+26],
[ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, …,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
[ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, …,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00]], dtype=float32)>, <tf.Tensor: shape=(30,), dtype=float32, numpy=
array([ 6.0670018e-01, -3.8450579e+03, -8.3141405e+06, -1.5660871e+10,
-2.8943713e+13, -5.3658182e+16, -1.0029243e+20, -1.8912737e+23,
-3.5962617e+26, -6.8893381e+29, 1.6471731e+01, 1.5045589e+04,
1.8590212e+07, 2.6241573e+10, 4.0149233e+13, 6.4911980e+16,
1.0927340e+20, 1.8970660e+23, 3.3739135e+26, 6.1173324e+29,
-1.5370066e+01, -1.7797381e+04, -2.5041262e+07, -3.8708691e+10,
-6.3491046e+13, -1.0852857e+17, -1.9123389e+20, -3.4486026e+23,
-6.3327237e+26, -1.1798300e+30], dtype=float32)>]

In the second iteration, loss and grads have the same shapes, but every single value in both is “nan”.

I don’t see anything wrong with the loss computed on the first iteration; an array of values all around 0.43 seems normal for an untrained model. But the gradients computed from it are enormous (I see one over 1e+30). Is my loss function structured incorrectly? Why are the gradients coming out so strangely?

Hello @rutledgea20

In the code snippet provided, instead of using tf.tensordot to expand and multiply the powers, I think you can use broadcasting directly to build the power matrix:

expanded_elem = tf.reshape(elem, (1, -1))      # time values as a row: (1, num_points)
expanded_powers = tf.reshape(powers, (-1, 1))  # exponents as a column: (degree, 1)
return tf.pow(expanded_elem, expanded_powers)  # broadcasts to (degree, num_points)
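
Putting that together, a minimal drop-in sketch of tf_power_outer along those lines (assuming polynomial_model_degree is defined as in your code, and without the hard-coded 100 points):

def tf_power_outer(elem):
    # Exponents 0..degree-1 as a column, time values as a row; tf.pow
    # broadcasts them into the (degree, num_points) power matrix.
    powers = tf.range(polynomial_model_degree, dtype=elem.dtype)
    return tf.pow(tf.reshape(elem, (1, -1)), tf.reshape(powers, (-1, 1)))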

If the gradients are still None, it might be due to an operation in the loss function that doesn’t support gradient computation, or to a dimension mismatch that prevents gradients from propagating. To track this down, you can add log statements and inspect the output of each operation to find the bug.
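
As a rough sketch of that kind of inspection (names and structure here are illustrative, written against the manual training step from the question), you can take the gradient with respect to the model output and wrap the loss in tf.debugging.check_numerics to see whether the problem sits inside the loss function or inside the model:

def debug_step(X, y):
    with tf.GradientTape() as tape:
        pred = model(X)
        loss = batched_polynomial_loss(y, pred)
        # Raises immediately if any loss value is NaN or Inf.
        loss = tf.debugging.check_numerics(loss, "batched_polynomial_loss")
    # Gradient of the loss with respect to the model output: if these values
    # are already huge or NaN, the problem is in the loss function rather
    # than in the model itself.
    print(tape.gradient(loss, pred))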

Thank you.