Using Sparse Categorical CrossEntropy, the loss becomes negative

I am following the TensorFlow tutorial "Neural machine translation with a Transformer and Keras", but with my own data. My data is not text, but it is still made of sequences of tokens, with a start token and an end token. The content tokens go from 0 to 30 (the start token is 31, the end token is 32).
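To make the setup concrete, this is roughly how the sequences are laid out (the constant names below are just for illustration, not from my actual code):

```python
import numpy as np

# Hypothetical constants matching the description above:
# content tokens are 0..30, the start token is 31, the end token is 32.
START_TOKEN = 31
END_TOKEN = 32
VOCAB_SIZE = 33  # the model has to predict over 33 classes in total

def add_special_tokens(sequence):
    """Wrap a raw token sequence with the start and end tokens."""
    return np.concatenate(([START_TOKEN], sequence, [END_TOKEN]))

print(add_special_tokens(np.array([4, 17, 30])))  # [31  4 17 30 32]
```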

My code is very similar to the one from the tutorial, with very small changes:

I am using sparse categorical cross-entropy, as in the tutorial.
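Concretely, the loss is the masked sparse categorical cross-entropy from the tutorial, roughly like this (reproduced from memory as a sketch, so it may differ slightly from my exact code):

```python
import tensorflow as tf

def masked_loss(label, pred):
    # As in the tutorial, label 0 is treated as padding and excluded.
    mask = label != 0
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')
    loss = loss_object(label, pred)

    mask = tf.cast(mask, dtype=loss.dtype)
    loss *= mask

    # Average only over the non-padded positions.
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)
```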

The problem is that the loss displayed as training progresses becomes negative. For example, right now it shows -2.0593.

When I monitor the loss of some test batches with a callback (using the same sparse categorical cross-entropy), the value returned is never negative; it is usually somewhere between 1.5 and 2.
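The monitoring callback is something like this (a simplified sketch; the class name and the shape of test_batches are just placeholders, not my exact code):

```python
import tensorflow as tf

class TestBatchLossCallback(tf.keras.callbacks.Callback):
    """Recompute the loss on a few held-out batches after every epoch."""

    def __init__(self, test_batches):
        super().__init__()
        self.test_batches = test_batches
        self.loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True)

    def on_epoch_end(self, epoch, logs=None):
        losses = []
        for inputs, labels in self.test_batches:
            preds = self.model(inputs, training=False)
            losses.append(float(self.loss_fn(labels, preds)))
        print(f"epoch {epoch}: mean test-batch loss = "
              f"{sum(losses) / len(losses):.4f}")
```

The values printed here are always in the 1.5 to 2 range, never negative.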

As you can see, the last layer of my Transformer is a Dense layer with no activation function (as in the tutorial), so in the loss I set "from_logits=True". I have also tried adding a softmax activation to that last layer and setting "from_logits=False", but then the model does not seem to train: the loss gets stuck at around 0.274 and never moves.
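To be explicit, these are the two output configurations I am comparing (a sketch; VOCAB_SIZE = 33 is based on my tokens going from 0 to 32):

```python
import tensorflow as tf

VOCAB_SIZE = 33  # tokens 0..32

# Option A (as in the tutorial): raw logits + from_logits=True
logits_head = tf.keras.layers.Dense(VOCAB_SIZE)  # no activation
loss_a = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Option B (what I tried): softmax in the last layer + from_logits=False
probs_head = tf.keras.layers.Dense(VOCAB_SIZE, activation='softmax')
loss_b = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
```

My understanding is that, for the same weights, the two options should give essentially the same loss, with option A being the numerically more stable one, so I don't understand why only option B gets stuck.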

I have no idea why the loss becomes negative during training, when the loss function always seems to output positive numbers when I test it myself.

Overall, the training results are also not good.

Hi @joanmanel, the loss is just a scalar that you are trying to minimize; it does not have to be positive. Thank you.

Thanks @Kiran_Sai_Ramineni. How come, when I set from_logits=False and apply a softmax in the last layer, the loss doesn't change at all?

I am not an expert in LLMs, but as described the problem is difficult to debug, and it is impossible to give you more help without more details.

One somewhat useful test would be to replace your custom data with the data used in the tutorial: does it train well then?
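As a sketch of what I mean (assuming your pipeline can still consume the tutorial's dataset; the dataset name below is the one the tutorial loads, if I remember correctly):

```python
import tensorflow_datasets as tfds

# Load the Portuguese-English dataset from the tutorial and run it through
# the otherwise unchanged training pipeline as a sanity check.
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True, as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']
```

If the loss stays non-negative and decreases on this data, the problem is more likely in your custom data or its tokenization than in the model code.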

In my humble opinion, we need to ask the right questions to narrow the problem down more precisely.