I am following the TensorFlow tutorial "Neural machine translation with a Transformer and Keras", but with my own data. My data is not text, but it is still sequences of tokens, with a start token and an end token. The regular tokens go from 0 to 30 (the start token is 31, the end token is 32).
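To make the data format concrete, here is a minimal sketch of how one sequence is wrapped with the start and end tokens (the constant names and the helper function are only for illustration, not my actual code):

```python
import numpy as np

# Illustrative constants: regular tokens are 0-30, plus a start token (31)
# and an end token (32), so the vocabulary has 33 ids in total.
START_TOKEN = 31
END_TOKEN = 32
VOCAB_SIZE = 33

def make_example(tokens):
    """Wrap a raw token sequence with the start and end markers."""
    return np.array([START_TOKEN, *tokens, END_TOKEN], dtype=np.int64)

# e.g. make_example([4, 17, 2]) -> [31, 4, 17, 2, 32]
```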
My code is very similar to the tutorial's, with only very small changes. As in the tutorial, I am using sparse categorical cross-entropy as the loss.
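For reference, my loss follows the tutorial's masked sparse categorical cross-entropy, which looks roughly like the sketch below (note that the tutorial treats token id 0 as padding):

```python
import tensorflow as tf

def masked_loss(label, pred):
    # Sparse categorical cross-entropy on logits, with no reduction so the
    # padding mask can be applied per position.
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')
    loss = loss_object(label, pred)

    # The tutorial assumes token id 0 is padding and masks it out.
    mask = tf.cast(label != 0, dtype=loss.dtype)
    loss *= mask

    # Average only over the non-padded positions.
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)
```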
My problem is that the loss displayed during training becomes negative; for example, right now it reads -2.0593.
However, when I monitor the loss on some test batches with a callback, the returned value is never negative; it is usually somewhere between 1.5 and 2.
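The callback is roughly like this simplified sketch (the names test_batches and masked_loss stand in for my dataset and loss function; the batch structure follows the tutorial's ((context, x), labels) layout):

```python
import tensorflow as tf

class LossCheckCallback(tf.keras.callbacks.Callback):
    """Recompute the loss on a few held-out batches at the end of each epoch."""

    def __init__(self, test_batches, loss_fn):
        super().__init__()
        self.test_batches = test_batches
        self.loss_fn = loss_fn

    def on_epoch_end(self, epoch, logs=None):
        for (context, x), labels in self.test_batches.take(2):
            preds = self.model((context, x), training=False)
            loss = self.loss_fn(labels, preds)
            tf.print('epoch', epoch, 'recomputed loss:', loss)
```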
As in the tutorial, the last layer of my Transformer is a Dense layer without an activation function, so I set "from_logits=True" in the loss. I have also tried using a softmax activation in that last layer and setting "from_logits=False", but then the model does not seem to train: the loss gets stuck around 0.274 and never moves.
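Concretely, the two output configurations I have tried look roughly like this (VOCAB_SIZE is 33 here: tokens 0-30 plus the start and end tokens; the rest of the model is omitted):

```python
import tensorflow as tf

VOCAB_SIZE = 33  # tokens 0-30 plus start (31) and end (32)

# Configuration 1 (as in the tutorial): no activation on the final layer,
# so the model outputs raw logits and the loss uses from_logits=True.
final_layer_logits = tf.keras.layers.Dense(VOCAB_SIZE)
loss_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Configuration 2 (what I tried instead): softmax on the final layer,
# so the model outputs probabilities and the loss uses from_logits=False.
final_layer_probs = tf.keras.layers.Dense(VOCAB_SIZE, activation='softmax')
loss_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
```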
I have no idea why the displayed loss becomes negative, when the loss function always seems to return positive values whenever I test it directly.
Overall, the training results are also not good.