I have seen code that uses a `log_softmax` activation in the final Dense layer together with `from_logits=True` in the cross-entropy loss, apparently to get a numerically stable softmax computation. How does this compare to using a linear activation in the Dense layer with `from_logits=True` in the loss? Isn't the softmax effectively applied twice in the first case, since the cross-entropy loss already performs the softmax calculation internally when `from_logits=True`?
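To make the comparison concrete, here is a minimal NumPy sketch of the two setups I am asking about (plain NumPy rather than Keras, with hand-written `softmax`/`log_softmax` helpers standing in for the layer activation and the loss's internal softmax):

```python
import numpy as np

def log_softmax(x):
    # Stable log-softmax: log_softmax(x) = x - logsumexp(x)
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def softmax(x):
    # Stable softmax, as applied internally by the loss when from_logits=True
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # example Dense-layer pre-activations

# Case 1: Dense activation = log_softmax, loss uses from_logits=True,
# so the loss applies softmax on top of log_softmax outputs.
p1 = softmax(log_softmax(logits))

# Case 2: Dense activation = linear, loss uses from_logits=True,
# so the loss applies softmax directly to the raw logits.
p2 = softmax(logits)

print(np.allclose(p1, p2))
```

Since `log_softmax` only shifts the logits by `logsumexp(x)` and softmax is shift-invariant, I would expect both cases to produce the same probabilities, which is what makes me wonder whether the extra `log_softmax` is redundant or serves another purpose.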