Hi all,
Hope you had a great Christmas!
Currently I’m experimenting a lot with different hyperparameters in a deep learning model. My task is to predict parameters (A, B, C, and D, for example) from a set of X and Y data (each containing n data points).
What I discovered is that applying a batch normalization layer to the output layer makes the model predict much worse (by roughly a factor of 5) than the same model without a normalization layer on the output. Why is that?
See the picture below; the problem occurs when an extra batch normalization layer is added at the position the arrow points to.
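For reference, here is a minimal sketch of the two variants I’m comparing. The layer sizes, input shape, and optimizer are placeholders, not my exact architecture; the only intended difference between the two models is the batch normalization layer after the output:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(bn_on_output: bool) -> tf.keras.Model:
    # Input: n data points with (X, Y) values, flattened to 2*n features.
    # All sizes here are placeholders, not my exact architecture.
    n = 100
    model = models.Sequential()
    model.add(layers.Input(shape=(2 * n,)))
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(4))  # predicts the four parameters A, B, C, D
    if bn_on_output:
        # The extra layer the arrow points at: batch normalization
        # applied on top of the output layer.
        model.add(layers.BatchNormalization())
    model.compile(optimizer="adam", loss="mse")
    return model

good_model = build_model(bn_on_output=False)
bad_model = build_model(bn_on_output=True)  # predicts ~5x worse for me
```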
I would love to hear your answer.
Update
On Stack Overflow this is the only statement I’ve found on this topic: tensorflow - Loss increasing with batch normalization (tf.Keras) - Stack Overflow.
On Quora I’ve found this question: https://www.quora.com/Does-batch-normalization-make-sense-for-regression-problems-or-is-it-mainly-for-classification
Regards,
Stijn