Applying Batch Normalization on the output layer in a regression problem

Hi all,

Hope you had a great Christmas!

Currently I’m experimenting a lot with different hyperparameters in a deep learning model. My task is to predict parameters (A, B, C, D, for example) from a set of X and Y data (each containing n data points).

So, what I discovered is that applying a batch normalization layer to the output layer makes the model predict much worse (by a factor of roughly 5x) than without it. Why is that?

See the picture below; the extra batch normalization layer is added at the position the arrow points to.
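To illustrate what I mean, here is a minimal NumPy sketch (ignoring the learnable gamma/beta for clarity) of what batch normalization does to whatever layer it follows — it forces the batch to roughly zero mean and unit variance, which seems at odds with regression targets that live on their own scale:

```python
import numpy as np

# Hypothetical regression outputs for one batch: the targets live around 5.0
raw_outputs = np.array([4.8, 5.1, 5.3, 4.9, 5.2])

# Batch normalization, simplified (no learnable gamma/beta):
# subtract the batch mean and divide by the batch standard deviation.
eps = 1e-3
normalized = (raw_outputs - raw_outputs.mean()) / np.sqrt(raw_outputs.var() + eps)

print(normalized.mean())  # ~0: the batch mean is forced to zero
print(normalized.std())   # ~1: the batch spread is forced towards one
```

The gamma/beta parameters can in principle learn the scale back, but the layer first erases whatever scale the network produced.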

I would love to hear your answer.

Update
On Stack Overflow this is the only statement I’ve found on this topic: tensorflow - Loss increasing with batch normalization (tf.Keras) - Stack Overflow.

On Quora I’ve found this question: https://www.quora.com/Does-batch-normalization-make-sense-for-regression-problems-or-is-it-mainly-for-classification

Regards,
Stijn

Hi Stijn,

I think it’s quite hard to explain why Batch Normalization gives a bad result on your data when applied in the last layer.

What helps me understand this kind of behaviour is that when you add a layer, it affects how much information the model can keep. In your model you are basically using only Dense layers, and normalization layers try to decrease sensitivity to changes in the input. Maybe you removed so much sensitivity that the model is not able to learn the nuances of your data. But to get a better understanding of the effects, I would need much more information about your data, and some study.

Why specifically are you adding so many BN layers?

Hi Igusm,

Thanks for your response.

I train the neural network with N (mostly 10000) simulated datasets, each consisting of 1000 data points. Each dataset consists of a set of labels [A, B, C, D] with features [X, Y]. See the (100) datasets graphed below.

The task of my neural network is to predict, for each dataset [X, Y] (i.e. each line in this graph), the labels [A, B, C and D]. X and Y are one-dimensional with a length of 1000 (len(X) = 1000). I concatenate them, resulting in an input layer of 2000. Both X and Y are already normalized separately. An example of a fit parameter… that goes pretty well actually, but some parameters are harder to predict than others.
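For clarity, the data layout can be sketched like this (shapes only, with random values standing in for my simulated data):

```python
import numpy as np

n_datasets, n_points = 100, 1000  # 10000 datasets in the real runs; 100 here

# Simulated features: X and Y are each (n_datasets, 1000), already normalized
X = np.random.randn(n_datasets, n_points)
Y = np.random.randn(n_datasets, n_points)

# Concatenate along the feature axis -> one input vector of length 2000 per dataset
features = np.concatenate([X, Y], axis=1)

# Four regression targets per dataset: [A, B, C, D]
labels = np.random.randn(n_datasets, 4)

print(features.shape)  # (100, 2000)
print(labels.shape)    # (100, 4)
```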

Somewhat less good…

And how well does it predict now?

The goal is to eventually make it increasingly difficult by adding (more and more) noise up to a certain (realistic) limit.

Another model I also fit with, but which is not necessarily better, is (see the BN layer at the end):

I was not aware that BN layers are capable of lowering sensitivity… I added them because I train with a low batch size (min 32, max 128), and BN layers make the training process faster. I thought it was necessary to add so many layers because the assignment seems kinda hard to me (I’m not really experienced). I’m also trying to use around 2/3 of my input layer size as the total number of neurons (2/3 * 2000 ≈ 1333).

I have been reading that BN combines badly with too-low batch sizes. What batch sizes would you recommend? I’m now sweeping through batch sizes 16, 32, 64, 128, 256.
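The sweep I’m running looks roughly like this (a sketch with a tiny stand-in model and random data, not my real architecture):

```python
import numpy as np
import tensorflow as tf

def build_model(input_dim=2000, n_outputs=4):
    """Small hypothetical stand-in for the real architecture."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_outputs),  # linear output, no BN here
    ])

# Tiny random data just to make the sweep runnable
X = np.random.randn(256, 2000).astype("float32")
y = np.random.randn(256, 4).astype("float32")

results = {}
for batch_size in [16, 32, 64, 128, 256]:
    model = build_model()
    model.compile(optimizer="adam", loss="mse")
    history = model.fit(X, y, batch_size=batch_size, epochs=1,
                        validation_split=0.25, verbose=0)
    results[batch_size] = history.history["val_loss"][-1]

print(results)  # validation loss per batch size
```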

I’d love to hear from you again.

Update
Adding that last batch norm layer makes little difference. I cheered too early…
Regards,
Stijn

I don’t have a good recommendation for batch size; again, that depends on many things, for example how much you want to utilize your resources. If you want to use as much as possible, increasing the batch size is an option.

For more information on BN: tf.keras.layers.BatchNormalization  |  TensorFlow v2.16.1

Creating a model may take a lot of experimentation.
I’d try some versions without any BN and with fewer Dense layers, just to get a proper understanding of how your model is behaving. Do it until you can overfit on the data. From there, try to fix the overfitting and you may get to a good architecture. This is just one way of doing it; there might be others.
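As a sketch of what I mean (the layer sizes here are just placeholders, not a recommendation):

```python
import tensorflow as tf

# Step 1: a deliberately simple model -- no BN, few Dense layers --
# trained until it can (nearly) memorize the training data.
bare = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2000,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(4),
])

# Step 2: once it clearly overfits, counter that with regularization
# rather than more depth, e.g. dropout between the hidden layers.
regularized = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2000,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(4),
])
```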

Some information on overfitting here: https://www.tensorflow.org/tutorials/keras/overfit_and_underfit

Hi Igusm,

Thanks.

My model never seems to overfit… It’s just that if I make a complex model, it overshoots. Training it longer, the validation error and training error just stay constant, at approximately a single value. I’m applying K-fold validation with K=4; the mean sum of all the errors is the thing I’m trying to optimize.
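For reference, the K=4 scheme I use can be sketched like this (NumPy only, with a toy error function standing in for the real model):

```python
import numpy as np

def kfold_mean_error(features, labels, k, error_fn):
    """Split the data into k folds, evaluate error_fn with each fold
    held out in turn, and return the mean error across folds."""
    idx = np.arange(len(features))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(error_fn(features[train_idx], labels[train_idx],
                               features[val_idx], labels[val_idx]))
    return float(np.mean(errors))

# Toy stand-in: "error" = MSE of always predicting the training-label mean
def mean_predictor_mse(Xtr, ytr, Xval, yval):
    pred = ytr.mean(axis=0)
    return float(((yval - pred) ** 2).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.normal(size=(100, 4))
score = kfold_mean_error(X, y, k=4, error_fn=mean_predictor_mse)
print(score)  # mean error over the 4 folds
```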

Regards,
Stijn

Hi Igusm,

So, now I’ve created a simple variant and a complex variant. Surprisingly, it turns out I just need one layer and it fits perfectly. So your thought that I was overfitting was probably correct… although I don’t see an increase of my validation error as a function of epoch in more complex models.

Do you have any recommendations or things I have to keep in mind when I introduce noise to my data and then let the model fit? Do you think it’s intuitively correct that I’d have to add more neurons or layers, for example? Or to generate more datasets for training via my simulation?
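For the noise step, this is the kind of thing I have in mind (a sketch; the training call is hypothetical):

```python
import numpy as np

def add_noise(features, noise_level, rng):
    """Corrupt simulated features with Gaussian noise whose strength is a
    fraction of each feature's own spread, so the difficulty can be ramped
    up gradually towards a realistic limit."""
    scale = noise_level * features.std(axis=0, keepdims=True)
    return features + rng.normal(0.0, scale, size=features.shape)

rng = np.random.default_rng(42)
clean = rng.normal(size=(100, 2000))

# Curriculum: retrain (or fine-tune) at progressively higher noise levels
for noise_level in [0.0, 0.1, 0.2, 0.5]:
    noisy = add_noise(clean, noise_level, rng)
    # model.fit(noisy, labels, ...)  # hypothetical training call
    print(noise_level, float(np.abs(noisy - clean).mean()))
```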

Regards,
Stijn