Hi,
I have been teaching myself the mathematics of backpropagation and have been using Keras to check my results for errors.
Using a simple model with an input layer with 1 input, a dense layer with 1 neuron (d1), a dense layer with 1 neuron (d2), and an output layer with 1 neuron (o1), I expected the following calculation to be performed:
(error derivative) * (o1 activation derivative) * (d2 output value) * (d2 activation derivative) * (d1 output value) * (d1 activation derivative) * (input value)
Instead, the result comes from the following calculation:
(error derivative) * (o1 activation derivative) * (d1 output value) * (d2 activation derivative) * (d1 activation derivative)
No matter how many layers are added in a row, only the final output layer's calculation uses the output value from the previous neuron connected to it. Why?
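One way to read the raw gradients out of Keras for this model is with tf.GradientTape; a minimal sketch (the sigmoid activations and MSE loss here are just example choices, and the layer names follow the ones above):

```python
# Sketch: inspect the per-weight gradients Keras computes for the
# 1 input -> d1(1) -> d2(1) -> o1(1) model, for a single sample.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1, activation='sigmoid', name='d1'),
    tf.keras.layers.Dense(1, activation='sigmoid', name='d2'),
    tf.keras.layers.Dense(1, activation='sigmoid', name='o1'),
])

x = tf.constant([[0.5]])  # one sample (batch size = 1)
y = tf.constant([[1.0]])

with tf.GradientTape() as tape:
    pred = model(x)
    loss = tf.keras.losses.MeanSquaredError()(y, pred)

# One gradient tensor per kernel/bias, in layer order (d1, d2, o1).
grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    print(var.name, grad.numpy())
```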
Having understood that this happens and adjusted for it, I moved on to putting more neurons in each layer with the following model:
an input layer with 3 inputs, a dense layer with 1 neuron (d1), a dense layer with 4 neurons (d2), and an output layer with 2 neurons (o1). I expected the following calculations to be performed:
foreach output neuron:
    gradient = (error derivative) * (o1 activation derivative) * (d2 output value)

foreach d2 neuron:
    gradient = (sum of gradient from each output neuron) * (d2 activation derivative)

finally:
    (sum of gradient from each d2 neuron) * (d1 activation derivative)
Instead, in the final calculation I am seeing:
((sum of gradient from each d2 neuron) / (number of input neurons)) * (d1 activation derivative)
Why is there a division when calculating the gradients for the weights from the input layer?
Note: batch size = 1, epochs = 1.
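The same check applied to the second model looks roughly like this (again, sigmoid activations and MSE are only example choices); the gradient in question is d1's kernel gradient of shape (3, 1), i.e. the weights coming from the input layer:

```python
# Sketch: inspect the gradients for the 3 inputs -> d1(1) -> d2(4) -> o1(2) model.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(1, activation='sigmoid', name='d1'),
    tf.keras.layers.Dense(4, activation='sigmoid', name='d2'),
    tf.keras.layers.Dense(2, activation='sigmoid', name='o1'),
])

x = tf.constant([[0.1, 0.2, 0.3]])  # one sample, 3 inputs (batch size = 1)
y = tf.constant([[1.0, 0.0]])       # one target per output neuron

with tf.GradientTape() as tape:
    pred = model(x)
    loss = tf.keras.losses.MeanSquaredError()(y, pred)

grads = tape.gradient(loss, model.trainable_variables)

# grads[0] is d1's kernel gradient, shape (3, 1): the gradients for the
# weights from the 3 inputs to d1, which are the values in question.
print(grads[0].numpy())
```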