Yes, absolutely.
But I guess the main thing is that it turns out I just don’t really understand stochastic depth.
Your code and the original implementation are doing the same thing. And now I’m just trying to understand why.
Original:
if self.train then
   if self.gate then -- only compute convolutional output when gate is open
      self.output:add(self.net:forward(input))
   end
else
   self.output:add(self.net:forward(input):mul(1-self.deathRate))
end
Yours:
if training:
    keep_prob = 1 - self.drop_prob
    shape = (tf.shape(x)[0],) + (1,) * (len(tf.shape(x)) - 1)
    random_tensor = keep_prob + tf.random.uniform(shape, 0, 1)
    random_tensor = tf.floor(random_tensor)
    return (x / keep_prob) * random_tensor
return x
The part I don’t understand is the keep_prob scaling.
# Inference branch (original):
self.output:add(self.net:forward(input):mul(1-self.deathRate))
# Training branch (yours):
return (x / keep_prob) * random_tensor
These accomplish the same thing in expectation (one scales down at inference, the other scales up during training), but I don’t understand why the scaling is there at all.
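To spell out the equivalence I’m claiming, here’s a tiny numeric sketch (the death rate and the branch output are made-up numbers, and I’m pretending the residual branch produces a scalar):

death_rate = 0.5             # made-up; probability of skipping the block
keep_prob = 1 - death_rate
branch_out = 2.0             # pretend self.net:forward(input) / the residual branch gives 2.0

# Original (Torch): no scaling in training, mul(1 - deathRate) at inference.
expected_train_torch = keep_prob * branch_out + death_rate * 0.0   # average over the gate -> 1.0
inference_torch = branch_out * keep_prob                           # -> 1.0, matches training

# Yours (inverted): divide by keep_prob in training, return x unchanged at inference.
expected_train_tf = keep_prob * (branch_out / keep_prob) + death_rate * 0.0   # -> 2.0
inference_tf = branch_out                                                     # -> 2.0, matches training

print(expected_train_torch, inference_torch, expected_train_tf, inference_tf)  # 1.0 1.0 2.0 2.0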
I understand the argument for this scaling in dropout: “Show me half the pixels twice as bright during training and then all the pixels for inference.”
But I’m less comfortable applying this logic to the entire example: “skip the operation or do it twice as hard, and for inference do it with regular strength.”
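To make “the entire example” concrete: in your code the mask has shape (batch_size, 1, 1, ...), so the coin flip happens once per example rather than per element, and each example’s whole residual output is either kept (and scaled up) or zeroed. A quick sketch with made-up sizes:

import tensorflow as tf

x = tf.random.normal((8, 16, 16, 32))   # made-up batch of activations
drop_prob = 0.5
keep_prob = 1 - drop_prob

# Same mask construction as in your snippet:
shape = (tf.shape(x)[0],) + (1,) * (len(tf.shape(x)) - 1)   # -> (8, 1, 1, 1)
random_tensor = tf.floor(keep_prob + tf.random.uniform(shape, 0, 1))

print(random_tensor.shape)                       # (8, 1, 1, 1): one 0/1 flag per example
print(tf.reshape(random_tensor, [-1]).numpy())   # e.g. [1. 0. 1. 1. 0. 1. 0. 1.]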
But maybe I can understand it with the “the layers of a resnet are like a gradient vector field pushing the embedding towards the answer” interpretation. I guess if I’m taking fewer steps, each one could be larger.
Which factor? Could you provide a short snippet?
My little experiment was, in your code, to just replace this line:
return (x / keep_prob) * random_tensor
with:
return x * random_tensor
I’ll run it a few more times and see what happens.
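For reference, here’s my back-of-the-envelope for what that change should do (made-up numbers, and assuming x here plays the role of the residual branch output, i.e. self.net:forward(input) in the Torch code): without the /keep_prob the branch is weaker on average during training than at inference, since the inference path still returns x at full strength.

keep_prob = 0.8
branch_out = 2.0   # made-up residual-branch output for one example

# Your line: (x / keep_prob) * random_tensor
train_scaled = keep_prob * (branch_out / keep_prob)   # expected over the coin flip -> 2.0
infer_scaled = branch_out                             # inference returns x unchanged -> 2.0 (matches)

# My experiment: x * random_tensor
train_unscaled = keep_prob * branch_out               # -> 1.6
infer_unscaled = branch_out                           # -> 2.0 (no longer matches training)

print(train_scaled, infer_scaled, train_unscaled, infer_unscaled)   # 2.0 2.0 1.6 2.0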