I’m trying to implement FasterRCNN in Keras and am having enormous difficulty doing so. I believe I am encountering some sort of numerical stability issue. When I try to train only the region proposal network (VGG16 plus a couple of convolution layers tacked on, producing two output maps) on a small subset of images, training progress, as measured by recall, is very slow and oscillates quite a bit.
When I enable clipnorm, with values like 8, 4, and 1, behavior improves. I don’t see losses exploding or turning into NaNs in any case, and I am at a loss as to how to debug the issue.
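For reference, this is roughly how I’m enabling it (the SGD settings other than clipnorm are just illustrative, not my exact hyperparameters):

```python
import tensorflow as tf

# clipnorm clips the L2 norm of each gradient tensor individually before the
# update step; I tried values of 8, 4, and 1. Learning rate and momentum here
# are illustrative placeholders.
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9, clipnorm=1.0)
```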
I suspect the issue must involve my loss functions, but they are relatively simple (apart from a lot of reshaping of tensors). What would be a good way to debug this?
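One thing I plan to try is TF’s built-in numeric checking, which should catch a NaN/Inf at the op that produces it. A sketch (rpn_class_loss here is a hypothetical stand-in for one of my actual losses):

```python
import tensorflow as tf

# Global switch: inserts NaN/Inf checks after nearly every op and raises with
# the name of the first offending op. Very slow, so only use it on a tiny run.
tf.debugging.enable_check_numerics()

# Targeted alternative: wrap a suspect tensor inside a loss function.
# rpn_class_loss is a hypothetical stand-in, not my real loss.
def rpn_class_loss(y_true, y_pred):
    loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return tf.debugging.check_numerics(loss, message="NaN/Inf in rpn_class_loss")
```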
I’ve thought of printing out the gradient norm per training sample, but I’m not sure how to obtain the gradient in the first place. Most of the examples I’ve seen online no longer work with TF 2.0.
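The closest TF 2.x pattern I’ve pieced together is a manual training step with tf.GradientTape, something like the sketch below (model, loss_fn, optimizer, and dataset are assumed to already exist; I’m not sure this is idiomatic):

```python
import tensorflow as tf

# Manual training step so the per-sample gradient is available to inspect.
@tf.function
def debug_step(x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        loss = loss_fn(y, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(
        (g, v) for g, v in zip(grads, model.trainable_variables) if g is not None
    )
    # Global L2 norm over all gradient tensors for this sample
    return tf.linalg.global_norm([g for g in grads if g is not None])

for x, y in dataset:  # batch size 1 gives a per-sample norm
    tf.print("grad norm:", debug_step(x, y))
```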
Is that based on FasterRCNN? I’m trying to replicate it on Pascal VOC2007. I did test against a simple PyTorch version, which converges quickly, as expected. I don’t see a major difference in my code, except that my ground truth values are stored in a single large map, so my loss functions have to perform a bit of tensor slicing and reshaping to get at the y_true values.
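Roughly, the unpacking in my losses looks like the sketch below; the channel layout shown is illustrative, not my exact format:

```python
import tensorflow as tf

# Hypothetical sketch of the slicing my losses do. Assume y_true is one packed
# map of shape (batch, H, W, num_anchors * 6): per anchor, channel 0 is a
# valid-anchor mask, channel 1 is objectness, channels 2-5 are box targets.
def unpack_ground_truth(y_true, num_anchors=9):
    s = tf.shape(y_true)
    y_true = tf.reshape(y_true, (s[0], s[1], s[2], num_anchors, 6))
    valid = y_true[..., 0]         # 1 where the anchor contributes to the loss
    objectness = y_true[..., 1]    # 1 for object anchors, 0 for background
    box_targets = y_true[..., 2:]  # (tx, ty, tw, th) regression targets
    return valid, objectness, box_targets
```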
Thanks, I will try to dig in and see if I can run the RPN portion but my objective is really to replicate this on my own. I have an implementation that does learn but the mAP is low and convergence takes much longer than it should. I suspect numerical instability.
The chief differences in my implementation are:
A VGG16 backbone (none of the TF implementations I’ve seen use this).
A single class output per anchor in the RPN (an objectness score) with a sigmoid, rather than two outputs (object score, background score) with a softmax; the background score is unused by the model anyway. See the sketch after this list.
My ground truth values come in a more complex form because I pass a single large map stuffed with class and regression targets.
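Roughly, the head looks like this (a sketch; layer names and the anchor count are illustrative, not my exact code):

```python
import tensorflow as tf
from tensorflow.keras import layers

num_anchors = 9  # e.g. 3 scales x 3 aspect ratios

# VGG16 backbone up to the last conv block (stride-16 feature map)
backbone = tf.keras.applications.VGG16(include_top=False, input_shape=(None, None, 3))
features = backbone.get_layer("block5_conv3").output

x = layers.Conv2D(512, 3, padding="same", activation="relu", name="rpn_conv")(features)
# One sigmoid objectness score per anchor instead of a two-way softmax
objectness = layers.Conv2D(num_anchors, 1, activation="sigmoid", name="rpn_cls")(x)
# Four box regression outputs (tx, ty, tw, th) per anchor
regression = layers.Conv2D(4 * num_anchors, 1, name="rpn_reg")(x)

rpn = tf.keras.Model(backbone.input, [objectness, regression], name="rpn")
```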
I think it would be helpful to be able to print out the norm of the gradient, but is there an example of how to do this with the current Keras API?
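From my reading of the docs, overriding Model.train_step might be the cleanest way; here is a sketch under that assumption, with my actual model and losses assumed to be compiled in as usual:

```python
import tensorflow as tf

# Subclass that logs the global gradient norm through the normal fit() loop.
class GradNormModel(tf.keras.Model):
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(
            (g, v) for g, v in zip(grads, self.trainable_variables) if g is not None
        )
        self.compiled_metrics.update_state(y, y_pred)
        results = {m.name: m.result() for m in self.metrics}
        # Shows up alongside the loss in the fit() progress bar
        results["grad_norm"] = tf.linalg.global_norm([g for g in grads if g is not None])
        return results
```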
If I can’t resolve this in the next day or two, I will try to write a minimal RPN implementation in both Keras and PyTorch to confirm whether the problem is Keras.