When trying to find the optimal batch size, I run multiple experiments, gradually increasing the batch size.
Often I notice that GPU memory is almost full (only a few MB left), yet I can still increase the batch size by ~25% without hitting an out-of-memory error, and the reported cache usage stays the same.
Example: batch_size=10, GPU memory almost full.
batch_size=13, GPU memory still almost full, and training runs fine.
Why is that? Is it a good idea to go with 13 instead of 10?
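In case it helps, here is a minimal sketch of how I compare memory per batch size. I'm assuming PyTorch and its torch.cuda memory APIs here; the model and tensor sizes are just placeholders:

```python
import torch
import torch.nn as nn

# Placeholder model and optimizer; substitute your own.
model = nn.Linear(4096, 4096).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def report(tag):
    # memory_allocated: bytes currently held by live tensors.
    # memory_reserved: bytes the caching allocator has claimed from the GPU
    # (roughly what nvidia-smi shows); it can exceed memory_allocated.
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.1f} MiB, reserved={reserved:.1f} MiB")

for batch_size in (10, 13):
    torch.cuda.empty_cache()               # release unused cached blocks between trials
    torch.cuda.reset_peak_memory_stats()   # so max_memory_allocated is per-trial
    x = torch.randn(batch_size, 4096, device="cuda")
    loss = model(x).sum()
    loss.backward()
    opt.step(); opt.zero_grad()
    report(f"batch_size={batch_size}")
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"  peak allocated={peak:.1f} MiB")
```

With this I see the reserved number stay roughly constant across the two batch sizes, which is what I meant by "cache usage will be the same" above.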