Background
I am part of a team developing a TensorFlow model trained to emulate a physical model. This CNN-based model takes physical parameters and calculates physical properties such as flow or temperature. Part of the capability involves using automatic differentiation of the forward model to tackle the inverse problem, i.e. inferring physical model parameters from observations. While the forward model emulator works very well, I have been trying to diagnose problems with the inverse capability, which performs far less well than the emulated physical model. One test that I thought would be very simple was to compare the gradients I get from the `GradientTape` with gradients calculated by finite differencing of small perturbations in my parameters, i.e. a sort of brute-force approach. Ultimately these gradients are used to minimize a cost function, which represents the misfit between the forward model and observations, so it is clearly very important that these gradients are accurate in order to find a good minimum.

```python
import numpy as np
import tensorflow as tf

def fwd_wrapper(params, state, i, j, h):
    # Perturb one parameter, evaluate the cost, then undo the perturbation
    state.slidingco[i, j].assign(state.slidingco[i, j] + h)
    fieldin = [vars(state)[f] for f in params.iflo_fieldin]
    Y = state.forwardModel(fieldin)
    cost = misfit(params, state)
    state.slidingco[i, j].assign(state.slidingco[i, j] - h)
    return cost
```
```python
def compareGradients(params, state):
    # Gradient from automatic differentiation (no perturbation: h = 0)
    with tf.GradientTape() as tape:
        tape.watch(state.slidingco)
        cost = fwd_wrapper(params, state, 0, 0, 0)
    grad = tape.gradient(cost, state.slidingco)

    # Brute-force central differences, one parameter at a time
    Ny, Nx = state.slidingco.shape
    h = 1e-4
    dJdC = np.zeros((Ny, Nx))
    for i in range(Ny):
        for j in range(Nx):
            J1 = fwd_wrapper(params, state, i, j, h)
            J2 = fwd_wrapper(params, state, i, j, -h)
            dJdC[i, j] = (J1 - J2) / (2.0 * h)
```
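As a sanity check of the comparison harness itself, the same pattern can be tried on a toy cost with a known gradient. This is not the author's model, just a stand-in (`J(x) = sum(sin(x) * x**2)`) to confirm that `GradientTape` and central differences do agree closely when everything is computed in float64:

```python
import numpy as np
import tensorflow as tf

# Toy stand-in cost with a known gradient: J(x) = sum(sin(x) * x**2)
x = tf.Variable([1.0, 2.0, 3.0], dtype=tf.float64)

def cost_fn(t):
    return tf.reduce_sum(tf.sin(t) * t ** 2)

# Gradient from automatic differentiation
with tf.GradientTape() as tape:
    cost = cost_fn(x)
grad = tape.gradient(cost, x).numpy()

# Central differences, one component at a time
h = 1e-6
fd = np.zeros_like(grad)
for i in range(len(fd)):
    e = np.zeros_like(grad)
    e[i] = h
    fd[i] = (cost_fn(tf.constant(x.numpy() + e)).numpy()
             - cost_fn(tf.constant(x.numpy() - e)).numpy()) / (2 * h)

print(np.max(np.abs(fd - grad)))  # tiny in float64
```

If this toy version agrees but the real model does not, the discrepancy is in the model or its precision, not in the comparison logic.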
So specifically I'm comparing `dJdC` with `grad`. I have three problems:

1. For small `h`, which would be desirable to get a more accurate finite-difference gradient, the change in `cost` is zero, and hence `dJdC` is also zero, although it is non-zero in `grad` from the `GradientTape`, and more generally I know it should not be zero.
2. Where `dJdC` is not zero, it is generally not close to `grad`, and at best you could say there is some correlation between the two.
3. In many cases, particularly for small `h`, I find the resulting `dJdC` matrix consists of only a few distinct values, i.e. the cost seems to change in steps, although I would expect it to vary continuously.
Both 2. and 3. are improved somewhat by (a) retraining the forward model on my specific setup before running the test and (b) increasing `h` (up to a certain point).
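That increasing `h` helps only up to a point is consistent with the standard step-size trade-off for finite differences: truncation error grows with `h`, while round-off error grows as `h` shrinks. As a rough rule of thumb (a pure-NumPy sketch, not from the code above), the near-optimal central-difference step scales like the cube root of machine epsilon times the parameter scale:

```python
import numpy as np

# Machine epsilon for the two common float types
eps32 = np.finfo(np.float32).eps   # ~1.19e-7
eps64 = np.finfo(np.float64).eps   # ~2.22e-16

# Rule of thumb for central differences: h ~ eps**(1/3) * scale
print(eps32 ** (1 / 3))  # ~5e-3  -> float32 needs a surprisingly large h
print(eps64 ** (1 / 3))  # ~6e-6
```

In float32, a step much smaller than roughly `5e-3` times the parameter scale is dominated by round-off, which may be why `h = 1e-4` sits in the bad regime.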
Clearly I do not believe there is a problem with TensorFlow's `GradientTape`, so my question is: what am I missing here? I must be doing something very stupid. The only other thing I can think of that could explain some of this is that I'm running into precision issues. I should add that I'm not very experienced with Python, which might be clear from the code sample above. I'd really appreciate any help, as I don't feel I can proceed until I convince myself there is no problem with the gradients or with how they are being applied to minimize the cost.
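The precision suspicion is easy to test in isolation. With float32 (TensorFlow's default), a perturbation of `h = 1e-4` is below the representable spacing around moderately large values, so an update like `slidingco[i, j] + h` can round back to exactly the same number (problem 1), and the cost itself can only change in steps of the float32 spacing around its own magnitude (problem 3). A minimal NumPy sketch (the values `10000` and `1e6` are illustrative, not from the actual model):

```python
import numpy as np

# A perturbation smaller than half the float32 spacing simply vanishes
v = np.float32(10000.0)
h = np.float32(1e-4)
print(v + h == v)                    # True: the parameter never moved

# Spacing between adjacent float32 values near 1e6: cost differences
# smaller than this cannot be represented, so the cost moves in steps
print(np.spacing(np.float32(1e6)))   # 0.0625
```

If this matches the observed behaviour, casting the cost evaluation (or the whole comparison) to float64 would be a natural next diagnostic.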