Hello everyone,
I am working on a model using DeepONet from DeepXDE, which uses Keras behind the scenes, and I am running into some REALLY weird behavior.
I am researching the effects of switching to mixed_float16
on model memory, speed, and accuracy. The weird part: with the full batch size, the gradients were 0 and the model wasn't updating from iteration to iteration.
However, when I brought the batch size down below a certain threshold (512 items for one model, 656 for another), the model trained again and the gradients were no longer zero.
I have spent a LOT of time looking into this and was wondering if anyone has seen this behavior before and knows what's wrong?
For example, with this code:
```python
import deepxde as dde
import numpy as np
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")
dde.config.set_default_float("float16")

# Load dataset
d = np.load("antiderivative_aligned_train.npz", allow_pickle=True)
X_train = (d["X"][0].astype(np.float16), d["X"][1].astype(np.float16))
y_train = d["y"].astype(np.float16)
d = np.load("antiderivative_aligned_test.npz", allow_pickle=True)
X_test = (d["X"][0].astype(np.float16), d["X"][1].astype(np.float16))
y_test = d["y"].astype(np.float16)
data = dde.data.TripleCartesianProd(
    X_train=X_test, y_train=y_test, X_test=X_train, y_test=y_train
)

# Choose a network
m = 100
dim_x = 1
net = dde.nn.DeepONetCartesianProd(
    [m, 40, 40],
    [dim_x, 40, 40],
    "relu",
    "Glorot normal",
)

# Define a Model
model = dde.Model(data, net)

# Compile and Train
model.compile("adam", lr=0.001, metrics=["mean l2 relative error"])
losshistory, train_state = model.train(iterations=200, batch_size=656)
```
the model does not train (the loss is the same every epoch), but when I set batch_size to 655, it trains. (The code is from here: Antiderivative operator from an aligned dataset — DeepXDE 1.11.2.dev16+ga856b4e documentation.)
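For context, by "the gradients were 0" I mean a check along these lines. This is a minimal standalone sketch on a toy Keras model under mixed_float16, not my actual DeepONet setup; the toy architecture, shapes, and random data are placeholders:
```python
import numpy as np
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Toy stand-in for the real network, just to show the kind of check I mean.
toy = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(100,)),
        tf.keras.layers.Dense(40, activation="relu"),
        tf.keras.layers.Dense(1),
    ]
)
loss_fn = tf.keras.losses.MeanSquaredError()

# Placeholder batch in float16, same dtype as in my snippet above.
x = np.random.rand(656, 100).astype(np.float16)
y = np.random.rand(656, 1).astype(np.float16)

with tf.GradientTape() as tape:
    pred = toy(x, training=True)
    loss = loss_fn(y, pred)
grads = tape.gradient(loss, toy.trainable_variables)

# One gradient norm per trainable variable; in my real runs these are the
# values that come out as zero once the batch size crosses the threshold.
for i, g in enumerate(grads):
    print(i, float(tf.norm(tf.cast(g, tf.float32))))
```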
I have talked to the author of DeepXDE about this, and he does not know what is going on, which makes me think it is a Keras/TensorFlow quirk.
Switching to float32 instead of mixed_float16 also fixes this, but my whole purpose was to try to use mixed_float16, which has worked well with other types of models in DeepXDE.
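For completeness, the float32 "fix" is just reverting the precision-related lines at the top of the snippet above (and casting the .npz arrays to np.float32 instead of np.float16):
```python
import deepxde as dde
import tensorflow as tf

# Reverting the global precision settings makes training work again,
# but it defeats the purpose of the mixed-precision experiment.
tf.keras.mixed_precision.set_global_policy("float32")
dde.config.set_default_float("float32")
```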