Keras Model Memory Leak

I use the following code to train neural networks of the same structure many times in a for-loop.
However, CPU memory keeps going up from iteration to iteration, while GPU memory stays stable.
I have added manual garbage collection at the end of the loop, but there is still a memory leak.
How can I prevent the memory leak when training these Keras models?

I know there is a workaround using load_weights and save_weights (a rough sketch of what I mean is included after the code below). Are there other solutions? I'm curious why even manual garbage collection cannot stop the leak.

Ubuntu 22.04 LTS, TensorFlow 2.13.0, NVIDIA RTX 4090

import gc
from itertools import product

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, LSTM

...  # load the data; define the stop_early callback, dataset_name, lag, height, etc.

# Template model; each iteration trains freshly initialized clones of it.
l0 = Input(shape=(x_train.shape[1], x_train.shape[2]))
l1 = LSTM(32)(l0)
l2 = Dense(32, activation='relu')(l1)
l3 = Dense(1, activation='linear')(l2)
basic_model = tf.keras.Model(l0, l3)
for i, j in product(range(x_train.shape[2]), repeat=2):  # x_train.shape[2] >= 13
    ... # define different datasets
    # Fresh clone of the template (newly initialized weights) for the 'ur' model
    ur = tf.keras.models.clone_model(basic_model)
    ur.compile(optimizer='adam', loss='mse')
    sbm_ur = tf.keras.callbacks.ModelCheckpoint(f'raw/3_{dataset_name}_nn_{lag}/{height}_{i}_{j}_ur.h5', save_best_only=True)
    ur.fit(x_train_ur, y_train_ur, validation_data=(x_valid_ur, y_valid_ur), epochs=5000, batch_size=10000, callbacks=[stop_early, sbm_ur], verbose=0)
    # Fresh clone of the template for the 'r' model
    r = tf.keras.models.clone_model(basic_model)
    r.compile(optimizer='adam', loss='mse')
    sbm_r = tf.keras.callbacks.ModelCheckpoint(f'raw/3_{dataset_name}_nn_{lag}/{height}_{i}_{j}_r.h5', save_best_only=True)
    r.fit(x_train_r, y_train_r, validation_data=(x_valid_r, y_valid_r), epochs=5000, batch_size=10000, callbacks=[stop_early, sbm_r], verbose=0)
    ...  # statistical inference
    # Manual cleanup at the end of each iteration
    del r, ur
    tf.keras.backend.clear_session()
    del x_train_ur, y_train_ur, x_train_r, y_train_r, \
        x_valid_ur, y_valid_ur, x_valid_r, y_valid_r, \
        x_test_ur, y_test_ur, x_test_r, y_test_r
    gc.collect()
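
For reference, the save_weights/load_weights workaround I have in mind looks roughly like this (just a sketch: the model is built once, its freshly initialized weights are saved, and every iteration restores them instead of cloning a new model; reuse_model and 'init_weights.h5' are placeholder names, and the second ('r') model would follow the same pattern):

reuse_model = tf.keras.models.clone_model(basic_model)  # independent copy of the template
reuse_model.save_weights('init_weights.h5')             # snapshot of the initial weights

for i, j in product(range(x_train.shape[2]), repeat=2):
    ...  # define the datasets as before
    reuse_model.load_weights('init_weights.h5')        # reset the weights
    reuse_model.compile(optimizer='adam', loss='mse')  # recompile to reset the optimizer state
    sbm_ur = tf.keras.callbacks.ModelCheckpoint(
        f'raw/3_{dataset_name}_nn_{lag}/{height}_{i}_{j}_ur.h5', save_best_only=True)
    reuse_model.fit(x_train_ur, y_train_ur, validation_data=(x_valid_ur, y_valid_ur),
                    epochs=5000, batch_size=10000, callbacks=[stop_early, sbm_ur], verbose=0)
    ...  # same idea for the 'r' model, then statistical inference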

Hi @cloudy,

Welcome to the TensorFlow Forum.

Manual garbage collection on its own might not be sufficient here: TensorFlow manages part of its memory outside Python's garbage collector, so gc.collect() may not release everything. Calling tf.keras.backend.clear_session() after each training iteration releases the global state Keras accumulates and may effectively address the memory leak. Since your loop already does this, could you also please upgrade to the latest TF version and let us know whether the leak persists?
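
As a minimal sketch, the cleanup usually suggested at the end of each iteration (which your loop already largely follows) is: drop the Python references to the models first, then clear the Keras global state, then run the garbage collector.

del r, ur                         # release the model objects first
tf.keras.backend.clear_session()  # then reset the global Keras state accumulated across iterations
gc.collect()                      # then let Python reclaim whatever is now unreferenced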

Thank you!