Memory leak with training different models in loops

:brain: TensorFlow Memory Growth Issue in Training Loop

:pushpin: Summary

I encountered a progressive memory growth issue when repeatedly creating, training, and deleting a tf.keras.Model inside a loop. Despite explicitly clearing the session, deleting the model, and forcing garbage collection, memory usage keeps increasing over time.

This behavior is consistent across:

  • Operating Systems: Linux, Windows 11
  • Python Versions: 3.11.15, 3.12.15
  • TensorFlow Variants: tensorflow, tensorflow-cpu

:test_tube: Minimal Reproducible Example

Code

import tensorflow as tf
import time
import psutil
import os
import gc

p = psutil.Process(os.getpid())

class MyModel(tf.keras.Model):

    def __init__(self):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(100, activation=tf.nn.relu)
        self.dense2 = tf.keras.layers.Dense(100, activation=tf.nn.softmax)
        self.dense3 = tf.keras.layers.Dense(100, activation=tf.nn.softmax)
        # no activation on the output layer: the loss below uses from_logits=True
        self.dense4 = tf.keras.layers.Dense(100)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        x = self.dense3(x)
        x = self.dense4(x)
        return x

mem = []

for r in range(200):
    # sample RSS (in MiB) before building this iteration's model
    mem.append(round(p.memory_info().rss / 1024**2, 3))

    model = MyModel()

    ds = tf.data.Dataset.from_tensor_slices(
        (tf.random.uniform((64 * 4, 1000)), tf.ones((64 * 4)))
    )

    model.compile(
        optimizer='sgd',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    )

    model.fit(ds.batch(64), verbose=0)

    # explicit cleanup; none of this stops the RSS growth
    del model
    tf.keras.backend.clear_session()
    gc.collect()
    time.sleep(3)

:chart_increasing: Observed Behavior

  • Memory usage (RSS) steadily increases with each loop iteration.
  • This occurs despite:
    • del model
    • tf.keras.backend.clear_session()
    • gc.collect()
    • No persistent references to the model or dataset

Plotting the mem list shows a clear, roughly linear (sometimes step-wise) increase in memory consumption over time.
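For a quick numeric summary without plotting, the per-iteration growth can be estimated directly from the mem samples. The values below are illustrative placeholders, not measurements from my runs:

```python
# Summarize RSS growth from the per-iteration samples collected in `mem`.
# These numbers are made-up placeholders standing in for real measurements.
mem = [350.0, 355.2, 360.1, 365.4, 370.0, 375.3]  # MiB, one sample per loop

deltas = [b - a for a, b in zip(mem, mem[1:])]
avg_growth = sum(deltas) / len(deltas)

print(f"total growth: {mem[-1] - mem[0]:.1f} MiB over {len(deltas)} iterations")
print(f"average growth per iteration: {avg_growth:.2f} MiB")
```

With my real samples, a steadily positive average like this (rather than one that decays toward zero) is what distinguishes a genuine leak from warm-up allocation.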


:thinking: Expected Behavior

Memory should:

  • Stabilize after a few iterations, or
  • Be reclaimed after session clearing and garbage collection

:magnifying_glass_tilted_left: Additional Notes

  • I found this issue during hyperparameter optimization that includes training multiple models in the same session
  • The dataset is recreated every loop but is small and should not cause accumulation.
  • No custom training loop is used—only model.fit.
  • The issue appears independent of:
    • Hardware
    • OS
    • Python version
    • TensorFlow CPU/GPU variant

:red_question_mark: Questions

  • Is this expected behavior due to internal TensorFlow caching or graph tracing?
  • Could this be related to:
    • tf.function retracing?
    • Dataset pipeline caching?
    • Backend allocator behavior?

:puzzle_piece: Attempted Mitigations

  • Forcing garbage collection (gc.collect()) → no improvement
  • Clearing session → no improvement
  • Deleting model → no improvement
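One allocator-level lever I did not try, relevant to the backend-allocator question above: on Linux with glibc, malloc_trim can ask the allocator to return freed heap pages to the OS. This is a real glibc call, but note it only addresses allocator retention; it cannot release memory that TensorFlow's internal caches still reference. A hedged sketch:

```python
import ctypes

def try_malloc_trim():
    """Ask glibc to return freed heap pages to the OS (Linux/glibc only).

    Returns True if trimming ran, False if unavailable on this platform.
    This rules allocator retention in or out; it does not touch memory
    that TensorFlow's internal caches still hold references to.
    """
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim(0)
        return True
    except (OSError, AttributeError):
        return False
```

If RSS drops after calling this between iterations, the growth is (at least partly) allocator retention rather than a true leak; if it does not, the memory is still referenced somewhere inside the process.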

:folded_hands: Feedback

Any insights or suggestions would be greatly appreciated.
Happy to provide additional diagnostics if needed!

So, I found that if you set run_eagerly=True in model.compile and keep the cleanup (tf.keras.backend.clear_session(), del model, and gc.collect()), you no longer run into the memory allocation issue: memory increases a little at first but then stabilizes. The trade-off, of course, is a much longer runtime, since every batch executes eagerly instead of through a compiled graph.

I found a feasible workaround for memory issues when training multiple models in the SAME process. I tested a lot, and here is what I figured out: the main culprit is tracing in the underlying C++ libraries. I don't know how they are implemented, but from what I've read, the caches built there are not freed when a model or tf.data.Dataset is deleted (or even when the computation graph is cleared). TensorFlow keeps these caches in memory for the entire lifetime of the process; the memory is only released when the process ends.

Exactly this behaviour suggests the workaround: wrap the training of each model in a separate process that ends once training and evaluation are done. The parent process retrieves the results, the child process exits, and the C++ caches are freed along with it. This way there is no unbounded memory growth anymore.
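A minimal sketch of that pattern using only the standard library's multiprocessing module. The train_one_model function below is a hypothetical stand-in: in real code its body would import TensorFlow (inside the child, on purpose, so the parent never loads the TF runtime), build and fit the model, and put the metrics on the queue:

```python
import multiprocessing as mp

def train_one_model(config, queue):
    """Runs in a child process; all TensorFlow state dies with it.

    Hypothetical stand-in body. In real code you would do the TF work here:

        import tensorflow as tf   # import inside the child on purpose
        model = MyModel()
        model.compile(optimizer="sgd", loss=...)
        history = model.fit(ds.batch(64), verbose=0)
        queue.put(min(history.history["loss"]))
    """
    queue.put({"config": config, "loss": 0.123})  # placeholder result

def run_trial(config):
    """Train one model in a fresh process and return its result."""
    ctx = mp.get_context("fork")  # Unix; on Windows use "spawn" + __main__ guard
    queue = ctx.Queue()
    proc = ctx.Process(target=train_one_model, args=(config, queue))
    proc.start()
    result = queue.get()  # read before join() to avoid blocking on a full pipe
    proc.join()           # child exits here, and the C++ caches go with it
    return result

for units in (64, 128, 256):  # e.g. a small hyperparameter sweep
    print(run_trial({"units": units}))
```

For larger sweeps, a multiprocessing.Pool with maxtasksperchild=1 gives the same effect: each trial gets a fresh worker process, so TensorFlow's process-level caches never accumulate across trials.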


Hi @DefinitlyNotAModel, glad to know that you've resolved the issue, and thank you for sharing such a detailed explanation and workaround; this will be helpful to others facing the same problem.
Thank you!