Memory leak with training different models in loops

:brain: TensorFlow Memory Growth Issue in Training Loop

:pushpin: Summary

I encountered a progressive memory growth issue when repeatedly creating, training, and deleting a tf.keras.Model inside a loop. Despite explicitly clearing the session, deleting the model, and forcing garbage collection, memory usage keeps increasing over time.

This behavior is consistent across:

  • Operating Systems: Linux, Windows 11
  • Python Versions: 3.11.15, 3.12.15
  • TensorFlow Variants: tensorflow, tensorflow-cpu

:test_tube: Minimal Reproducible Example

Code

import tensorflow as tf
import time
import psutil
import os
import gc

p = psutil.Process(os.getpid())

class MyModel(tf.keras.Model):

    def __init__(self):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(100, activation=tf.nn.relu)
        self.dense2 = tf.keras.layers.Dense(100, activation=tf.nn.softmax)
        self.dense3 = tf.keras.layers.Dense(100, activation=tf.nn.softmax)
        # no activation on the output layer: the loss below uses from_logits=True
        self.dense4 = tf.keras.layers.Dense(100)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        x = self.dense3(x)
        x = self.dense4(x)
        return x

mem = []

for r in range(200):
    # sample RSS (in MiB) before building this iteration's model
    mem.append(round(p.memory_info().rss / 1024**2, 3))

    model = MyModel()

    ds = tf.data.Dataset.from_tensor_slices(
        (tf.random.uniform((64 * 4, 1000)), tf.ones((64 * 4)))
    )

    model.compile(
        optimizer='sgd',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    )

    model.fit(ds.batch(64), verbose=0)

    # explicit cleanup; none of this stops the RSS growth
    del model
    tf.keras.backend.clear_session()
    gc.collect()
    time.sleep(3)

:chart_increasing: Observed Behavior

  • Memory usage (RSS) steadily increases with each loop iteration.
  • This occurs despite:
    • del model
    • tf.keras.backend.clear_session()
    • gc.collect()
    • No persistent references to the model or dataset

Plotting the mem list shows a clear, roughly linear (sometimes step-wise) increase in memory consumption over time.
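For a quick numeric summary without plotting, the per-iteration growth can be estimated directly from the mem samples. The values below are illustrative placeholders, not measurements from my runs:

```python
# Summarize RSS growth from the per-iteration samples collected in `mem`.
# These numbers are made-up placeholders standing in for real measurements.
mem = [350.0, 355.2, 360.1, 365.4, 370.0, 375.3]  # MiB, one sample per loop

deltas = [b - a for a, b in zip(mem, mem[1:])]
avg_growth = sum(deltas) / len(deltas)

print(f"total growth: {mem[-1] - mem[0]:.1f} MiB over {len(deltas)} iterations")
print(f"average growth per iteration: {avg_growth:.2f} MiB")
```

With my real samples, a steadily positive average like this (rather than one that decays toward zero) is what distinguishes a genuine leak from warm-up allocation.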


:thinking: Expected Behavior

Memory should:

  • Stabilize after a few iterations, or
  • Be reclaimed after session clearing and garbage collection

:magnifying_glass_tilted_left: Additional Notes

  • I found this issue during hyperparameter optimization that includes training multiple models in the same session
  • The dataset is recreated every loop but is small and should not cause accumulation.
  • No custom training loop is used—only model.fit.
  • The issue appears independent of:
    • Hardware
    • OS
    • Python version
    • TensorFlow CPU/GPU variant

:red_question_mark: Questions

  • Is this expected behavior due to internal TensorFlow caching or graph tracing?
  • Could this be related to:
    • tf.function retracing?
    • Dataset pipeline caching?
    • Backend allocator behavior?

:puzzle_piece: Attempted Mitigations

  • Forcing garbage collection (gc.collect()) → no improvement
  • Clearing session → no improvement
  • Deleting model → no improvement
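One allocator-level lever I did not try, relevant to the backend-allocator question above: on Linux with glibc, malloc_trim can ask the allocator to return freed heap pages to the OS. This is a real glibc call, but note it only addresses allocator retention; it cannot release memory that TensorFlow's internal caches still reference. A hedged sketch:

```python
import ctypes

def try_malloc_trim():
    """Ask glibc to return freed heap pages to the OS (Linux/glibc only).

    Returns True if trimming ran, False if unavailable on this platform.
    This rules allocator retention in or out; it does not touch memory
    that TensorFlow's internal caches still hold references to.
    """
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim(0)
        return True
    except (OSError, AttributeError):
        return False
```

If RSS drops after calling this between iterations, the growth is (at least partly) allocator retention rather than a true leak; if it does not, the memory is still referenced somewhere inside the process.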

:folded_hands: Feedback

Any insights or suggestions would be greatly appreciated.
Happy to provide additional diagnostics if needed!

So, I found that if you set run_eagerly=True in model.compile and keep the cleanup (tf.keras.backend.clear_session(), del model, and gc.collect()), you no longer run into the memory allocation issue: memory increases a little at first but then stabilizes. The trade-off, of course, is a much longer runtime, since every batch executes eagerly instead of through a compiled graph.

I found a feasible workaround for memory issues when training multiple models in the SAME process. I tested a lot, and here is what I figured out: the main culprit is tracing in the underlying C++ libraries. I don't know how they are implemented, but from what I've read, the caches built there are not freed when a model or tf.data.Dataset is deleted (or even when the computation graph is cleared). TensorFlow keeps these caches in memory for the entire lifetime of the process; the memory is only released when the process ends.

Exactly this behaviour suggests the workaround: wrap the training of each model in a separate process that ends once training and evaluation are done. The parent process retrieves the results, the child process exits, and the C++ caches are freed along with it. This way there is no unbounded memory growth anymore.
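A minimal sketch of that pattern using only the standard library's multiprocessing module. The train_one_model function below is a hypothetical stand-in: in real code its body would import TensorFlow (inside the child, on purpose, so the parent never loads the TF runtime), build and fit the model, and put the metrics on the queue:

```python
import multiprocessing as mp

def train_one_model(config, queue):
    """Runs in a child process; all TensorFlow state dies with it.

    Hypothetical stand-in body. In real code you would do the TF work here:

        import tensorflow as tf   # import inside the child on purpose
        model = MyModel()
        model.compile(optimizer="sgd", loss=...)
        history = model.fit(ds.batch(64), verbose=0)
        queue.put(min(history.history["loss"]))
    """
    queue.put({"config": config, "loss": 0.123})  # placeholder result

def run_trial(config):
    """Train one model in a fresh process and return its result."""
    ctx = mp.get_context("fork")  # Unix; on Windows use "spawn" + __main__ guard
    queue = ctx.Queue()
    proc = ctx.Process(target=train_one_model, args=(config, queue))
    proc.start()
    result = queue.get()  # read before join() to avoid blocking on a full pipe
    proc.join()           # child exits here, and the C++ caches go with it
    return result

for units in (64, 128, 256):  # e.g. a small hyperparameter sweep
    print(run_trial({"units": units}))
```

For larger sweeps, a multiprocessing.Pool with maxtasksperchild=1 gives the same effect: each trial gets a fresh worker process, so TensorFlow's process-level caches never accumulate across trials.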


Hi @DefinitlyNotAModel, glad to know that you've resolved the issue, and thank you for sharing such a detailed explanation and workaround; this will be helpful to others facing the same problem.
Thank you!