Hello TensorFlow community,
I’m facing an issue related to memory growth when using TensorFlow for a multi-round training process. Specifically, I have a model training loop in which I generate training and evaluation data in each round, and my memory usage seems to keep growing, eventually causing out-of-memory errors. I’m trying to understand how I can effectively manage or release memory during these iterations.
Here is a simplified version of my code:

```python
import gc

# Multi-round training loop: fresh training and evaluation tensors are
# generated at the start of every round.
for num_round in range(1, 1 + total_num_round):
    train_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, epochs_t + 1)
    eval_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, eval_num_batch)

    # ... train and evaluate ...

    # Drop the round's data and force garbage collection.
    del train_data, eval_data
    gc.collect()
```
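To quantify the growth, a per-round snapshot of TensorFlow's allocator can be logged. This is a minimal sketch assuming a single GPU visible as `GPU:0`; on a CPU-only run, a process-level tool such as `psutil` would be needed instead:

```python
import tensorflow as tf

def log_memory(round_idx):
    # Current and peak bytes held by TensorFlow's allocator for this device.
    info = tf.config.experimental.get_memory_info('GPU:0')
    print(f"round {round_idx}: current={info['current']:,} peak={info['peak']:,}")
```

Calling `log_memory(num_round)` right after the `gc.collect()` shows whether `current` falls back to a baseline or keeps climbing from round to round.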
Issues I’m Facing:
- The `train_data` and `eval_data` generated in each round occupy a lot of memory, and I cannot seem to release this memory effectively, leading to continuous memory growth.
- I have tried several approaches to control memory usage:
  - Using `assign()` instead of repeatedly defining `train_data` and `eval_data` (sketched in the code after this list).
  - Using `gc.collect()` and `del train_data, eval_data` to free up memory, but these methods did not work.
  - Using `tf.keras.backend.clear_session()` between rounds.
- The function `generate_all_batch_s_path_samples` is not decorated with `tf.function` because it uses threading for parallel computation, which makes it incompatible with `tf.function`.
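For reference, this is roughly what the `assign()` attempt looked like; the shapes and the stand-in sampler below are placeholders for my real code:

```python
import numpy as np
import tensorflow as tf

batch_size, path_len, total_num_round = 64, 101, 10  # placeholder sizes

def sample_paths():
    # Stand-in for generate_all_batch_s_path_samples.
    return np.random.rand(batch_size, path_len).astype(np.float32)

# Allocate the buffers once, before the round loop, so every round writes
# into the same device memory instead of materializing fresh tensors.
train_buf = tf.Variable(tf.zeros((batch_size, path_len)), trainable=False)
eval_buf = tf.Variable(tf.zeros((batch_size, path_len)), trainable=False)

for num_round in range(1, 1 + total_num_round):
    train_buf.assign(sample_paths())
    eval_buf.assign(sample_paths())
    # ... train and evaluate using train_buf / eval_buf ...
```

The `trainable=False` flag keeps the buffers out of any optimizer's variable list.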
Questions:
- Is there a more effective way to release memory between iterations, besides using `tf.keras.backend.clear_session()`? (My current end-of-round cleanup is shown after these questions.)
- Is there a recommended approach to managing memory growth in multi-round training scenarios like this?
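For concreteness, this is the end-of-round cleanup the first question refers to, combining everything from the attempts above (shortened loop and random stand-in data for illustration):

```python
import gc
import tensorflow as tf

for num_round in range(1, 4):  # shortened loop for illustration
    train_data = tf.random.uniform((64, 101))  # stand-in for the sampler
    eval_data = tf.random.uniform((64, 101))

    # ... train and evaluate ...

    # End-of-round cleanup: drop references, reset Keras's global state,
    # and force a garbage-collection pass.
    del train_data, eval_data
    tf.keras.backend.clear_session()
    gc.collect()
```

One thing I am unsure about is whether `clear_session()` is safe here at all, since it resets global Keras state while my models persist across rounds.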
Any advice, suggestions, or code examples would be greatly appreciated! Thank you all in advance for your help.
Context:
- I’m using TensorFlow 2.16.0.
- The data generation process (`generate_all_batch_s_path_samples`) creates new tensors for training and evaluation in each round (a rough sketch of its structure follows).
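In case the structure matters, the generator has roughly this shape; the worker body here is a hypothetical stand-in, and only the threading pattern matches my code:

```python
import concurrent.futures
import numpy as np
import tensorflow as tf

def generate_batches(batch_size, num_batch, path_len=101):
    # Hypothetical stand-in for generate_all_batch_s_path_samples: worker
    # threads each produce one batch, and the results are stacked into a
    # brand-new tensor on every call.
    def one_batch(_):
        return tf.constant(np.random.rand(batch_size, path_len).astype(np.float32))

    with concurrent.futures.ThreadPoolExecutor() as pool:
        batches = list(pool.map(one_batch, range(num_batch)))
    return tf.stack(batches)
```

Because a Python thread pool drives the sampling, decorating this with `tf.function` is not an option, as noted above.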
Thanks again for your support!