Hello TensorFlow community,
I’m facing an issue related to memory growth when using TensorFlow for a multi-round training process. Specifically, I have a model training loop in which I generate training and evaluation data in each round, and my memory usage seems to keep growing, eventually causing out-of-memory errors. I’m trying to understand how I can effectively manage or release memory during these iterations.
Here is a simplified version of my code:

```python
import gc

# Multi-round training loop: fresh training and evaluation tensors are
# generated at the start of every round.
for num_round in range(1, 1 + total_num_round):
    train_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, epochs_t + 1)
    eval_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, eval_num_batch)

    # ... train and evaluate ...

    # Drop the round's data and force garbage collection.
    del train_data, eval_data
    gc.collect()
```
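To quantify the growth, a per-round snapshot of TensorFlow's allocator can be logged. This is a minimal sketch assuming a single GPU visible as `GPU:0`; on a CPU-only run, a process-level tool such as `psutil` would be needed instead:

```python
import tensorflow as tf

def log_memory(round_idx):
    # Current and peak bytes held by TensorFlow's allocator for this device.
    info = tf.config.experimental.get_memory_info('GPU:0')
    print(f"round {round_idx}: current={info['current']:,} peak={info['peak']:,}")
```

Calling `log_memory(num_round)` right after the `gc.collect()` shows whether `current` falls back to a baseline or keeps climbing from round to round.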
Issues I’m Facing:
- The `train_data` and `eval_data` generated in each round occupy a lot of memory, and I cannot seem to release this memory effectively, leading to continuous memory growth.
- I have tried several approaches to control memory usage:
  - Using `assign()` instead of repeatedly defining `train_data` and `eval_data` (sketched in the code after this list).
  - Using `gc.collect()` and `del train_data, eval_data` to free up memory, but these methods did not work.
  - Using `tf.keras.backend.clear_session()` between rounds.
- The function `generate_all_batch_s_path_samples` is not decorated with `tf.function` because it uses threading for parallel computation, which makes it incompatible with `tf.function`.
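For reference, this is roughly what the `assign()` attempt looked like; the shapes and the stand-in sampler below are placeholders for my real code:

```python
import numpy as np
import tensorflow as tf

batch_size, path_len, total_num_round = 64, 101, 10  # placeholder sizes

def sample_paths():
    # Stand-in for generate_all_batch_s_path_samples.
    return np.random.rand(batch_size, path_len).astype(np.float32)

# Allocate the buffers once, before the round loop, so every round writes
# into the same device memory instead of materializing fresh tensors.
train_buf = tf.Variable(tf.zeros((batch_size, path_len)), trainable=False)
eval_buf = tf.Variable(tf.zeros((batch_size, path_len)), trainable=False)

for num_round in range(1, 1 + total_num_round):
    train_buf.assign(sample_paths())
    eval_buf.assign(sample_paths())
    # ... train and evaluate using train_buf / eval_buf ...
```

The `trainable=False` flag keeps the buffers out of any optimizer's variable list.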
Questions:
- Is there a more effective way to release memory between iterations, besides using `tf.keras.backend.clear_session()`? (My current end-of-round cleanup is shown after these questions.)
- Is there a recommended approach to managing memory growth in multi-round training scenarios like this?
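For concreteness, this is the end-of-round cleanup the first question refers to, combining everything from the attempts above (shortened loop and random stand-in data for illustration):

```python
import gc
import tensorflow as tf

for num_round in range(1, 4):  # shortened loop for illustration
    train_data = tf.random.uniform((64, 101))  # stand-in for the sampler
    eval_data = tf.random.uniform((64, 101))

    # ... train and evaluate ...

    # End-of-round cleanup: drop references, reset Keras's global state,
    # and force a garbage-collection pass.
    del train_data, eval_data
    tf.keras.backend.clear_session()
    gc.collect()
```

One thing I am unsure about is whether `clear_session()` is safe here at all, since it resets global Keras state while my models persist across rounds.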
Any advice, suggestions, or code examples would be greatly appreciated! Thank you all in advance for your help.
Context:
- I’m using TensorFlow 2.16.0.
- The data generation process (`generate_all_batch_s_path_samples`) creates new tensors for training and evaluation in each round (a rough sketch of its structure follows).
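In case the structure matters, the generator has roughly this shape; the worker body here is a hypothetical stand-in, and only the threading pattern matches my code:

```python
import concurrent.futures
import numpy as np
import tensorflow as tf

def generate_batches(batch_size, num_batch, path_len=101):
    # Hypothetical stand-in for generate_all_batch_s_path_samples: worker
    # threads each produce one batch, and the results are stacked into a
    # brand-new tensor on every call.
    def one_batch(_):
        return tf.constant(np.random.rand(batch_size, path_len).astype(np.float32))

    with concurrent.futures.ThreadPoolExecutor() as pool:
        batches = list(pool.map(one_batch, range(num_batch)))
    return tf.stack(batches)
```

Because a Python thread pool drives the sampling, decorating this with `tf.function` is not an option, as noted above.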
Thanks again for your support!