I would like to concatenate these two datasets and do a shuffle afterwards:

train_ds = no_melanoma_ds.concatenate(melanoma_ds)
My problem is the shuffle.
I want a well-shuffled training dataset, so I tried:

train_ds = train_ds.shuffle(20000)
I’m using Google Colab and it seems like I ran out of graphics card memory (11 GB limit) → the Colab session crashes.
So I tried to take smaller portions (5000 instead of 20000 examples) from my two datasets no_melanoma_ds and melanoma_ds, shuffle each of them with a buffer_size of only 5000, and concatenate everything afterwards:
def memory_efficient_shuffle(melanoma_ds=melanoma_ds, no_melanoma_portion=no_melanoma_portion):
    shuffle_rounds = 4
    batch_each_class = 2500
    final_shuffled_ds = None
    for i in range(shuffle_rounds):
        tmp_start = batch_each_class * i
        # take the next 2500-example slice from each class
        tmp_melanoma_ds = melanoma_ds.skip(tmp_start).take(batch_each_class)
        tmp_no_melanoma_ds = no_melanoma_portion.skip(tmp_start).take(batch_each_class)
        both_portions_ds = tmp_melanoma_ds.concatenate(tmp_no_melanoma_ds)
        # shuffle only this 5000-example portion
        shuffled_portion_ds = both_portions_ds.shuffle(5000)
        final_shuffled_ds = (shuffled_portion_ds if final_shuffled_ds is None
                             else final_shuffled_ds.concatenate(shuffled_portion_ds))
    return final_shuffled_ds
This actually works and the session does not crash…
But if I try to fetch the first element of the shuffled dataset, it takes so long that I don’t know whether the program will ever terminate.
Does your dataset consist of images? If so, you can use ImageDataGenerator to load and shuffle the data for you.
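A minimal sketch of that approach (the directory layout with one sub-folder per class is an assumption, not something from your post):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Assumed layout: data/melanoma/ and data/no_melanoma/, one sub-folder per class
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    "data/",
    target_size=(384, 384),
    batch_size=32,
    class_mode="binary",
    shuffle=True,  # reshuffles every epoch without a giant in-memory buffer
)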
If it is not images, you can assign each piece of data a unique identifier and shuffle the array of identifiers. You can then lazily load each piece of data as you iterate over the array.
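A rough sketch of that idea (the file names and the load_example helper are hypothetical):

import random

# Hypothetical per-example identifiers, e.g. one file path per example
ids = [f"example_{i:05d}.npy" for i in range(20000)]
random.shuffle(ids)  # shuffling 20000 short strings costs almost no memory

for example_id in ids:
    # load lazily, one example at a time; load_example is a hypothetical
    # helper that would read a single (image, label) pair from disk
    x, y = load_example(example_id)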
On top of that, Colab has Pro tiers that may give you more memory.
The main thing to remember here is that shuffle keeps its whole buffer in memory, so shuffle(20000) loads all 20000 images into memory at once.
Remember that .skip still reads all the data; it just throws the first N examples on the floor. And your version still has to load at least 5k images before it returns the first batch.
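To put rough numbers on it (assuming each buffered element is a decoded 384x384 RGB image stored as uint8):

bytes_per_image = 384 * 384 * 3             # ~0.44 MB per decoded example
buffer_gb = 20_000 * bytes_per_image / 1e9  # what shuffle(20000) must hold
print(f"~{buffer_gb:.1f} GB")               # ~8.8 GB, close to the 11 GB limit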
The data is in the form of TensorFlow records (TFRecords).
This is the link to the dataset: https://www.kaggle.com/cdeotte/melanoma-384x384
It contains a lot of information, but I only use the images and the corresponding labels.
So what you want to do is not shuffle all the images together, but shuffle the list of files (Dataset.list_files shuffles the order each epoch), and then do a smaller shuffle of the individual examples. Start with something like this:
# list_files shuffles the order each iteration
ds = tf.data.Dataset.list_files("train*")
ds = ds.interleave(tf.data.TFRecordDataset, ...)
ds = ds.shuffle(...)
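A fleshed-out version of that sketch might look like the following. The feature keys ("image", "target"), the "train*.tfrec" file pattern, and all the buffer/parallelism numbers are assumptions to adapt to your actual records:

import tensorflow as tf

def parse_example(serialized):
    # Assumed schema: a JPEG-encoded image plus an integer label
    features = tf.io.parse_single_example(serialized, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_jpeg(features["image"], channels=3)
    return image, features["target"]

# list_files shuffles the file order each iteration (i.e. each epoch)
ds = tf.data.Dataset.list_files("train*.tfrec")
# read several files at once so examples from different files get mixed
ds = ds.interleave(tf.data.TFRecordDataset, cycle_length=4,
                   num_parallel_calls=tf.data.AUTOTUNE)
# a much smaller in-memory shuffle is enough once the files are interleaved
ds = ds.shuffle(2048)
ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)

This way only the file names and a 2048-element buffer live in memory, instead of all 20000 images.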