I would like to concatenate these two datasets and do a shuffle afterwards:

train_ds = no_melanoma_ds.concatenate(melanoma_ds)
My problem is the shuffle.
I want a well-shuffled training dataset, so I tried:

train_ds = train_ds.shuffle(20000)
I’m using Google Colab and it seems like I ran out of graphics card memory (11 GB limit) → the Colab session crashes.
So I tried to take smaller portions (5000 instead of 20000 examples) from my two datasets no_melanoma_ds and melanoma_ds, shuffle each of them with a buffer_size of only 5000, and concatenate everything afterwards:
def memory_efficient_shuffle(melanoma_ds=melanoma_ds, no_melanoma_portion=no_melanoma_portion):
    shuffle_rounds = 4
    batch_each_class = 2500
    final_shuffled_ds = None
    for i in range(shuffle_rounds):
        tmp_start = batch_each_class * i
        # take the next 2500-example slice from each class
        tmp_melanoma_ds = melanoma_ds.skip(tmp_start).take(batch_each_class)
        tmp_no_melanoma_ds = no_melanoma_portion.skip(tmp_start).take(batch_each_class)
        both_portions_ds = tmp_melanoma_ds.concatenate(tmp_no_melanoma_ds)
        # shuffle only this 5000-example portion
        shuffled_portion_ds = both_portions_ds.shuffle(5000)
        final_shuffled_ds = (shuffled_portion_ds if final_shuffled_ds is None
                             else final_shuffled_ds.concatenate(shuffled_portion_ds))
    return final_shuffled_ds
This actually works and the session does not crash…
But if I try to fetch the first element of the shuffled dataset, it takes so long that I don’t know whether the program will ever terminate.
Does your dataset consist of images? If so, you can use ImageDataGenerator to load and shuffle the data for you.
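A minimal sketch of that approach (the directory layout with one sub-folder per class is an assumption, not something from your post):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Assumed layout: data/melanoma/ and data/no_melanoma/, one sub-folder per class
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    "data/",
    target_size=(384, 384),
    batch_size=32,
    class_mode="binary",
    shuffle=True,  # reshuffles every epoch without a giant in-memory buffer
)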
If it is not images, you can assign each piece of data a unique identifier and shuffle the array of identifiers. You can then lazily load each piece of data as you iterate over the array.
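A rough sketch of that idea (the file names and the load_example helper are hypothetical):

import random

# Hypothetical per-example identifiers, e.g. one file path per example
ids = [f"example_{i:05d}.npy" for i in range(20000)]
random.shuffle(ids)  # shuffling 20000 short strings costs almost no memory

for example_id in ids:
    # load lazily, one example at a time; load_example is a hypothetical
    # helper that would read a single (image, label) pair from disk
    x, y = load_example(example_id)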
On top of that, Colab has Pro tiers that may give you more memory.
The main thing to remember here is that shuffle keeps its whole buffer in memory, so shuffle(20000) loads all 20000 images into memory at once.
Remember that .skip still reads all the data; it just throws the first N examples on the floor. And your version still has to load at least 5k images before it returns the first batch.
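To put rough numbers on it (assuming each buffered element is a decoded 384x384 RGB image stored as uint8):

bytes_per_image = 384 * 384 * 3             # ~0.44 MB per decoded example
buffer_gb = 20_000 * bytes_per_image / 1e9  # what shuffle(20000) must hold
print(f"~{buffer_gb:.1f} GB")               # ~8.8 GB, close to the 11 GB limit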
The data is in the form of TensorFlow records (TFRecords).
This is the link to the dataset: https://www.kaggle.com/cdeotte/melanoma-384x384
It contains a lot of information, but I only use the images and the corresponding labels.
So what you want to do is not shuffle all the images together, but shuffle the list of files (Dataset.list_files shuffles the order each epoch), and then do a smaller shuffle of the individual examples. Start with something like this:
# list_files shuffles the order each iteration
ds = tf.data.Dataset.list_files("train*")
ds = ds.interleave(tf.data.TFRecordDataset, ...)
ds = ds.shuffle(...)
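A fleshed-out version of that sketch might look like the following. The feature keys ("image", "target"), the "train*.tfrec" file pattern, and all the buffer/parallelism numbers are assumptions to adapt to your actual records:

import tensorflow as tf

def parse_example(serialized):
    # Assumed schema: a JPEG-encoded image plus an integer label
    features = tf.io.parse_single_example(serialized, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_jpeg(features["image"], channels=3)
    return image, features["target"]

# list_files shuffles the file order each iteration (i.e. each epoch)
ds = tf.data.Dataset.list_files("train*.tfrec")
# read several files at once so examples from different files get mixed
ds = ds.interleave(tf.data.TFRecordDataset, cycle_length=4,
                   num_parallel_calls=tf.data.AUTOTUNE)
# a much smaller in-memory shuffle is enough once the files are interleaved
ds = ds.shuffle(2048)
ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)

This way only the file names and a 2048-element buffer live in memory, instead of all 20000 images.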