How is Shuffle random?

I am learning about shuffle in tf.data, and here is my question.
How is the selection of elements from the buffer random? The first n elements fill the buffer directly,
without random sampling, so in a large dataset the first elements have a higher
probability of being selected first, since they are already in the buffer.

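It isn't fully random unless the buffer holds the whole dataset. If only a subset is loaded, the shuffling happens with a certain locality: each output element is drawn only from the elements currently in the buffer.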

For example:

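import tensorflow as tf
# 1000 consecutive integers: 0, 1, ..., 999
dataset = tf.data.Dataset.range(1000)
# Shuffle with a buffer of only 10 elements
dataset = dataset.shuffle(10, reshuffle_each_iteration=True)
# Iterate once and print the elements in the order they come out
print([i.numpy() for i in dataset])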

This gives you:

[9, 1, 11, 4, 2, 13, 7, 16, 12, 8, 10, 19, 21, 20, 5, 15, 6, 22, 23, 25, 17, 28, 31, 26, 18, 29, 24, 33, 36, 3, 27, 40, 0, 35, 32, 30, 37, 42, 46, 41, 38, 34, 14, 49, 53, 47, 51, 52, 56, 48, 50, 39, 60, 44, 63, 45, 43, 66,

Whereas if you load all 1000 elements at once (a buffer size of 1000), you get:

[928, 834, 50, 642, 221, 93, 581, 466, 905, 811, 942, 687, 209, 657, 930, 636, 407, 656, 128, 82,...
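To put a rough number on that locality, here is a small sketch. It assumes each draw picks uniformly from a full buffer, which is my reading of the behaviour and not something I checked in the TensorFlow source. The helper name prob_still_buffered is made up for illustration. Under that assumption, an element that is in the buffer survives one draw with probability 1 - 1/buffer_size, so early elements tend to come out early when the buffer is small.

```python
def prob_still_buffered(buffer_size: int, draws: int) -> float:
    """Probability that one buffered element has NOT been emitted after
    `draws` picks, assuming every pick is uniform over a full buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    return (1.0 - 1.0 / buffer_size) ** draws


# Compare a small buffer with a dataset-sized buffer (1000 elements here).
for draws in (1, 10, 50):
    small = prob_still_buffered(10, draws)
    large = prob_still_buffered(1000, draws)
    print(f"after {draws:>2} draws: buffer=10 -> {small:.4f}, buffer=1000 -> {large:.4f}")
```

With a buffer of 10, element 0 has very likely been emitted within the first few dozen outputs, which matches the first list above. With a buffer of 1000 it can land almost anywhere.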

I don't know any more specifics, but that's the practical part. You can check the docs as well.

Hi @Satej_Raste,

The shuffle operation in tf.data first fills a buffer with a set number of elements (buffer_size). Once the buffer is full, elements are randomly selected from this buffer to feed your model, and each selected element is replaced with the next one from the dataset. This selection and replacement continues throughout each iteration over the dataset, so the output is shuffled within a window of roughly buffer_size elements. For a perfect shuffle, buffer_size has to be at least the size of the dataset.
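The fill, select, and replace steps described above can be sketched in plain Python. This is only an illustration of the description, not TensorFlow's actual implementation; the function name buffer_shuffle and its seed parameter are hypothetical, and uniform selection from the buffer is an assumption.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffer_shuffle(items: Iterable[T], buffer_size: int,
                   seed: Optional[int] = None) -> Iterator[T]:
    """Illustrative buffer-based shuffle (hypothetical helper, not tf.data)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    source = iter(items)
    buffer: List[T] = []

    # 1. Fill phase: take the first buffer_size elements without sampling.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break

    # 2. Steady state: emit a random buffered element and put the next
    #    input element into the freed slot.
    for item in source:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item

    # 3. Drain phase: the input is exhausted, emit the rest in random order.
    while buffer:
        idx = rng.randrange(len(buffer))
        buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
        yield buffer.pop()


sample = list(buffer_shuffle(range(1000), buffer_size=10, seed=0))
print(sample[:20])
```

In this sketch the element at output position k (counting from 0) can never be larger than k + buffer_size - 1 for a range input, because nothing beyond that has been loaded yet. That is the locality seen in the small-buffer output earlier in the thread.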
