How is Shuffle random

Satej_Raste · June 28, 2024, 6:46am

I am learning about shuffle in tf.data, and here is my question.
How is the selection of elements from the buffer random? The first n elements fill the buffer directly,
without random sampling, so for a large dataset the first elements have a higher
probability of being selected first, since they are already present in the buffer.

Mah_Neh · June 29, 2024, 7:59pm

It's not fully random unless the buffer holds the whole dataset. If only a subset is loaded into the buffer, the shuffling happens with a certain locality.

For example:

import tensorflow as tf

# 1000 elements, but a shuffle buffer of only 10
dataset = tf.data.Dataset.range(1000)
dataset = dataset.shuffle(10, reshuffle_each_iteration=True)
print([i.numpy() for i in dataset])

Gives you

[9, 1, 11, 4, 2, 13, 7, 16, 12, 8, 10, 19, 21, 20, 5, 15, 6, 22, 23, 25, 17, 28, 31, 26, 18, 29, 24, 33, 36, 3, 27, 40, 0, 35, 32, 30, 37, 42, 46, 41, 38, 34, 14, 49, 53, 47, 51, 52, 56, 48, 50, 39, 60, 44, 63, 45, 43, 66,

Whereas if you set the buffer size to 1000, so that the whole dataset is loaded at once, you get:

[928, 834, 50, 642, 221, 93, 581, 466, 905, 811, 942, 687, 209, 657, 930, 636, 407, 656, 128, 82,...

I don't know any more specifics, but that's the practical side of it. You can check the docs as well.
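To put a rough number on that locality, here is a small back-of-the-envelope sketch. It assumes a simplified model: the buffer is full, each output is drawn uniformly from it, and the dataset is long enough that the emptied slot is always refilled. The helper name emit_position_pmf is made up for this illustration and is not part of tf.data.

```python
def emit_position_pmf(buffer_size, max_position):
    """Probability that an element sitting in the full buffer is emitted
    at output position 0, 1, ..., max_position (simplified model)."""
    # At every draw the element stays in the buffer with probability 1 - 1/B.
    stay = 1.0 - 1.0 / buffer_size
    return [(1.0 / buffer_size) * stay ** k for k in range(max_position + 1)]


pmf = emit_position_pmf(buffer_size=10, max_position=200)

# Chance that one of the initial buffer elements comes out within the first 30 outputs.
prob_within_first_30 = sum(pmf[:30])
print(prob_within_first_30)
```

Under this model the position follows a geometric distribution, so with a buffer of 10 the early elements do tend to come out early, which matches the concern raised in the question.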

Laxma_Reddy_Patlolla · July 9, 2024, 8:41pm

Hi @Satej_Raste,

The shuffle operation in tf.data first fills a buffer with a set number of elements (buffer_size). Once the buffer is full, elements are randomly selected from this buffer to feed your model and then replaced with new ones from the dataset. This continuous selection and replacement guarantees continuous shuffling throughout each dataset iteration.
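The fill, select, and replace steps described above can be sketched in plain Python. This is only an illustrative model built from that description, not TensorFlow's actual implementation; the function name buffered_shuffle and its seed parameter are assumptions made for the example.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in buffer-shuffled order (illustrative model only)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Step 1: fill the buffer with the first buffer_size elements.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Step 2: emit a random buffered element, then refill its slot
    # with the next element from the source.
    for item in source:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Step 3: the source is exhausted, so drain the rest in random order.
    rng.shuffle(buffer)
    yield from buffer


small_buffer = list(buffered_shuffle(range(1000), buffer_size=10, seed=0))
full_buffer = list(buffered_shuffle(range(1000), buffer_size=1000, seed=0))
print(small_buffer[:20])
print(full_buffer[:20])
```

In this model, the value at output position i can only come from the first i + buffer_size input elements, which is why a small buffer gives an order that stays close to the original one, while a buffer as large as the dataset gives a shuffle over all elements.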
