Dataset from generator shuffling

Afshin_Samani · September 16, 2021, 1:22pm

Hi,

I have created a dataset from a generator like this:

ds_series = tf.data.Dataset.from_generator(
    trim_size,
    args=[data_input_tot_EqLen, trimmed_lbl, seq_len, max_len_per],
    output_types=(tf.float32, tf.int32),
    output_shapes=((5511, 101, 3), (1,)))

Then I shuffle the dataset and split it into training and testing sets:

ds_series = ds_series.shuffle(buffer_size=16)
ds_train = ds_series.take(train_smpls)
ds_valid = ds_series.skip(train_smpls)

I’d like to count the number of samples in each class, so I’d like to see which labels are assigned to the training and testing datasets.

I run the following command:

_, lbl_train = ds_train

This takes a lot of time (I understand why: the trim_size generator I defined above is pretty heavy), but my question is about the messages it shows:

I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 1 of 16

So it counts up from 1 to 16 while filling the buffer. However, this does not fit with what the documentation says about the shuffle buffer size:

It is supposed to take random samples from a 16-sample buffer, which means that the randomization process is not limited to 16 samples.

Am I wrong here?

Renu_Patel · January 23, 2024, 6:31pm

Hi @Afshin_Samani

Welcome to the TensorFlow Forum!

Yes, buffer_size is the number of elements kept in memory for shuffling. shuffle() fills this buffer, then randomly samples an element from it and replaces the selected element with the next element from the dataset, as described in the dataset.shuffle() documentation. The buffer therefore stays full, and the process continues until the dataset is exhausted, so every element passes through the buffer before being yielded.

Please see the example description:

For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer.
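The buffer mechanism described above can be sketched in plain Python. This is only an illustrative model, not TensorFlow's actual implementation; the function name buffered_shuffle and its parameters are hypothetical. It shows that the first output always comes from the first buffer_size inputs, while every element is still emitted exactly once.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Illustrative model of a fixed-size shuffle buffer (not TensorFlow code)."""
    rng = random.Random(seed)
    source = iter(items)
    buffer = []

    # Fill phase: corresponds to the "Filling up shuffle buffer" log messages.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break

    # Steady state: emit a random buffered element, then put the next
    # input element into the freed slot so the buffer stays full.
    for item in source:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item

    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


out = list(buffered_shuffle(range(100), buffer_size=16, seed=0))
print(out[:10])
```

In this model, the element emitted at output position k can only come from the first 16 + k inputs, which is why a small buffer gives only a local shuffle on a large dataset.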

Regarding the message (it is an informational log, as the leading "I" indicates, not an error):

I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 1 of 16

It reports progress while the buffer is being filled in memory. Once the buffer is full, these messages stop and the dataset starts yielding shuffled elements from the buffer.
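On the original goal of counting samples per class: unpacking a dataset object directly (as in `_, lbl_train = ds_train`) iterates over its elements rather than separating features from labels, so one option is to loop over the dataset and tally the labels. The sketch below assumes TensorFlow 2.4 or later and uses a small hypothetical toy_generator in place of trim_size. It also sets reshuffle_each_iteration=False, because by default shuffle() reorders the data on every pass, in which case take() and skip() may not give disjoint subsets.

```python
import collections

import numpy as np
import tensorflow as tf


def toy_generator():
    # Hypothetical stand-in for trim_size: yields (features, label) pairs.
    for i in range(20):
        yield np.zeros((4, 3), dtype=np.float32), np.array([i % 2], dtype=np.int32)


ds = tf.data.Dataset.from_generator(
    toy_generator,
    output_signature=(
        tf.TensorSpec(shape=(4, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(1,), dtype=tf.int32),
    ),
)

# Keep the same shuffled order on every pass so that take() and skip()
# split the data into non-overlapping subsets.
ds = ds.shuffle(buffer_size=16, seed=0, reshuffle_each_iteration=False)

train_smpls = 15
ds_train = ds.take(train_smpls)
ds_valid = ds.skip(train_smpls)


def count_labels(dataset):
    """Iterate once over the dataset and count how often each label occurs."""
    counts = collections.Counter()
    for _, label in dataset.as_numpy_iterator():
        counts[int(label[0])] += 1
    return counts


train_counts = count_labels(ds_train)
valid_counts = count_labels(ds_valid)
print("train:", dict(train_counts))
print("valid:", dict(valid_counts))
```

Each call to count_labels runs the generator again and refills the shuffle buffer, so with a heavy generator it may be worth caching the labels or counting them before building the dataset.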

Thank you.
