TensorFlow Dataset reduce function too slow after skip

Hi everyone,
I’m trying to take 100 random elements per class from the CIFAR-10 dataset and reduce them to a single “image” by taking the mean.
The problem is that the reduce step becomes significantly slower once I use the skip function with a large value.

The code is the following:
for c in classes:
    skip_elem = 100
    class_ds = train_ds.filter(lambda x, y: tf.equal(tf.argmax(y[0]), int(c))).skip(skip_elem).take(100).unbatch()
    z = class_ds.reduce(tf.zeros(shape=(res//8, res//8, 4), dtype=tf.float32), lambda a, b: a + b[0])
    z /= 100

I’ve tried batching and rebatching, but it seems that if I skip only a few elements the reduction executes in a reasonable time. The take function also increases the reduce time, but I expected that.

Why does this happen? Are there other possibilities?

Thanks!

Hi @Matteo_Doria,

Sorry for the delay in response.
The slowdown is likely because skip has to read through the dataset sequentially: skip(n) still produces and discards the first n elements, and if the underlying data format is not optimized for random access, a large skip value means a lot of wasted reading. I suggest using shuffle instead of skip when you need random elements from a large dataset.

Code:

for c in classes:
    # Filter for the specific class and shuffle immediately
    class_ds = train_ds.filter(
        lambda x, y: tf.equal(tf.argmax(y[0]), int(c))
    ).shuffle(
        buffer_size=1000,  # Adjust this based on your dataset size
        seed=42  # Optional: set seed for reproducibility
    ).take(100).unbatch()

    # Same reduction as in your snippet: sum the 100 images, then take the mean
    z = class_ds.reduce(tf.zeros(shape=(res//8, res//8, 4), dtype=tf.float32), lambda a, b: a + b[0])
    z /= 100
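Two further notes. The loop above still re-scans train_ds once per class because of the filter, so with 10 classes you pay the sequential read 10 times; if the decoded dataset fits in memory, calling train_ds = train_ds.cache() before the loop makes those later scans read from memory instead of from disk.

Also, if what you ultimately want is a mean “image” over each whole class rather than over exactly 100 random samples, everything can be computed in a single pass with tf.math.unsorted_segment_sum. The following is only a rough sketch, not tested against your pipeline: num_classes is an assumption, and the (res//8, res//8, 4) element shape and one-hot labels are carried over from your snippet.

    num_classes = 10  # assumption: 10 classes, as in CIFAR-10
    # Running per-class sums and element counts
    sums = tf.zeros((num_classes, res//8, res//8, 4), dtype=tf.float32)
    counts = tf.zeros((num_classes,), dtype=tf.float32)

    # One sequential pass over the batched dataset
    for x, y in train_ds:
        labels = tf.argmax(y, axis=-1)  # one-hot rows -> integer class ids, shape (batch,)
        # Add every image in the batch into the accumulator row selected by its label
        sums += tf.math.unsorted_segment_sum(
            tf.cast(x, tf.float32), labels, num_segments=num_classes)
        counts += tf.math.unsorted_segment_sum(
            tf.ones_like(labels, dtype=tf.float32), labels, num_segments=num_classes)

    class_means = sums / tf.reshape(counts, (num_classes, 1, 1, 1))  # one mean "image" per class

Because every image is folded into its class accumulator as it streams by, the whole dataset is read exactly once, regardless of the number of classes.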

Please let us know if you run into any issues. Thank you.