I’m filtering the dataset according to certain labels. Calling the filtering method itself works fine, but once I call next(iter(dataset)), for certain label values it keeps processing for more than 12 hours, while for other values it returns the result right away.
My filtering code is:
import tensorflow as tf

def balanced_dataset(dataset, labels_list, sample_size=1000):
    datasets_list = []
    for label in labels_list:
        print(f'Preparing the {label} dataset')
        # keep only examples whose label tensor contains the current label
        filtered = dataset.filter(
            lambda x, y: tf.greater(
                tf.reduce_sum(tf.cast(tf.equal(tf.constant(label, dtype=tf.int64), y), tf.float32)),
                tf.constant(0.)))
        datasets_list.append(filtered.take(sample_size))
    # build a dataset of datasets, then interleave them into one balanced dataset
    ds = tf.data.Dataset.from_tensor_slices(datasets_list)
    concat_ds = ds.interleave(lambda x: x, cycle_length=len(labels_list),
                              num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    return concat_ds
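To show how the hang is triggered, here is a minimal sketch of how I call it. The toy dataset, the label values [0, 1, 2] and sample_size=10 are placeholders, not my real data:

# hypothetical toy dataset of (feature, label) pairs with int64 labels
features = tf.random.uniform((100, 8))
labels = tf.random.uniform((100,), maxval=3, dtype=tf.int64)
toy_ds = tf.data.Dataset.from_tensor_slices((features, labels))

balanced = balanced_dataset(toy_ds, labels_list=[0, 1, 2], sample_size=10)
# this is the call that sometimes keeps running for hours on my real data
first_example = next(iter(balanced))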