Need some help to accelerate data retrieval in training pipeline

MJ_Z · January 9, 2025, 6:54am

Hi y’all. I’m trying to implement a time-series prediction model, and my current implementation was highly input-bounded. To be more specific, what I want sorta resembles

keras.utils.timeseries_dataset_from_array

However, I need to design a custom index retrieval process, i.e., only a subset of indices are valid, and I could only sample data pairs like (X[i-seq_len+1:i], y[i]) for all i in valid indices. Plus, I find this implementation too slow, resulting the majority part of time spent on retrieving (iterating) data. So I need a more efficient implementation.

Therefore, I leveraged tf.data.Dataset.interleave API to my dataset, and got a decent performance improve immediately. However, when I watched trace viewer from Tensorflow Profiler, I noticed that there were only 5 threads in tf_data_private_threadpool.

Moreover, I tried the following tricks:

tf.config.threading.set_inter_op_parallelism_threads(16)
tf.config.threading.set_intra_op_parallelism_threads(16)
options = tf.data.Options()
options.threading.private_threadpool_size = NUM_PARALLELS
train_dataset = train_dataset.with_options(options)

but nothing became better.

Moreover, here is the step time of my current implementation:

and here is the result of trace viewer:

It seems that there are only 5 threads responsible for generating data. However, the throughput of interleave process is set to be 16 (which also equals NUM_PARALLELS in the above code).

I don’t know how to increase the number of threads. I’ve tried to set some parameters, but nothing has changed.

Could anyone help me to increase the parallelism of tf.data pipeline?

rcauvin · January 9, 2025, 4:08pm

However, the throughput of interleave process is set to be 16 (which also equals NUM_PARALLELS in the above code).

So in your call to interleave, you specified num_parallel_calls=NUM_PARALLELS?

MJ_Z · January 10, 2025, 2:52am

Yeah, here is my concrete implementation:

def _get_tf_dataset_from_sequence(
    self,
    seq: tf.keras.utils.Sequence,
    batch_size: int,
    to_categorical: bool,
    is_test: bool,
):
    """
    https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator
    SOME ARGUMENTS ARE DEPRECATED: (output_shapes, output_types).
    They will be removed in a future version.
    Instructions for updating: Use output_signature instead
    """
    # X
    output_signature = [
        tuple(tf.TensorSpec(
            shape=(batch_size, sequence_length, X_prime_df.shape[1]),
            dtype=tf.float32,
        ) for (X_prime_df, sequence_length) in zip(self.X_list, self.sequence_length_list))
    ]
    # y
    output_signature.append(
        tf.TensorSpec(shape=(batch_size, len(self.sequence_length_list)), dtype=tf.float32),
    )

    # NUM_PARALLELS = 24
    NUM_PARALLELS = 64

    def to_generator(seq: tf.keras.utils.Sequence, generator_id: int):
        def wrapped_callable():
            for i in range(generator_id, len(seq), NUM_PARALLELS):
                yield seq[i]

        return wrapped_callable

    dataset = (
        tf.data.Dataset.from_tensor_slices(
            [
                tf.data.Dataset.from_generator(
                    to_generator(seq, i), output_signature=signature
                ).prefetch(4)
                for i in range(NUM_PARALLELS)
            ]
        )
        .interleave(
            lambda x: x,
            cycle_length=NUM_PARALLELS,
            block_length=1,
            num_parallel_calls=tf.data.AUTOTUNE,
        )
        .prefetch(2 * NUM_PARALLELS)
    )
    dataset = dataset.apply(tf.data.experimental.assert_cardinality(len(seq)))

    options = tf.data.Options()
    # options.threading.max_intra_op_parallelism = 1
    # options.threading.private_threadpool_size = 32 # NUM_PARALLELS
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.OFF
    )
    dataset = dataset.with_options(options)
    return dataset

So current implementation is based on keras.utils.Sequence which fetches 1 batch of data when calling __getitem__(self, index). However, current implementation mainly has 2 major problems:

It seems that prefetch of each shard will be called lazily, resulting the first NUM_PARALLELS steps in model.fit lack parallelism;
prefetch in each shard does not retain between consecutive epochs. When model.fit moves to next epoch, this dataset will NOT utilize prefetch result provided by the previous epoch.

MJ_Z · January 10, 2025, 4:11am

The following profiling result corroborates the first problem:

MJ_Z · January 10, 2025, 4:13am

Moreover, let me elaborate what I want to obtain and what I’m suffering:

Topic		Replies	Views
Parallel data extraction with tf.data.Dataset.from_generator General Discussion datasets	2	786	June 1, 2023
How to speedup input pipeline beyond vectorize and num_parallel_calls? General Discussion datasets , text-vectorization , epoc , tfdata	2	359	January 17, 2024
How efficiently filter a specific number of entries and concatenating them in a unique tf.data.Dataset General Discussion tfdata	1	365	October 11, 2024
Dataset with tf.data with batches of randomly n chosen speakers and their m utterances TensorFlow datasets , tfdata	1	772	July 5, 2023
The speed of training is reduced using a custom method in tensorflow.keras.layers General Discussion keras , tfdata , help_request	9	2599	January 19, 2022

Need some help to accelerate data retrieval in training pipeline

Related topics