Need some help accelerating data retrieval in a training pipeline

Hi y’all. I’m trying to implement a time-series prediction model, and my current input pipeline is heavily input-bound. To be more specific, what I want sorta resembles

keras.utils.timeseries_dataset_from_array

However, I need to design a custom index-retrieval process: only a subset of indices is valid, and I can only sample data pairs like (X[i-seq_len+1:i], y[i]) for each i in the valid indices. I also found this implementation too slow, with the majority of the time spent retrieving (iterating over) data, so I need a more efficient implementation.
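For reference, here is a minimal sketch of the slow version (the shapes, X, y, and valid_idx below are placeholders for my real data); I suspect the single-threaded Python generator is exactly why it’s input-bound:

import numpy as np
import tensorflow as tf

seq_len = 32
X = np.random.rand(10_000, 8).astype(np.float32)  # placeholder features
y = np.random.rand(10_000).astype(np.float32)     # placeholder targets
valid_idx = np.arange(seq_len, len(X))            # placeholder valid indices

def gen():
    # One (window, label) pair per valid index, as described above.
    for i in valid_idx:
        yield X[i - seq_len + 1 : i], y[i]

train_dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None, 8), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.float32),
    ),
)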

Therefore, I applied the tf.data.Dataset.interleave API to my dataset and immediately got a decent performance improvement. However, when I looked at the trace viewer in the TensorFlow Profiler, I noticed that there were only 5 threads in tf_data_private_threadpool.
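Simplified, the interleave version looks roughly like this (the strided sharding via Dataset.shard is a stand-in for my real index logic):

NUM_PARALLELS = 16
X_t, y_t = tf.constant(X), tf.constant(y)
indices = tf.data.Dataset.from_tensor_slices(valid_idx)

def make_shard(shard_id):
    # Each branch handles every NUM_PARALLELS-th valid index and
    # builds its (window, label) pairs with pure TF ops.
    return indices.shard(NUM_PARALLELS, shard_id).map(
        lambda i: (X_t[i - seq_len + 1 : i], y_t[i])
    )

train_dataset = tf.data.Dataset.range(NUM_PARALLELS).interleave(
    make_shard,
    cycle_length=NUM_PARALLELS,
    num_parallel_calls=NUM_PARALLELS,
    deterministic=False,
)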

Moreover, I tried the following tricks:

# Raise TF's global op-level thread counts.
tf.config.threading.set_inter_op_parallelism_threads(16)
tf.config.threading.set_intra_op_parallelism_threads(16)

# Enlarge the tf.data private threadpool for this dataset.
options = tf.data.Options()
options.threading.private_threadpool_size = NUM_PARALLELS
train_dataset = train_dataset.with_options(options)

but nothing improved.

Here is the step time of my current implementation:

[Imgur screenshot: step-time breakdown]

and here is the trace-viewer result:

[Imgur screenshot: trace viewer]

It seems that only 5 threads are responsible for generating data. However, the parallelism of the interleave op is set to 16 (which also equals NUM_PARALLELS in the code above).

I don’t know how to increase the number of threads; I’ve tried setting the parameters above, but nothing changed.

Could anyone help me increase the parallelism of the tf.data pipeline?

> However, the parallelism of the interleave op is set to 16 (which also equals NUM_PARALLELS in the code above).

So in your call to interleave, you specified num_parallel_calls=NUM_PARALLELS?
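If so, I’d expect the call to look roughly like this (make_shard standing in for whatever per-source dataset function you use); it may also be worth trying tf.data.AUTOTUNE instead of a fixed count:

train_dataset = tf.data.Dataset.range(NUM_PARALLELS).interleave(
    make_shard,                        # your per-source dataset function
    cycle_length=NUM_PARALLELS,        # how many input datasets to cycle over
    num_parallel_calls=NUM_PARALLELS,  # or tf.data.AUTOTUNE
    deterministic=False,               # allow out-of-order elements for speed
)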