Hi,
I want to use tf.data.Dataset as the main building block in my data pipeline for training a neural network on time series with TensorFlow, ideally without resorting to custom data loader classes.
Question: How do you perform preprocessing that requires more than TensorFlow can express, i.e. operations that require, for example, NumPy input and therefore cannot be integrated into the TensorFlow graph?
Example: Given time series data from different sources, I would like to resample the series onto a common time grid so that they can be combined in a single training dataset. How can that be achieved?
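To make the example concrete, here is a minimal sketch of the kind of resampling I mean, done purely in NumPy/SciPy; the time grids and the common target grid are made up for illustration:

```python
import numpy as np
from scipy.interpolate import interp1d

def resample(t, x, t_new):
    # Interpolate a series x, sampled at (possibly non-equidistant)
    # times t, onto the new time grid t_new.
    return interp1d(t, x, axis=0, fill_value="extrapolate")(t_new)

# Two sources with different, non-equidistant sampling:
t_a = np.array([0.0, 0.4, 1.1, 1.5, 2.3])
x_a = np.sin(t_a)
t_b = np.array([0.0, 0.5, 1.0, 1.7, 2.5])
x_b = np.cos(t_b)

# Common equidistant grid on which both series can be combined:
t_common = np.linspace(0.0, 2.0, 21)
x_a_resampled = resample(t_a, x_a, t_common)
x_b_resampled = resample(t_b, x_b, t_common)
```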
The reasoning behind pipeline-integrated transformations is that they take only around 10 minutes on the whole dataset I use. Hence, I am happy to perform them prior to each training run instead of deriving a dedicated, preprocessed dataset once.
I am aware of similar questions here (like this). I am also aware of TensorFlow Transform and Keras preprocessing layers, but none of those options allows for, e.g., interpolation. There exists a TF implementation of interpolation, but unfortunately it only works on an equidistant grid. An interesting implementation of interpolation in TensorFlow is this one; however, I would much prefer to use the existing implementations in SciPy or NumPy.
What is your workflow for implementing preprocessing steps that are easy with NumPy and the like, when one-off performance is not crucial? Maybe using a custom data loader is in fact easier than relying on tf.data.Dataset for those preprocessing steps?
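To show what I mean by pipeline-integrated, here is a sketch of the kind of integration I have in mind, wrapping the SciPy-based resample helper from above with tf.numpy_function; the input tensors and shapes are placeholders, and I am not sure whether this is the idiomatic approach:

```python
import numpy as np
import tensorflow as tf
from scipy.interpolate import interp1d

t_common = np.linspace(0.0, 2.0, 21)  # common target grid (illustrative)

def resample(t, x, t_new):
    return interp1d(t, x, axis=0, fill_value="extrapolate")(t_new)

def tf_resample(t, x):
    # tf.numpy_function executes the Python/NumPy code eagerly, outside
    # the graph, so static shape information is lost and must be restored.
    x_new = tf.numpy_function(
        lambda t_, x_: resample(t_, x_, t_common).astype(np.float32),
        inp=[t, x],
        Tout=tf.float32,
    )
    x_new.set_shape([len(t_common)])
    return x_new

# `times` and `values` stand in for the per-series tensors of my sources.
times = tf.constant([[0.0, 0.4, 1.1, 1.5, 2.3]])
values = tf.constant([[0.0, 0.39, 0.89, 1.0, 0.75]])
dataset = (
    tf.data.Dataset.from_tensor_slices((times, values))
    .map(tf_resample, num_parallel_calls=tf.data.AUTOTUNE)
)
```

Is something along these lines the recommended way, or is it considered an anti-pattern compared to a custom data loader?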
Thanks and best wishes!