I’m trying to build a data pipeline with tf.data that produces time-series windows of the past 5 rows. I have more than 6,000 different CSV files, so they cannot fit in memory, and they cannot be preprocessed and saved to disk because of their size.
To read the ~6,000 CSVs I’m using tf.data.experimental.make_csv_dataset
and applying an overlapping window with the window method. The closest I’ve gotten so far is:
import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="/path/stock/*1min*.csv",
    batch_size=1,
    num_epochs=1,
    shuffle=False,
    header=False,
    column_names=['timestamp', 'open', 'high', 'low', 'close', 'volume'],
    column_defaults=[tf.string, tf.float32, tf.float32, tf.float32, tf.float32, tf.float32],
).window(
    size=5,   # number of rows per window
    shift=1,  # stride between window starts, so windows overlap
    stride=1,
)
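As far as I can tell, each element of this windowed dataset is a dict mapping each column name to a small sub-dataset of up to 5 rows, which is the structure I’m struggling to map over. This is how I’ve been inspecting it:

    for window in dataset.take(1):
        print(type(window))             # a dict: column name -> sub-dataset
        for name, sub in window.items():
            print(name, sub)            # each value is itself a tf.data.Dataset of up to 5 rows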
Ideally I should end up with tensors of shape (M, 5 timesteps, 4 features [open, low, high, close]).
I’m trying to write a map function that takes this dataset of datasets and converts it into something I can feed to the model.
The issues I’m facing are:
1. I’m not able to write the map function that builds and returns the arrays for each timestep (see the untested sketch below for the kind of transformation I’m after).
2. I haven’t found a way to group the timesteps by date; I don’t want to build windows that span multiple days.
3. Even if I attach a shuffle to the windowed dataset, it seems to shuffle rows from only a single file rather than across the whole dataset.
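For illustration, the kind of flattening I’m imagining is roughly the sketch below. It is untested: the FEATURES ordering and the window_to_tensor name are my own, and the flat_map/zip approach is just my guess at how this could work, not something I have running.

    FEATURES = ['open', 'high', 'low', 'close']  # feature order is my choice

    def window_to_tensor(window_dict):
        # window_dict maps each column name to a tf.data.Dataset holding up to 5 rows,
        # each of shape (1,) because make_csv_dataset was called with batch_size=1.
        zipped = tf.data.Dataset.zip({name: window_dict[name] for name in FEATURES})
        batched = zipped.batch(5, drop_remainder=True)  # collect the 5 rows of the window
        # Stack the four price columns into a single (5, 4) tensor per window.
        return batched.map(
            lambda row: tf.stack([tf.reshape(row[name], [-1]) for name in FEATURES], axis=-1)
        )

    windows = dataset.flat_map(window_to_tensor)
    # windows.batch(32) would then yield tensors of shape (32, 5, 4), but this still
    # does not address the per-day grouping or the cross-file shuffling problems.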
How can I solve these issues? Is there a better way to accomplish this task? Any help is greatly appreciated.