Keras model does not learn when using a tf.data pipeline

A Keras model learns when given NumPy arrays as input, but makes no progress when reading the same data from a tf.data pipeline. What could be the reason?

In particular, the model consumes batched multidimensional time series (so every point is an N x M tensor) and solves a classification problem. If the data are prepared in advance by aggregating the time series into one large NumPy array, the model learns successfully, as indicated by a significant increase in accuracy. However, when exactly the same input data are prepared using a tf.data pipeline, the accuracy remains at the baseline level.

I compared the two sets of data by writing to disk, and they are identical. Also the types match.

Tried disabling threading (IIUC) by setting

options.threading.private_threadpool_size = 1

and experimenting with a bunch of options.experimental_optimization options.

Could it be the case that the data are read in parallel from the tf.data dataset as opposed to being read sequentially from the Numpy array?

For completeness, here’s the pipeline, where np_array contains “raw” data:

ds = tf.data.Dataset.from_tensor_slices(np_array.T)
y_ds = (
    ds
    .skip(T - 1)
    .map(lambda s: s[-1] - 1)
    .map(lambda y: to_categorical(y, 3))
)
X_ds = (
    ds
    .map(lambda s: s[:n_features])
    .window(T, shift=1, drop_remainder=True)
    .flat_map(lambda x: x.batch(T, drop_remainder=True))
    .map(lambda x: tf.expand_dims(x, -1))
)
Xy_ds = (
    tf.data.Dataset.zip(X_ds, y_ds)
    .batch(size_batch)
    .repeat(n_epochs * size_batch)
    .prefetch(tf.data.AUTOTUNE)
)

and how fit() is called (the steps_per_epoch value is correct)

model.fit(
    Xy_train,
    epochs=n_epochs,
    steps_per_epoch=199,
    verbose=2
)

Hi @Barbara, if possible, please provide the sample data that you used for training, along with the complete code to reproduce the issue.

I don’t think that is the reason the model is not learning; reading data in parallel only reduces the idle time of the hardware device.

Have you performed the same preprocessing when creating the NumPy array and the dataset pipeline? Thank you.


Hey Kiran, thank you very much for your interest! The problem arose because fit() implicitly shuffles the samples when it is given NumPy arrays (shuffle=True is the default), whereas the tf.data pipeline applies no shuffling at all. Shuffling is appropriate even for time series here, since the data were pre-aggregated into windows: only whole windows are reordered, so the temporal structure inside each window is preserved.
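
For anyone hitting the same issue, here is a minimal sketch of how shuffling might be added to a pipeline like the one in the question. The toy sizes and the synthetic np_array are assumptions made up for illustration, and tf.one_hot is swapped in for to_categorical so the snippet does not depend on a particular Keras version. The key point is that shuffle() is applied after the windows and labels are zipped and before batch(), so whole (window, label) pairs are reordered while each window keeps its internal order.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy sizes; the thread does not state the real ones.
n_features, n_steps, T, size_batch = 2, 20, 4, 8

# Toy "raw" data laid out like np_array in the question:
# n_features feature rows followed by one label row with values in {1, 2, 3}.
# Every feature row simply holds the time index, which makes ordering easy to check.
time_index = np.arange(n_steps, dtype=np.float32)
labels = (np.arange(n_steps) % 3 + 1).astype(np.float32)
np_array = np.vstack([np.tile(time_index, (n_features, 1)), labels])

ds = tf.data.Dataset.from_tensor_slices(np_array.T)

# Label of each window = class at the window's last time step, one-hot encoded.
y_ds = ds.skip(T - 1).map(lambda s: tf.one_hot(tf.cast(s[-1] - 1, tf.int32), 3))

# Sliding windows of T consecutive steps, shaped (T, n_features, 1).
X_ds = (
    ds.map(lambda s: s[:n_features])
    .window(T, shift=1, drop_remainder=True)
    .flat_map(lambda w: w.batch(T, drop_remainder=True))
    .map(lambda x: tf.expand_dims(x, -1))
)

n_windows = n_steps - T + 1
Xy_ds = (
    tf.data.Dataset.zip((X_ds, y_ds))
    # Shuffle whole (window, label) pairs; a buffer covering all windows
    # gives a full shuffle, reshuffled on every pass over the data.
    .shuffle(buffer_size=n_windows)
    .batch(size_batch)
    .prefetch(tf.data.AUTOTUNE)
)

X_batch, y_batch = next(iter(Xy_ds))
```

With real data, a buffer as large as the number of windows may not fit in memory; a smaller buffer_size still shuffles, only less thoroughly.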
