A Keras model learns when fed NumPy arrays as input, but makes no progress when reading the same data from a `tf.data` pipeline. What could potentially be the reason?
In particular, the model consumes batched multidimensional time series (so every point is an N x M tensor) and solves a classification problem. If the data are prepared in advance, by aggregating the time series into a large NumPy array, the model learns successfully, as indicated by a significant increase in accuracy. However, when exactly the same input data are prepared with a `tf.data` pipeline, the accuracy remains at the baseline level.
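For reference, the "prepared in advance" variant builds sliding windows over the series. The snippet below is only a simplified sketch of that preparation (not the exact code); it assumes, as in the pipeline further down, that the first `n_features` columns of `np_array.T` hold the features and the last column holds the class label:

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

series = np_array.T  # shape (n_points, n_columns); features first, label last

# One window of T consecutive points per sample,
# with the label taken at the window's last point
X_np = np.stack([
    series[i:i + T, :n_features]
    for i in range(len(series) - T + 1)
])[..., np.newaxis]                               # (n_windows, T, n_features, 1)
y_np = to_categorical(series[T - 1:, -1] - 1, 3)  # (n_windows, 3)
```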
I compared the two sets of data by writing them to disk, and they are identical; the dtypes match as well.
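The actual comparison went through files on disk; an in-memory sketch of the same kind of check (with `X_np`/`y_np` being the pre-built NumPy arrays and `Xy_ds` the pipeline shown below) would be:

```python
xs, ys = [], []
for X_batch, y_batch in Xy_ds.take(199):   # one epoch's worth of batches
    xs.append(X_batch.numpy())
    ys.append(y_batch.numpy())
X_from_ds = np.concatenate(xs)
y_from_ds = np.concatenate(ys)

print(np.array_equal(X_from_ds, X_np[:len(X_from_ds)]))
print(np.array_equal(y_from_ds, y_np[:len(y_from_ds)]))
print(X_from_ds.dtype, X_np.dtype, y_from_ds.dtype, y_np.dtype)
```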
I also tried disabling threading (if I understand it correctly) by setting `options.threading.private_threadpool_size = 1`, and experimented with a number of the `options.experimental_optimization` settings.
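The options were applied along these lines (the `map_parallelization` line is just one example of the optimization toggles I mean):

```python
import tensorflow as tf

options = tf.data.Options()
options.threading.private_threadpool_size = 1
# one example of the experimental_optimization toggles I experimented with:
options.experimental_optimization.map_parallelization = False
Xy_ds = Xy_ds.with_options(options)
```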
Could it be that the data are read in parallel from the `tf.data` dataset, as opposed to being read sequentially from the NumPy array?
For completeness, here is the pipeline, where `np_array` contains the "raw" data:
```python
ds = tf.data.Dataset.from_tensor_slices(np_array.T)

y_ds = (
    ds
    .skip(T - 1)
    .map(lambda s: s[-1] - 1)
    .map(lambda y: to_categorical(y, 3))
)

X_ds = (
    ds
    .map(lambda s: s[:n_features])
    .window(T, shift=1, drop_remainder=True)
    .flat_map(lambda x: x.batch(T, drop_remainder=True))
    .map(lambda x: tf.expand_dims(x, -1))
)

Xy_ds = (
    tf.data.Dataset.zip(X_ds, y_ds)
    .batch(size_batch)
    .repeat(n_epochs * size_batch)
    .prefetch(tf.data.AUTOTUNE)
)
```
And here is how `fit()` is called (the `steps_per_epoch` value is correct):
```python
model.fit(
    Xy_train,
    epochs=n_epochs,
    steps_per_epoch=199,
    verbose=2
)
```