Hi knowledgeable people,
I have one-hot encoded biological data of varying lengths as features and a simple float as the label. Because of the length differences I’m building a ragged tensor from my feature array and combining this tensor and the label NumPy array into one dataset object.
X = tf.ragged.constant(X_np, dtype=tf.int8, ragged_rank=1, row_splits_dtype=tf.int32)
train_dataset = tf.data.Dataset.from_tensor_slices((X, y))
train_dataset.element_spec
(TensorSpec(shape=(None, 4), dtype=tf.int8, name=None),
TensorSpec(shape=(), dtype=tf.float64, name=None))
I want to use this dataset for a k-fold cross-validation run. My plan was to use the dataset.window() method to split the dataset into k pieces, use one piece as the validation set, concatenate the others to form the training set, and repeat this k times. The documentation states that .window() returns a dataset of datasets that one can loop over. The simple example given there works like a charm, but with my own data it does not, and so far I can’t figure out why.
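In case it helps, here is the idea sketched on a toy flat dataset of integers (not my real data), which does work as I expect:

```python
import math

import tensorflow as tf

# Toy stand-in for the real (features, labels) dataset: the window()-based
# k-fold idea, sketched on a flat dataset of 20 integers.
num_splits = 5
ds = tf.data.Dataset.range(20)
fold_size = math.ceil(len(ds) / num_splits)  # 4

# Each element yielded by window() is itself a tf.data.Dataset here.
folds = list(ds.window(fold_size))

for k in range(num_splits):
    val_ds = folds[k]
    train_parts = [f for i, f in enumerate(folds) if i != k]
    train_ds = train_parts[0]
    for part in train_parts[1:]:
        train_ds = train_ds.concatenate(part)
    # train_ds now holds 16 elements, val_ds the remaining 4
```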
The code below creates the pieces, but trying to inspect the element_spec or to call the dataset method .concatenate raises an error.
for w in train_dataset.window(math.ceil(len(train_dataset) / num_splits)):
    print(w)
    w.element_spec
(<_VariantDataset element_spec=TensorSpec(shape=(None, 4), dtype=tf.int8, name=None)>, <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.float64, name=None)>)
...
AttributeError: 'tuple' object has no attribute 'element_spec'
The error message itself is clear: a tuple has no element_spec attribute. But why is w a tuple and not the expected dataset object?
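From reading around, I suspect window() is applied per component of the tuple structure, so each window is a tuple of per-component datasets. Re-zipping them with tf.data.Dataset.zip seems to work on a toy example, though I’m not sure this is the intended approach:

```python
import tensorflow as tf

# window() on a tuple-structured dataset appears to window each component
# separately, so every "window" is a tuple of per-component datasets.
# tf.data.Dataset.zip seems to put them back together; toy example only,
# not verified against my real data.
ds = tf.data.Dataset.from_tensor_slices((tf.range(10), tf.range(10, 20)))

rezipped_windows = []
for w in ds.window(5):
    # w is (features_window_dataset, labels_window_dataset)
    rezipped_windows.append(tf.data.Dataset.zip(w))

# each re-zipped window is again a dataset of (feature, label) pairs
print(rezipped_windows[0].element_spec)
```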
I’m probably overlooking something simple, or maybe I’m approaching this problem from the wrong angle altogether, so I would be highly appreciative if somebody could point me in a direction that works.
THANKS!
Richard
PS: I’m using a dataset because, due to the inhomogeneous feature array, I need the padded_batch function for model training. I could of course use NumPy arrays up to that point, but building the ragged tensor from the NumPy array is a very time-consuming task, so if possible I would prefer to do that only once.