Hi all,
Novice question here. I have a very large dataset that I want to feed into a tfdf model. How can I feed data from a generator into tfdf.keras.pd_dataframe_to_tf_dataset()? Of course, if there are better ways to turn a generator into a tfdf dataset than this method, I'd be interested to know.
tfdf.keras.pd_dataframe_to_tf_dataset is just a convenience method for when you already have a dataframe. If you are starting off with a generator, you can create a tf.data.Dataset directly from it with tf.data.Dataset.from_generator (see the tf.data.Dataset page in the TensorFlow v2.16.1 API docs), which might be a better fit. TF-DF datasets are not special, and any dataset object created for another Keras model should work.
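To make that concrete, here is a minimal sketch of building a dataset from a generator. The generator `row_generator`, the feature names `x1` and `x2`, and the batch size are all made up for illustration; your own generator would read from wherever your data lives. The TF-DF calls are left as comments so the snippet runs with TensorFlow alone.

```python
import numpy as np
import tensorflow as tf


def row_generator():
    """Hypothetical generator: yields one (features, label) example at a time."""
    rng = np.random.default_rng(0)
    for _ in range(1000):
        x1 = rng.normal()
        x2 = rng.normal()
        label = int(x1 + x2 > 0)
        yield {"x1": np.float32(x1), "x2": np.float32(x2)}, np.int64(label)


# Describe the structure, shape and dtype of what the generator yields.
output_signature = (
    {
        "x1": tf.TensorSpec(shape=(), dtype=tf.float32),
        "x2": tf.TensorSpec(shape=(), dtype=tf.float32),
    },
    tf.TensorSpec(shape=(), dtype=tf.int64),
)

dataset = tf.data.Dataset.from_generator(
    row_generator, output_signature=output_signature
)

# Keep only the first 500 examples, then batch them. The batch size here is
# an arbitrary choice for the example.
dataset = dataset.take(500).batch(100)

# Assumed usage with TF-DF (not executed here):
# import tensorflow_decision_forests as tfdf
# model = tfdf.keras.RandomForestModel()
# model.fit(dataset)
```

The `output_signature` has to match what the generator yields, so a dict of features plus a label here.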
There are the following caveats:
Your dataset still needs to fit in memory at training time. If it does not, you can do something like dataset = dataset.take(10000000) to keep only the first 10 million rows (the exact number will depend on the number of features and your memory capacity).
There are a few data sanitization steps happening in tfdf.keras.pd_dataframe_to_tf_dataset, like making sure feature names don't have spaces or other forbidden characters. You might want to consult the source code and copy that logic as appropriate.
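As a rough illustration of the second caveat, here is a sketch of cleaning feature names yourself before yielding them from a generator. This is not the logic from the TF-DF source: the rule used here (replace anything that is not a letter, digit or underscore) is an assumption, and the helper names `sanitize_feature_name` and `sanitize_features` are hypothetical. Check the actual source for the exact set of forbidden characters.

```python
import re


def sanitize_feature_name(name: str) -> str:
    """Replace every character that is not a letter, digit or underscore.

    Assumed rule for illustration only; TF-DF's own list of forbidden
    characters may differ.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", name)


def sanitize_features(features: dict) -> dict:
    """Return a copy of a feature dict with sanitized keys."""
    renamed = {sanitize_feature_name(k): v for k, v in features.items()}
    # Two different names can map to the same sanitized name; fail loudly
    # rather than silently dropping a feature.
    if len(renamed) != len(features):
        raise ValueError("Feature names collide after sanitization")
    return renamed


example = sanitize_features({"petal length (cm)": 1.4, "species": "setosa"})
print(example)
```

You could call something like `sanitize_features` inside your generator, so the names are already clean by the time they reach the dataset.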