I am working on a paper comparing Python libraries for machine learning and deep learning.
While trying to evaluate Keras and TensorFlow separately, I'm looking for information about TensorFlow methods or functions that can be used to preprocess datasets, similar to those included in scikit-learn (sklearn.preprocessing) or the Keras preprocessing layers, but I can't find anything beyond one-hot encoding for labels…
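For reference, this is the kind of thing I mean, a minimal one-hot sketch with made-up label values:

```python
import tensorflow as tf

# Integer class labels for four examples (toy data).
labels = tf.constant([0, 2, 1, 2])

# Encode them as one-hot vectors over 3 classes -> shape (4, 3).
one_hot = tf.one_hot(labels, depth=3)
```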
TensorFlow Transform (TFT) can actually even take in Keras preprocessing layers, with certain caveats. It uses Apache Beam to scale the pipeline and does a lot to help with pipeline reproducibility.
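As a rough sketch of what TFT gives you, the core of a TFT pipeline is a `preprocessing_fn`; its analyzers (vocabulary building, statistics) run as full-pass Beam jobs rather than in memory. The feature names `age`, `occupation` and `label` here are made-up placeholders, not anything mandated by the library:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Transform raw features; full-pass statistics are computed by Beam."""
    return {
        # Scale to zero mean / unit variance using dataset-wide statistics.
        "age_scaled": tft.scale_to_z_score(inputs["age"]),
        # Build a vocabulary over the whole dataset, then map strings to ids.
        "occupation_id": tft.compute_and_apply_vocabulary(inputs["occupation"]),
        "label": inputs["label"],
    }
```

The same transform graph produced here is then attached to the serving model, which is how TFT addresses training/serving skew.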
The scikit-learn comparison is especially interesting, as the design choices of an all in-memory approach vs. a streaming approach become quite apparent. The two do share a lot of commonalities, such as the goal of using the same pipeline at training time as at prediction time.

The definition of "pipeline" itself is quite overloaded in the TensorFlow ecosystem: how a TFX pipeline and a TFT pipeline differ, and how they relate to each other, is an interesting point. For example, if I remember correctly, make_column_selector in scikit-learn can be integrated directly into a scikit-learn pipeline, whereas in TFX, TensorFlow Data Validation (TFDV) handles inferring the schema, and TFT consumes and enriches that schema, along with other artifacts, for use downstream (see the sketches below). As such, TFX feels much more decoupled, and necessarily more complex and powerful, with a steeper learning curve.
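To make the contrast concrete, here is a minimal scikit-learn sketch in which column selection is just another step inside the in-memory pipeline; the choice of StandardScaler and OneHotEncoder is illustrative, not prescriptive:

```python
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column selection lives inside the pipeline itself.
preprocess = ColumnTransformer(transformers=[
    ("num", StandardScaler(), make_column_selector(dtype_include="number")),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_include=object)),
])

pipe = Pipeline(steps=[("preprocess", preprocess),
                       ("model", LogisticRegression())])

# The exact same fitted pipeline serves training and prediction:
# pipe.fit(X_train, y_train); pipe.predict(X_new)
```

On the TFX side, by contrast, the schema is produced by a separate TFDV step and handed downstream as an artifact rather than living inside one pipeline object. A minimal sketch (the CSV file name is assumed):

```python
import tensorflow_data_validation as tfdv

# Compute dataset statistics, then infer a schema artifact from them;
# TFT and other components consume this schema downstream.
stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
schema = tfdv.infer_schema(statistics=stats)
```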
Hopefully this is enough to get you started; let me know if you need any further information.