Hi Kader,
Thanks for the enthusiasm and the great questions!
How does this new TensorFlow Decision Forests library differ from the tree-based algorithms we already have in the tf.estimator module?
There are two main differences: API and algorithms.
The API:
TF-DF uses the Keras API, while tf.estimator.BoostedTrees uses the TF1 Estimator API. We think TF-DF is simpler to use (no need to create feature columns, no input_fn, etc.) and easier to compose (e.g. stacking models with tf.keras.Sequential, or using a TF-Hub embedding for pre-processing).
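To give a feel for the "no feature columns, no input_fn" point, here is a minimal sketch of a training run. It assumes pandas and the tensorflow_decision_forests package are installed; the toy columns and the helper names (make_toy_columns, train_random_forest) are made up for this example.

```python
def make_toy_columns():
    """Return a tiny, made-up tabular dataset as a dict of columns."""
    return {
        "age": [22, 35, 47, 51, 29, 63, 40, 33],
        "city": ["paris", "lyon", "paris", "nice", "lyon", "nice", "paris", "lyon"],
        "label": [0, 1, 1, 1, 0, 1, 1, 0],
    }


def train_random_forest(columns, label="label"):
    """Train a TF-DF Random Forest directly on raw numerical and string columns."""
    # Imported inside the function so the data helper above stays usable
    # even on a machine where TF-DF is not installed.
    import pandas as pd
    import tensorflow_decision_forests as tfdf

    df = pd.DataFrame(columns)
    # Convert the dataframe to a tf.data.Dataset; no feature columns needed.
    ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label=label)
    model = tfdf.keras.RandomForestModel()
    model.fit(ds)
    return model
```

Calling train_random_forest(make_toy_columns()) returns a regular Keras model. Eight rows are far too few to learn anything useful; the data is only there to keep the sketch short.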
The algorithms:
TF-DF is a collection of algorithms, all implemented in C++. By default, it runs the classical/exact Random Forest and Gradient Boosted Machine algorithms, which are similar to the ones in scikit-learn or R's randomForest. With hyper-parameters, you can enable more recent techniques, similar to the ones used in XGBoost and LightGBM, and even some newer ones (e.g. sparse oblique trees work very well).
tf.estimator.BoostedTreesEstimator is implemented in TensorFlow and can be seen as an approximate Gradient Boosted Trees algorithm with a mini-batch training procedure, described in this paper. We didn't implement this algorithm in TF-DF because, in all our experiments and projects, one of the other algorithms performed better.
TF-DF and tf.estimator.BoostedTreesEstimator don't share any code.
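To make the "sparse oblique" idea concrete: a classical tree node compares one feature to a threshold, while an oblique node compares a weighted sum of a few features to a threshold. The snippet below is a plain-Python illustration of the two kinds of conditions, not the TF-DF implementation; the function names, feature names, and weights are invented for the example. In TF-DF itself this should be controlled by the split_axis hyper-parameter (value "SPARSE_OBLIQUE"), but please check the documentation for your version.

```python
def axis_aligned_split(example, feature, threshold):
    """Classical condition: compare a single feature to a threshold."""
    return example[feature] >= threshold


def sparse_oblique_split(example, weights, threshold):
    """Oblique condition: compare a sparse weighted sum of features to a threshold.

    `weights` maps only the few features used by this node to their
    coefficients; every other feature implicitly has weight 0.
    """
    projection = sum(w * example[name] for name, w in weights.items())
    return projection >= threshold


# Toy example with made-up feature names and values.
example = {"height": 2.0, "width": 3.0, "depth": 10.0}

# Axis-aligned: only "height" is tested (2.0 >= 2.5 is False).
print(axis_aligned_split(example, "height", 2.5))

# Oblique: 1.0 * height - 0.5 * width = 0.5, and 0.5 >= 0.0 is True.
print(sparse_oblique_split(example, {"height": 1.0, "width": -0.5}, 0.0))
```

An oblique condition can separate classes along a diagonal boundary in a single node, where axis-aligned conditions would need a staircase of several nodes.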
Also, does this new TF-DF library mean there is no more need for the ones from scikit-learn or even XGBoost?
Short answer: no!
There are many great decision forest libraries out there (XGBoost, CatBoost, LightGBM, scikit-learn, R gbm, R randomForest, R ranger, etc.), each one with a different set of algorithms and framework integrations. It is awesome to have such diversity.
In general, the right library is the one that can be used easily (e.g. depending on the infra constraints and modeling complexity) and gives good results (which might vary slightly between implementations and depend on the problem).
TF-DF focuses on Python or C++, and integrates well into the TensorFlow toolbox, which we believe can be compelling in many use-cases.
And last but not least, should we tag it tf-df or tfdf?
tf-df is the official short name. But https://tensorflow-prod.ospodiscourse.com/ does not support tags with "-", so let's use tfdf.