Hi Ahad,
The reason TF-DF (TensorFlow Decision Forests) feels strange, and the core answer to your questions, is that decision forest (DF) training algorithms are fundamentally different from the training algorithm used for neural networks (NN).
The main difference is that DFs do not consume gradients as input during training, and they do not propagate gradients from their outputs back to their inputs. This document gives more details. Strictly speaking, this is not entirely true: some research papers propose possible solutions, but they are outside the scope of classical DF training.
Back to your questions.
- Why wasn’t ensemble_nn_and_df trained?
The four sub-components (model1-4) are each trained. ensemble_nn_and_df is simply a concatenation of model1-4, so it does not have any “non-trained” parameters of its own.
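- How come I can evaluate ensemble_nn_and_df, which was not trained as a whole?
Same reason as for the first question: every parameter of ensemble_nn_and_df belongs to one of the already-trained sub-models.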
- I trained only ensemble_nn_and_df instead of training all its components, and the accuracy dropped compared to training them separately as you showed. What is the reason behind that?
TF-DF models can only be trained by calling the “.fit” method on the model itself.
Calling ensemble_nn_and_df.fit does not call fit on the sub-models. Therefore, only the NNs are trained.
Before being trained, a neural network returns “garbage” values that depend arbitrarily on the input features. For a TF-DF model, the situation is different: an untrained TF-DF model always returns “0”. In other words, if you only call fit on ensemble_nn_and_df, then ensemble_nn_and_df effectively only contains the neural networks.
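To make the effect concrete, here is a small toy calculation in plain Python. It assumes the ensemble combines its four sub-models by averaging their predictions (an assumption for this sketch), and all the numbers are made up for illustration. The helper name ensemble_mean is hypothetical.

```python
def ensemble_mean(predictions):
    """Average the sub-model predictions for one example."""
    return sum(predictions) / len(predictions)

# Made-up outputs for a single example whose true label is 1.
trained_nn = [0.9, 0.8]     # model1 and model2, trained through ensemble.fit
untrained_df = [0.0, 0.0]   # model3 and model4, whose own .fit was never called
trained_df = [0.85, 0.95]   # what the DFs might return once fitted (illustrative)

# Only the ensemble was fitted: the DF terms contribute nothing.
only_ensemble_fit = ensemble_mean(trained_nn + untrained_df)

# Every sub-model was fitted before evaluating the ensemble.
all_fitted = ensemble_mean(trained_nn + trained_df)

print(only_ensemble_fit)
print(all_fitted)
```

Under this averaging assumption, the zeros from the untrained DFs carry no information and also pull the averaged score down, which is one way the accuracy can drop.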
- In this example you mentioned a fine-tuning step, but no code example was given. Can you please elaborate on how one can fine-tune?
Because of the back-propagation limitation, a DF cannot be used to fine-tune a NN located before it. See paragraph “For this reasons, the classical RF algorithm cannot be used to train or fine-tune a neural network underneath…”.
However, if you have a NN and a DF in parallel (like in this tutorial), you can back-propagate through the NN to train the “learnable NN preprocessing”. See paragraph “In practice, such a preprocessing layer could either be a pre-trained embedding to fine-tune, or a randomly initialized neural network.”. This is exactly what is done in this tutorial (see the training of “preprocessor”).
If you were to replace model1 and model2 with pre-trained, trainable neural networks, the training of “ensemble_nn_only” would amount to fine-tuning.
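The training order described above can be sketched as a small helper. This is only a sketch under assumptions: the function name train_in_two_stages and its parameter names are hypothetical, ensemble_nn_only is assumed to be an already compiled Keras model that contains the preprocessor and the NN sub-models, train_dataset is assumed to be a tf.data.Dataset yielding (features, label) pairs, and df_models is assumed to be the list of TF-DF models (for example model3 and model4).

```python
def train_in_two_stages(ensemble_nn_only, preprocessor, df_models,
                        train_dataset, epochs=20):
    """Train the NN part with back-propagation, then fit each DF by itself.

    All arguments are assumptions of this sketch (see the text above).
    """
    # Stage 1: gradient-based training. This updates the NN sub-models and
    # the shared preprocessor. If those start from pre-trained weights,
    # this stage is the fine-tuning step.
    ensemble_nn_only.fit(train_dataset, epochs=epochs)

    # Stage 2: apply the now-trained preprocessor to the features, then
    # call fit on every DF individually. No gradient flows here.
    preprocessed = train_dataset.map(lambda x, y: (preprocessor(x), y))
    for df_model in df_models:
        df_model.fit(preprocessed)
```

After both stages, ensemble_nn_and_df can be evaluated without calling fit on it, since all of its sub-models are already trained.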
I hope this helps,
Mathieu