Feature selection and tree size in TF-DF

Question on feature selection:

I was looking at figure 6 of

It mentions feature subset selection at, say, 95% of cumulative loss reduction.

This was also a common question from leads in launch reviews of Ranklab models, along the lines of: "Can a much smaller number of features get almost as much accuracy as the full model?"

Curious to hear your thoughts on this, and what support TFDF might have for it.

Or, if you feel there is an established industry approach for this that we can build on top of TFDF, that would be useful as well.

There is another dimension to this question, about the choice of number of trees in the model. For instance, section 7 of the aforementioned paper claims:

We have presented a tradeoff between the number of boosted decision trees and accuracy. It is advantageous to keep the number of trees small to keep computation and memory contained.

Curious to hear your thoughts about this as well.

Hello @Gaurav_Chakravorty

Following the TensorFlow tutorials, you can approach feature subset selection by calculating feature importances with common metrics such as mean decrease in impurity. TFDF also offers built-in methods for computing these importances, which you can use to select a much smaller feature subset while retaining most of the accuracy.
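As a minimal sketch, inspecting the built-in importances and retraining on a reduced feature set could look like this (assuming a pandas DataFrame `df` with a "label" column; the available importance keys, such as "SUM_SCORE" used here, depend on the model type and TF-DF version):

```python
import tensorflow_decision_forests as tfdf

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="label")

# Train a full model and inspect per-feature importances.
model = tfdf.keras.GradientBoostedTreesModel()
model.fit(train_ds)

inspector = model.make_inspector()
# Dict of importance name -> list of (feature, value) pairs.
for name, importances in inspector.variable_importances().items():
    print(name, importances[:5])

# Retrain using only the top features (the choice of 10 is arbitrary here).
top_features = [f.name for f, _ in inspector.variable_importances()["SUM_SCORE"][:10]]
small_model = tfdf.keras.GradientBoostedTreesModel(
    features=[tfdf.keras.FeatureUsage(name=f) for f in top_features],
    exclude_non_specified_features=True,
)
small_model.fit(train_ds)
```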

Regarding the number of trees in the model, it is a trade-off: with more trees the model can learn more complex relationships, but the computation and memory cost grows and a boosted model can eventually start to overfit; with too few trees the model may underfit. So tuning the number of trees against accuracy is required.
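A rough way to see this trade-off is to sweep the `num_trees` hyper-parameter and compare accuracy (a sketch assuming the `train_ds` from above plus a held-out `test_ds`; the tree counts are arbitrary):

```python
import tensorflow_decision_forests as tfdf

for num_trees in [25, 50, 100, 300]:
    model = tfdf.keras.GradientBoostedTreesModel(num_trees=num_trees)
    model.fit(train_ds)
    model.compile(metrics=["accuracy"])
    evaluation = model.evaluate(test_ds, return_dict=True)
    print(f"num_trees={num_trees}: accuracy={evaluation['accuracy']:.4f}")
```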

In the current scenario, SHAP (Shapley) values can also be used to select important features from the dataset; there are other methods as well, such as the TCAV algorithm.
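As an illustration of SHAP-based feature ranking (the `shap` TreeExplainer does not read TF-DF models directly, so a scikit-learn gradient-boosted model stands in here; `X` and `y` are assumed to already exist):

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Fit a stand-in tree ensemble on existing features X and labels y.
model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Rank features by mean absolute SHAP value and keep the top 10.
mean_abs = np.abs(shap_values).mean(axis=0)
top_idx = np.argsort(mean_abs)[::-1][:10]
print("Top features by mean |SHAP|:", top_idx)
```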