Balancing Dominant Feature Importances

Hi there,

Is there an equivalent to xgboost’s colsample_by* parameters? The idea behind xgboost’s colsample_by* parameters is to specify the fraction of feature columns to subsample per tree, per level, and per node.
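For reference, this is roughly how those parameters are set in xgboost (a minimal sketch; the 0.8 values are just placeholders):

import xgboost as xgb

# Each colsample_by* value is the fraction of feature columns sampled
# at that granularity: per tree, per level, and per node.
clf = xgb.XGBClassifier(
    colsample_bytree=0.8,
    colsample_bylevel=0.8,
    colsample_bynode=0.8,
)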

I find my tfdf gradient boosted tree models become obsessed with certain features, and I was wondering if there is a way to balance out the importances. Although the performance is good on test data, I am trying to reduce the risk of one of those features going wrong in production and severely impacting my predictions.

Below is the way I am currently calculating importances. Perhaps I am doing something wrong here:

feature_importances = {}
# Each entry is a (feature, importance) tuple; feature[0] is the feature's name.
for feature, imp_score in model.make_inspector().variable_importances()["SUM_SCORE"]:
    feature_importances[feature[0]] = imp_score
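For context, the resulting dict can then be sorted to see which features dominate (a minimal sketch using the feature_importances dict built above):

# Sort features by SUM_SCORE, most important first.
for name, score in sorted(feature_importances.items(), key=lambda kv: kv[1], reverse=True):
    print(name, score)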

Thank you!

Hi @shayan_sadeghieh

Xgboost’s colsample_by* parameters are not available in the tfdf GBT model. However, the subsample parameter of tfdf.keras.GradientBoostedTreesModel provides a related form of subsampling: setting a smaller value for subsample reduces the correlation between trees, which can help balance the feature importances. Alternatively, you can reduce reliance on dominant features by preprocessing the data with correlation- or mutual-information-based feature selection.
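For example, a minimal sketch of setting subsample (assuming train_ds is a tf.data.Dataset, e.g. built with tfdf.keras.pd_dataframe_to_tf_dataset):

import tensorflow_decision_forests as tfdf

# Lower subsample values train each tree on a smaller random sample of
# the training data, which reduces the correlation between trees.
# 0.5 is only a placeholder value to illustrate the parameter.
model = tfdf.keras.GradientBoostedTreesModel(subsample=0.5)
model.fit(train_ds)  # train_ds: assumed tf.data.Dataset of training examples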

Thank You
