I am comparing a tfdf RandomForestModel, used for a regression task, against an sklearn RandomForestRegressor. The published hyperparameters for the sklearn model are:
- max_features=6
- n_estimators=50
- max_depth=None
- min_samples_split=2
I am not getting similar performance for the two models.
My constructors and fit calls, with the resulting metrics, are the following:
sk_rf_model = RandomForestRegressor(max_features=6, n_estimators=50, max_depth=None, min_samples_split=2)
sk_rf_model.fit(X_npy, y_npy, sample_weight=train_data['sample_weight'])
RMSE: 0.01954
MAE: 0.0059
tfdf_rf_model = tfdf.keras.RandomForestModel(num_trees=50, verbose=2, num_candidate_attributes=6, min_examples=2, max_depth=None, task=tfdf.keras.Task.REGRESSION, num_threads=1)
tfdf_rf_model.model_1.fit(x=X_time_space_npy, y=y_npy, sample_weight=train_data['sample_weight'].to_numpy())
RMSE: 0.02304
MAE: 0.0088
I set num_threads to 1 to compare single-threaded against single-threaded behavior, but that does not reduce the difference.
The previously published model I am comparing against uses the RF hyperparameters noted above.
Sklearn RandomForestRegressor Documentation Hyperparameters:
max_features {"sqrt", "log2", None}, int or float, default=1.0
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
- If "sqrt", then max_features=sqrt(n_features).
- If "log2", then max_features=log2(n_features).
- If None or 1.0, then max_features=n_features.
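To make the rules above concrete, here is a minimal sketch of how a max_features value maps to a number of features per split. The function name resolve_max_features and the feature count of 18 are hypothetical, and truncating the "sqrt" and "log2" results to an integer is my assumption; the quoted text does not say how they are rounded.

```python
import math


def resolve_max_features(max_features, n_features):
    """Number of features considered per split, following the quoted rules.

    Hypothetical helper for illustration; not part of sklearn.
    """
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features == "sqrt":
            # Assumption: the result is truncated to an int, with a floor of 1.
            return max(1, int(math.sqrt(n_features)))
        if max_features == "log2":
            # Assumption: same truncation as for "sqrt".
            return max(1, int(math.log2(n_features)))
        raise ValueError(f"unknown max_features: {max_features!r}")
    if isinstance(max_features, int):
        # An int is used as-is.
        return max_features
    # A float is a fraction of the available features.
    return max(1, int(max_features * n_features))


# 18 is a made-up feature count, used only for this example.
print(resolve_max_features(6, 18))
print(resolve_max_features(1.0, 18))
```

Under this reading, max_features=6 always means exactly 6 features per split, regardless of how many features the dataset has.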
min_samples_split int or float, default=2
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
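The same kind of sketch for min_samples_split, following only the quoted text. The helper name resolve_min_samples_split and the sample counts are hypothetical; any extra clamping the library may apply is not shown here.

```python
import math


def resolve_min_samples_split(min_samples_split, n_samples):
    """Minimum node size required before a split, per the quoted rules.

    Hypothetical helper for illustration; not part of sklearn.
    """
    if isinstance(min_samples_split, int):
        # An int is the minimum number itself.
        return min_samples_split
    # A float is a fraction of the training set size, rounded up.
    return math.ceil(min_samples_split * n_samples)


# Sample counts below are made up for the example.
print(resolve_min_samples_split(2, 1000))
print(resolve_min_samples_split(0.5, 1000))
```

With min_samples_split=2, any node holding at least 2 samples is still eligible to be split.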
TensorFlow Decision Forests Documentation Hyperparameters:
num_candidate_attributes
Number of unique valid attributes tested for each node. An attribute is valid if it has at least a valid split. If num_candidate_attributes=0, the value is set to the classical default value for Random Forest: sqrt(number of input attributes) in case of classification and number_of_input_attributes / 3 in case of regression. If num_candidate_attributes=-1, all the attributes are tested. Default: 0.
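For comparison, a minimal sketch of how num_candidate_attributes would resolve under the description above. The helper name, the task strings, and the attribute count of 18 are hypothetical, and rounding up with ceil is my assumption, since the quoted text does not say how the fractional defaults are rounded.

```python
import math


def resolve_num_candidate_attributes(num_candidate_attributes, n_attributes, task):
    """Number of attributes tested per node, per the quoted description.

    Hypothetical helper for illustration; not part of tfdf.
    `task` is either "classification" or "regression".
    """
    if num_candidate_attributes == -1:
        # -1 means every attribute is tested.
        return n_attributes
    if num_candidate_attributes == 0:
        # 0 selects the classical Random Forest default for the task.
        # Assumption: fractional results are rounded up.
        if task == "classification":
            return max(1, math.ceil(math.sqrt(n_attributes)))
        return max(1, math.ceil(n_attributes / 3))
    # Any other positive value is used directly.
    return num_candidate_attributes


# 18 is a made-up attribute count, used only for this example.
print(resolve_num_candidate_attributes(6, 18, "regression"))
print(resolve_num_candidate_attributes(0, 18, "regression"))
```

So num_candidate_attributes=6 looks like the direct counterpart of max_features=6, apart from the "valid attribute" wording in the description.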
min_examples
Minimum number of examples in a node. Default: 5.
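Reading the two descriptions literally, min_samples_split and min_examples may not constrain the same thing: one is a minimum size for the node being split, the other a minimum size for a node. The sketch below shows that literal reading only; both function names are hypothetical, and whether tfdf applies min_examples to each child of a split is my assumption, not something the quoted text confirms.

```python
def sklearn_style_split_allowed(n_node, min_samples_split=2):
    """Literal reading: the node being split must hold enough samples."""
    return n_node >= min_samples_split


def tfdf_style_split_allowed(n_left, n_right, min_examples=5):
    """Literal reading (assumption): every resulting node must hold
    at least min_examples examples."""
    return n_left >= min_examples and n_right >= min_examples


# A node with 2 samples split into two children of 1 sample each.
print(sklearn_style_split_allowed(2, min_samples_split=2))
print(tfdf_style_split_allowed(1, 1, min_examples=2))
```

If that reading is right, min_examples=2 would stop splitting earlier than min_samples_split=2, which would be one candidate explanation to check for the gap in metrics.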