I used both sklearn’s random forests and TF-DF on the same dataset. The results were very different between the two. Below is my configuration for the sklearn side.
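For reference, here is a minimal sketch of the kind of comparison I ran; the file and column names below are placeholders, not my actual setup:

```python
import pandas as pd
import tensorflow_decision_forests as tfdf
from sklearn.ensemble import RandomForestClassifier

train_df = pd.read_csv("train.csv")  # placeholder file name

# sklearn side: plain feature matrix and label vector.
sk_model = RandomForestClassifier()
sk_model.fit(train_df.drop(columns=["label"]), train_df["label"])

# TF-DF side: the same frame converted to a tf.data.Dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="label")
tf_model = tfdf.keras.RandomForestModel()
tf_model.fit(train_ds)
```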
While both sklearn and TF-DF implement the classical Random Forest algorithm, there are a few differences between the implementations. For this reason, the results (both the model structure and the model quality) are not expected to be exactly the same, but they should still be very close.
Following are some parameter values that should make sklearn’s RandomForestClassifier as close as possible to TF-DF’s Random Forest.
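Something like the sketch below should be close. It is based on TF-DF’s documented defaults; the mapping of TF-DF’s min_examples onto sklearn’s min_samples_leaf, and of its information-gain split score onto criterion="entropy", is approximate:

```python
from sklearn.ensemble import RandomForestClassifier

# Approximates TF-DF's RandomForestModel defaults in sklearn.
model = RandomForestClassifier(
    n_estimators=300,     # TF-DF default: num_trees=300
    max_depth=16,         # TF-DF default: max_depth=16
    min_samples_leaf=5,   # closest match to TF-DF's min_examples=5
    max_features="sqrt",  # TF-DF samples sqrt(num_features) per split for classification
    criterion="entropy",  # approximates TF-DF's information-gain split score
    bootstrap=True,       # both libraries bag with replacement (default in both)
)
```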
In addition, if the problem is a regression, make sure to have:

max_features = 1./3
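For instance, on the sklearn side (same assumptions as the classifier sketch above):

```python
from sklearn.ensemble import RandomForestRegressor

# Regression variant: TF-DF considers a third of the features at each split.
model = RandomForestRegressor(
    n_estimators=300,
    max_depth=16,
    min_samples_leaf=5,
    max_features=1./3,  # a float is interpreted as a fraction of the features
)
```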
If your dataset contains categorical or categorical-set features, there are no equivalent parameters in sklearn, as it does not support those types of features.
If the differences are large, it would be very interesting for us to look at them.

PS: Random Forest and Gradient Boosted Trees are different algorithms.
I set everything just like the code snippet given above. It’s intriguing, isn’t it?
The datasets I used for the two models were basically the same: all categorical (text) data was removed, and the targets (ground truth) were mapped to integer indices [0, 1, 2]. In short, the ingredients for sklearn and TF-DF were the same.
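Roughly, the preprocessing looked like this; the file and column names are placeholders:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # placeholder file name

# Map the ground-truth classes to integer indices [0, 1, 2].
classes = sorted(df["label"].astype(str).unique())
df["label"] = df["label"].astype(str).map({c: i for i, c in enumerate(classes)})

# Drop the remaining categorical (text) features so both libraries
# receive identical, purely numerical inputs.
X = df.drop(columns=["label"]).select_dtypes(exclude=["object"])
y = df["label"]
```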
Notice that the dataset is very imbalanced, but TF-DF did a very impressive job. This is very cool, but I don’t want to be fooled by the metrics; I just want to make sure the models work correctly. ^^
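To double-check that I’m not being fooled by plain accuracy on an imbalanced dataset, I’m looking at per-class metrics. Here is a toy illustration with made-up labels:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, classification_report

# Illustrative only: stand-in predictions for an imbalanced 3-class problem.
y_true = np.array([0] * 90 + [1] * 8 + [2] * 2)
y_pred = np.zeros_like(y_true)  # a model that always predicts the majority class

# Plain accuracy looks great (0.90), while balanced accuracy exposes the
# failure on the minority classes (~0.33).
print(classification_report(y_true, y_pred, zero_division=0))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```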
Just to clarify, the performance of sklearn’s and TensorFlow’s random forests is essentially the same. It was actually my fault in processing the data: I had removed the most important feature from the training data. In my case, sklearn’s side works a little better. Have a nice day!