Hey! I have only recently started with TF and ML in general and wanted to use a random forest on our dataset. I was pretty excited to see that there is finally something out for the newer TF versions (compared to having to run TF 1.15), along with a great guide for it. However, I am already struggling at the first steps (I am rather new to Docker and Linux as well).
Then, when trying to import it, it says it cannot find the decision-forests module. I looked at the newer releases of the decision forests package, but they only seem to support Python 3.7 or higher, while the container runs 3.6. When I tried upgrading Python it kind of broke everything else in the container, so I am not sure what to do. It would be great if anyone has a solution to this!
I see! I got it to work in Colab, which is fine for now. I am not sure how to add a validation set though (so I’d have train, validation and test). Any ideas?
The error was caused by the absence of the TF-DF pip package for py3.6. This is now solved. Thanks for the alert :).
Others might see the same error if they try to install TF-DF via pip on Windows or MacOS – we’re working on releasing those soon, and will update our Known Issues docs when we do!
Thanks Bhack for the answer. Following are some more details:
TL;DR: A validation set is not required for training (see the rationale below), and if you use one, you shouldn’t pass it to fit(); pass it to evaluate() instead.
Splitting your data into train/validation/test is a generally good practice for ML. The reason most people do this is to tune their training algorithm on held-out data to have better results without skewing their final test eval.
Decision forests generally deal with relatively small datasets, and TF-DF always internally holds out some parts of the training set to do something similar (stop training early if it looks like it will overfit). Because the datasets are small, it can be helpful to just train on all the examples from train + validation (concatenate them in the call to fit()).
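For example, a rough sketch assuming pandas DataFrames and the pd_dataframe_to_tf_dataset helper from the beginner colab (train_df, valid_df and label are placeholders for your own splits and label column):
import pandas as pd
import tensorflow_decision_forests as tfdf
# Train on train + validation together; TF-DF handles its own internal hold-out / self-evaluation.
full_train_df = pd.concat([train_df, valid_df], ignore_index=True)
model = tfdf.keras.RandomForestModel()
model.fit(tfdf.keras.pd_dataframe_to_tf_dataset(full_train_df, label=label))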
You can use the model self-evaluation (e.g. out-of-bag for random forest) to get the held-out evaluation that is done during training.
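For instance, a minimal sketch assuming a trained RandomForestModel named model:
# The model inspector exposes the self-evaluation computed during training
# (out-of-bag evaluation in the case of random forests), plus the training logs.
inspector = model.make_inspector()
print(inspector.evaluation())      # held-out / out-of-bag accuracy and loss
print(inspector.training_logs())   # evaluation as a function of the number of trees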
If you want to evaluate your model on the validation split for another reason (e.g. hyperparameter tuning), you should call model.evaluate(validation_ds) manually. TF-DF always trains for exactly one epoch, so the per-epoch evaluation you might expect from fit() with other TF Keras models won’t be what you get here.
Thank you Arvind for the extra details. So what happens when you pass the validation_data arg to fit() in this case?
I think users are naturally in the habit of passing the validation_data arg to fit(), so it could be nice to have a disclaimer about the unexpected effect of this arg in the example notebook or docs.
For now, nothing happens, unfortunately. This is different from a usual Keras model, where one would get a history of evaluations in the returned History object.
For now we have briefly documented the difference (see the fit method’s return value), but we are already working on fixing this; it should land in the next few days. (I notice now that we should also document this on the validation_data argument, as well as on the various models.)
The simple workaround for now is to call model.evaluate() on your validation data. Note that decision forests only train for one epoch, so you would only get one evaluation on the validation dataset anyway.
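For example, something along these lines (just a sketch; valid_df and label are placeholders for your own validation split and label column):
# Build a TF dataset from the validation split and evaluate the trained model on it.
valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_df, label=label)
model.compile(metrics=["accuracy"])
evaluation = model.evaluate(valid_ds, return_dict=True)
print(evaluation)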
I was able to fit my RandomForest model; however, when I try to convert it to TFLite format it throws an error.
The error is: InvalidArgumentError: Cannot convert a Tensor of dtype resource to a NumPy array.
Unfortunately TFLite does not yet support TF-DF models. We would definitely like to implement that if we see more need for it. Please, if you don’t mind, create an issue in our GitHub repository for that, so we can track others that may be interested in a TFLite version.
In the short term, for very fast/cheap inference of pure decision forest models, consider doing inference with the TF-DF C++ library called Yggdrasil. There is an example you can use to get started; it reads the TF-DF saved model that you trained in TensorFlow directly.
Decision forest models served in this fashion are often incredibly low-latency and low-cost. You can measure the serving speed without writing any code using the benchmark inference tool.
Just as an update: as of release 0.1.4, passing validation_data (or other forms of validation input) to Model.fit() leads to an evaluation at the end of the epoch, which is returned in the History object returned by Model.fit().
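A rough sketch of what that looks like (assuming train_ds and valid_ds were built with pd_dataframe_to_tf_dataset as in the colab):
# With TF-DF >= 0.1.4, validation_data is evaluated at the end of the single training epoch.
model = tfdf.keras.RandomForestModel()
history = model.fit(train_ds, validation_data=valid_ds)
# The History object contains one entry per epoch, i.e. exactly one for TF-DF models.
print(history.history)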
Okay, I’ll try again with better formatting this time:
Hey, it’s me again! Your input really helped, and I just wanted to quickly run this by you to see whether what I’m doing works as intended: I have one .csv that I split into training and testing, and another .csv that I want to use purely as a testing set to compare with the first. I started out as per your beginner tutorial with this
That makes sense. There are situations where having multiple test datasets, each with a different distribution, is useful :).
In your current formulation, two independent models are trained (model_1 and model_2), but neither of them is evaluated on a test dataset. Here is something closer to what you describe:
# We assume the setup from the beginner colab (https://www.tensorflow.org/decision_forests/tutorials/beginner_colab),
# i.e. tf, tfdf, "label" and split_dataset are already defined.
train_ds_pd, test_ds_pd = split_dataset(...)
# In addition, here is the second dataset you mentioned: The "purely testing" set.
# Note: "test_ds_pd" is also a pure testing set.
pure_test_ds_pd = ...
# Train the model on the train split.
model = tfdf.keras.RandomForestModel()
model.fit(tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label))
# Add some metrics for the model evaluation.
model.compile(metrics=[
    "accuracy",
    tf.keras.metrics.Recall(),
    tf.keras.metrics.Precision(),
    tf.keras.metrics.FalseNegatives(),
    tf.keras.metrics.FalsePositives(),
])
# Evaluate the model on the test split of the first dataset.
evaluation_on_test = model.evaluate(tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label))
# Evaluate the model on the second dataset i.e. "the pure test" one.
evaluation_on_pure_test = model.evaluate(tfdf.keras.pd_dataframe_to_tf_dataset(pure_test_ds_pd, label=label))
Right, for some reason I trained 2 models with the same parameters and then evaluated each set on one of them instead of evaluating both sets on the same model… Thanks a bunch for your fast reply!
Sorry for the many basic questions, but my dataset contains some numerical values and also a lot of booleans (represented as 0 and 1), which are used as numerical features by default. Is that an issue? If yes, how do I fix it? And if not, does it have any other implications (e.g. for the loss)?
As you correctly noted, TF-DF detects boolean features as numerical ones.
There is no impact (good or bad) on the quality or inference speed of the model.
However, this does slightly impact the training speed of the model. Yggdrasil Decision Forests (the core library behind TF-DF) supports boolean features natively, so they should be made available in TF-DF soon.
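In the meantime, if you ever want to override the automatically detected semantics of a feature (not needed here), you can declare them explicitly when building the model. A small sketch, assuming the tfdf.keras.FeatureUsage / FeatureSemantic API and a made-up feature name "is_active":
# Explicitly declare the semantics of one feature; all other features keep their
# automatically detected semantics because exclude_non_specified_features=False.
features = [
    tfdf.keras.FeatureUsage(name="is_active",
                            semantic=tfdf.keras.FeatureSemantic.CATEGORICAL),
]
model = tfdf.keras.RandomForestModel(features=features,
                                     exclude_non_specified_features=False)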
Ah thanks, hopefully the last one: how do I know which of my classes is the positive and which is the negative class in binary classification? And can I change or specify this somehow (other than switching the labels in the dataset)?