Hi all,
I am trying to train some decision forests using either tensorflow_decision_forests or the most recent API, ydf. My data is too large to load into a pandas DataFrame, so I am using tf.data.Datasets.
However, neither library seems designed to handle a long (40k+) vector of input features. With tensorflow_decision_forests, above a certain number of features I start getting "Large Unrolled Loop" errors. With ydf, it seems that every feature must be explicitly named (which would be very expensive to set up for a dataset with 90k+ features), and it does not appear to accept a plain vector.
Does anyone have suggestions? I used XGBoost in the past and it was much more straightforward.
Here is an example of code that crashes with tensorflow_decision_forests:
import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'
import tensorflow_decision_forests as tfdf
import numpy as np
import tensorflow as tf
# Example data
# Assuming features is a NumPy array of shape (num_samples, num_features)
# and labels is a NumPy array of shape (num_samples,)
num_samples = 8000
num_features = 90000
# Generate dummy data
features = np.random.random((num_samples, num_features))
features[0:1000, :] = 1
labels = np.array([1] * 1000 + [0] * (num_samples - 1000))  # Binary classification labels
# Create a tf.data.Dataset
def make_dataset(features, labels, batch_size=1024):
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.batch(batch_size)
    return dataset
# Create the dataset
dataset = make_dataset(features, labels)
# Define the model
model = tfdf.keras.RandomForestModel()
# Train the model
model.fit(dataset)
# Evaluate the model
evaluation = model.evaluate(dataset)
print(f"Evaluation: {evaluation}")
# Make predictions
predictions = model.predict(features)
print(f"Predictions: {predictions[:5]}")
Thank you!
Hi,
Like TF-DF, YDF should be able to natively consume multi-dimensional features without the need to split them into individual single-dimensional ones. This should also be much more efficient in YDF than in TF-DF (there should not be any lag with 90k features).
You can see an example here: Multi-dimensional - YDF documentation
Pandas does not work well with such multi-dimensional features. In this case, I find it easier to represent the dataset as a dictionary of NumPy arrays.
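For concreteness, here is a minimal sketch of what that can look like (the sizes and the choice of learner are illustrative, not from this thread):

import numpy as np
import ydf

# Illustrative sizes only; adjust to your data.
num_samples, num_features = 1000, 90000

# A dictionary of NumPy arrays: "features" is one multi-dimensional column.
train_ds = {
    "features": np.random.random((num_samples, num_features)).astype(np.float32),
    "label": np.random.randint(0, 2, size=num_samples),  # binary labels
}

# YDF ingests the 2D "features" array as a single multi-dimensional feature,
# so there is no need to name 90k columns individually.
model = ydf.RandomForestLearner(label="label").train(train_ds)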
90k features is a lot. If your dataset also has a lot of training examples, using distributed training might be necessary :).
Hope this helps.
Hi Mathieu,
Thanks a lot for the help; it was crucial to getting training to work. I reduced the number of features and examples and it works, so I suppose it was just a memory issue.
However, I am now encountering issues with prediction. If I pass a NumPy array in the same format as the training one, it asks for a dictionary in which each feature is named individually, for instance:
ValueError: The data spec expects columns 'features.00000_of_19831' which was not found in the data. Available columns: ['features'].
Is there a way to get predictions directly from a NumPy array?
Thank you!
I suspect the error message is not describing the real problem. “features.00000_of_19831” is a virtual name; the real feature name is “features”, which you do have according to the error message. Instead, I suspect YDF is failing to recognize and ingest “features” for some other reason.
Try calling predict on the same dataset as used for training. For instance:
train_ds = {
    "label": np.array(...),
    "features": np.array(...),
}
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
predictions = model.predict(train_ds)
If this works, the error could be caused by a discrepancy between the dtype or shape of your training and testing datasets. For example, make sure the “features” NumPy array in the test dataset has the same shape[1] as the one in the train dataset.
If this does not work, please share a code snippet, as it could help debug this issue.
Thanks again, and you are exactly right.
It turns out that the input must be a NumPy array of shape (1, n_features) rather than (n_features,), and that the error message is just misleading.
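For reference, a sketch of the fix (names carried over from the earlier snippets): reshape the single example to two dimensions before calling predict.

# The "features" array must have shape (1, n_features), not (n_features,).
single_example = np.random.random(num_features).astype(np.float32)
predictions = model.predict({"features": single_example.reshape(1, -1)})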
Great.
And we will improve the error message so other users benefit from your report.