Hi all,
I am trying to train some decision forests using either tensorflow_decision_forests or the most recent API, ydf. My data is too large to load into a pandas DataFrame, so I am using tf.data.Datasets.
However, neither library seems designed to handle a long (40k+) vector of input features. With tensorflow_decision_forests, above a certain number of features I start getting "Large Unrolled Loop" errors. With ydf, it seems that every feature must be explicitly named (which would be very expensive to set up for a dataset with 90k+ features), and it does not appear to accept a plain vector.
Does anyone have suggestions? I used XGBoost in the past and it was much more straightforward.
Here is an example of code that crashes with tensorflow_decision_forests:
import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'
import tensorflow_decision_forests as tfdf
import numpy as np
import tensorflow as tf
# Example data
# Assuming features is a NumPy array of shape (num_samples, num_features)
# and labels is a NumPy array of shape (num_samples,)
num_samples = 8000
num_features = 90000
# Generate dummy data
features = np.random.random((num_samples, num_features))
features[0:1000, :] = 1
labels = np.array([1] * 1000 + [0] * (num_samples - 1000))  # Binary classification labels
# Create a tf.data.Dataset
def make_dataset(features, labels, batch_size=1024):
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.batch(batch_size)
    return dataset
# Create the dataset
dataset = make_dataset(features, labels)
# Define the model
model = tfdf.keras.RandomForestModel()
# Train the model
model.fit(dataset)
# Evaluate the model
evaluation = model.evaluate(dataset)
print(f"Evaluation: {evaluation}")
# Make predictions
predictions = model.predict(features)
print(f"Predictions: {predictions[:5]}")
Thank you!
Hi,
Like TF-DF, YDF should be able to natively consume multi-dimensional features without the need to split them into individual single-dimensional ones. This should also be much more efficient in YDF than in TF-DF (there should not be any lag with 90k features).
You can see an example here: Multi-dimensional - YDF documentation
Pandas does not work well with such multi-dimensional features. In this case, I find it easier to represent the dataset as a dictionary of NumPy arrays.
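For concreteness, here is a minimal sketch of what that can look like (the sizes and the choice of learner are illustrative, not from this thread):

import numpy as np
import ydf

# Illustrative sizes only; adjust to your data.
num_samples, num_features = 1000, 90000

# A dictionary of NumPy arrays: "features" is one multi-dimensional column.
train_ds = {
    "features": np.random.random((num_samples, num_features)).astype(np.float32),
    "label": np.random.randint(0, 2, size=num_samples),  # binary labels
}

# YDF ingests the 2D "features" array as a single multi-dimensional feature,
# so there is no need to name 90k columns individually.
model = ydf.RandomForestLearner(label="label").train(train_ds)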
90k features is a lot. If your dataset also has a lot of training examples, using distributed training might be necessary :).
Hope this helps.
Hi Mathieu,
Thanks a lot for the help; it was crucial to getting training to work. I reduced the number of features and examples and it works, so I suppose it was just a memory issue.
However, I am now encountering issues with prediction. If I pass a NumPy array in the same format as the training one, it asks for a dictionary in which each feature is named individually, for instance:
ValueError: The data spec expects columns 'features.00000_of_19831' which was not found in the data. Available columns: ['features'].
Is there a way to get predictions directly from a NumPy array?
Thank you!
I suspect the error message is not describing the real problem. “features.00000_of_19831” is a virtual name; the real feature name is “features”, which you do have according to the error message. Instead, I suspect YDF is failing to recognize and ingest “features” for some other reason.
Try calling predict on the same dataset as used for training. For instance:
train_ds = {
    "label": np.array(...),
    "features": np.array(...),
}
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
predictions = model.predict(train_ds)
If this works, the error could be caused by a discrepancy between the dtype or shape of your training and testing datasets. For example, make sure the “features” NumPy array in the test dataset has the same shape[1] as the one in the train dataset.
If this does not work, please share a code snippet, as it could help debug this issue.
Thanks again, and you are exactly right.
It turns out that the input must be a NumPy array of shape (1, n_features) rather than (n_features,), and that the error message is just misleading.
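For reference, a sketch of the fix (names carried over from the earlier snippets): reshape the single example to two dimensions before calling predict.

# The "features" array must have shape (1, n_features), not (n_features,).
single_example = np.random.random(num_features).astype(np.float32)
predictions = model.predict({"features": single_example.reshape(1, -1)})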
Great.
And we will improve the error message so other users benefit from your report.