Hi Matt,
Your description of the situation is correct.
As you noted, you can specify the semantic (numerical, categorical) of the input features using the features argument. However, this only works if the input features are presented as a dictionary.
A solution is to separate the numerical and categorical features before feeding them into the model. You will end up with a model that consumes dictionaries. If you need your model to consume a feature matrix, you can then group the “separation logic” and the “dictionary model” into a new supermodel using the Keras functional API.
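As a minimal sketch of the "separation logic" step only, the split can be applied in the tf.data pipeline so that the dataset yields dictionaries. The names split_features, feature_matrix, matrix_dataset and dict_dataset are hypothetical and chosen for illustration; the data is the same toy matrix as in the example further down (two numerical columns followed by one categorical column).

```python
import tensorflow as tf

# Toy data (assumption: two numerical columns, then one categorical column).
feature_matrix = [[1.1, 4.1, 1],
                  [3.1, 5.1, 2],
                  [2.1, 6.1, 3],
                  [3.1, 7.1, 4]]
labels = [0, 1, 0, 1]


def split_features(matrix):
    """Splits a [batch, 3] feature matrix into a dictionary of features."""
    return {
        "numerical_features": matrix[:, :2],
        "categorical_features": tf.cast(matrix[:, 2:], tf.int32),
    }


matrix_dataset = tf.data.Dataset.from_tensor_slices(
    (feature_matrix, labels)).batch(2)

# Apply the split in the input pipeline, outside of the model.
dict_dataset = matrix_dataset.map(lambda x, y: (split_features(x), y))

# Each element is now a (dictionary of features, labels) pair.
print(dict_dataset.element_spec[0])
```

A dataset shaped like dict_dataset should then be usable to train a model that is given the features list but no preprocessing argument, since the feature names in the dictionary match the names declared in the FeatureUsage entries.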
Alternatively, an equivalent but simpler solution is to use the preprocessing argument, available in all TF-DF models, to inject the separation logic inside the model.
Here is an example:
import tensorflow as tf
import tensorflow_decision_forests as tfdf

features = [[1.1, 4.1, 1],
            [3.1, 5.1, 2],
            [2.1, 6.1, 3],
            [3.1, 7.1, 4]]
labels = [0, 1, 0, 1]

# A matrix training dataset.
tf_dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)
def preprocessing(features):
    """Splits the feature matrix into a dictionary of features."""
    # The first two columns are numerical.
    numerical_features = features[:, :2]
    # The remaining column (the last one) is categorical.
    categorical_features = features[:, 2:]
    return {"numerical_features": numerical_features,
            "categorical_features": tf.cast(categorical_features, tf.int32)}
# Specify the semantic of the features.
features = [
    tfdf.keras.FeatureUsage(name="numerical_features", semantic=tfdf.keras.FeatureSemantic.NUMERICAL),
    tfdf.keras.FeatureUsage(name="categorical_features", semantic=tfdf.keras.FeatureSemantic.CATEGORICAL),
]
model = tfdf.keras.GradientBoostedTreesModel(
    verbose=2,
    preprocessing=preprocessing,
    features=features)
model.fit(tf_dataset)
Here is the part of the training logs that describes the dataset:
Training dataset:
Number of records: 4
Number of columns: 4
Number of columns by type:
    CATEGORICAL: 2 (50%)
    NUMERICAL: 2 (50%)
Columns:
CATEGORICAL: 2 (50%)
    0: "categorical_features" CATEGORICAL integerized vocab-size:6 no-ood-item
    3: "__LABEL" CATEGORICAL integerized vocab-size:3 no-ood-item
NUMERICAL: 2 (50%)
    1: "numerical_features.0" NUMERICAL mean:2.35 min:1.1 max:3.1 sd:0.829156
    2: "numerical_features.1" NUMERICAL mean:5.6 min:4.1 max:7.1 sd:1.11803
Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute which type is manually defined by the user i.e. the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
You can see that 2 columns are considered NUMERICAL (the two numerical features) and the 2 others are CATEGORICAL (the categorical feature and the label).
I hope this helps.
M.