Hello,
I am doing some preprocessing in the model with tabular data. I have many features, some categorical and some numerical. For the numerical ones, the Load CSV data | TensorFlow Core tutorial advises concatenating first and then normalizing. Why not the opposite: normalize each feature, then concatenate? What are the tradeoffs here?
How do I make sure the proper mean is applied to the proper feature? Is the feature order the only thing that ties them together?
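To illustrate what I mean, here is a minimal sketch (with made-up values, not my real data): when `Normalization(axis=-1)` is adapted on the concatenated columns, it seems to compute one mean/variance per column, so each feature is standardized independently, and the mapping relies entirely on column order:

```python
import numpy as np
import tensorflow as tf

# Two toy features with very different scales, concatenated column-wise.
f1 = np.array([0., 10., 20., 30.], dtype=np.float32)[:, None]
f2 = np.array([0., 70., 100., 30.], dtype=np.float32)[:, None]
concatenated = np.concatenate([f1, f2], axis=1)  # shape (4, 2)

# With axis=-1, adapt() keeps the last axis, so it computes a separate
# mean and variance for each of the two columns.
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(concatenated)

out = norm(concatenated).numpy()
# Each column is standardized on its own: per-column mean ~0, std ~1.
print(out.mean(axis=0))
print(out.std(axis=0))
```

So as far as I can tell, normalize-then-concatenate and concatenate-then-normalize give the same numbers, as long as the column order at adapt time matches the order at call time.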
Should I follow the same order (concatenate then normalize) if I use Discretization instead of Normalization?
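My concern with Discretization is that, if I understand the docs correctly, `bin_boundaries` is a single flat list applied to every element, so per-feature bins would need one layer per feature applied before concatenation. A sketch of what I mean, with the same toy boundaries as in my code below:

```python
import tensorflow as tf

# Assumption: Discretization applies one set of bin_boundaries to every
# element, so per-feature bins mean one layer per feature, then concatenate.
bins = {'f1': [0., 10., 20., 30.], 'f2': [0., 70., 100.]}

inputs = {name: tf.keras.Input(shape=(1,), name=name) for name in bins}
discretized = [
    tf.keras.layers.Discretization(bin_boundaries=b)(inputs[name])
    for name, b in sorted(bins.items())
]
output = tf.keras.layers.Concatenate(axis=-1)(discretized)
model = tf.keras.Model(inputs, output)

# f1=15 falls in bucket [10, 20) -> index 2; f2=80 in [70, 100) -> index 2.
print(model({'f1': tf.constant([[15.]]), 'f2': tf.constant([[80.]])}))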
Here is the code I use (with concatenate then normalize):
import numpy as np
import pandas as pd
import tensorflow as tf

numeric_features = df[numerical_features_names]
numeric_features_dict = {
    key: value.to_numpy()[:, tf.newaxis]
    for key, value in dict(numeric_features).items()
}

normalize_num = False
if normalize_num:
    layer1 = tf.keras.layers.Normalization(axis=-1)
    # adapt on the concatenated columns, in sorted feature order
    layer1.adapt(
        np.concatenate(
            [value for key, value in sorted(numeric_features_dict.items())],
            axis=1,
        )
    )
else:
    layer1_discretization_params_dict = {
        'f1': [0, 10, 20, 30],
        'f2': [0, 70, 100],
    }
    layer1 = tf.keras.layers.Discretization(
        bin_boundaries=[
            layer1_discretization_params_dict[key]
            for key, value in sorted(numeric_features_dict.items())
        ]
    )

numeric_inputs = []
for name in numerical_features_names:
    # inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)
    numeric_inputs.append(inputs[name])
numeric_inputs = tf.keras.layers.Concatenate(axis=-1)(numeric_inputs)
numeric_normalized = layer1(numeric_inputs)
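For comparison, here is a sketch of the opposite order I am asking about (normalize then concatenate), again with made-up feature values `f1`/`f2` rather than my real data:

```python
import numpy as np
import tensorflow as tf

# Opposite order: one Normalization layer per feature, each adapted on
# that feature alone, then concatenate the normalized outputs.
features = {
    'f1': np.array([0., 10., 20., 30.], dtype=np.float32)[:, None],
    'f2': np.array([0., 70., 100., 30.], dtype=np.float32)[:, None],
}

inputs, normalized = {}, []
for name, values in sorted(features.items()):
    inputs[name] = tf.keras.Input(shape=(1,), name=name)
    layer = tf.keras.layers.Normalization(axis=-1)
    layer.adapt(values)  # per-feature mean/variance
    normalized.append(layer(inputs[name]))

output = tf.keras.layers.Concatenate(axis=-1)(normalized)
model = tf.keras.Model(inputs, output)
```

Here the mean/feature pairing is explicit (each layer only ever sees one feature), instead of depending on column order at adapt time.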
Thank you.
Bruno