Which model to use for multicolumn string data

Hello,
I’m a complete beginner, and I don’t know which model I should use for my dataset.
My dataset consists of:

  • independent rows (I can split the dataset into N parts, and the outcome should be the same)
  • a binary classification target (which I’ve already turned into 1s and 0s)
  • several columns (candidate_name, candidate_position, reference_names, reference_position), each of which is a string.

I have a list of candidates and they apply for specific positions. I’d like the training process to notice that the valid lines are those where the candidate_name is amongst the reference_names, and where the candidate position is similar enough to the reference position (“Senior Developer” could be “Senior Dev (C++)” for instance).
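
As a rough illustration of that rule (a sketch only, not taken from any tutorial): the two signals could be computed as fuzzy string similarities with Python's standard difflib module. The helper names similarity and row_features are hypothetical, and the code assumes reference_names is a single comma-separated string.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the lowercased strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def row_features(candidate_name, candidate_position, reference_names, reference_position):
    """Turn one row of four strings into two numeric similarity features."""
    # Assumption: reference_names looks like "Name A, Name B".
    names = [n.strip() for n in reference_names.split(",") if n.strip()]
    # Best match between the candidate and any of the reference names.
    best_name = max((similarity(candidate_name, n) for n in names), default=0.0)
    # How close the candidate position is to the reference position.
    position = similarity(candidate_position, reference_position)
    return [best_name, position]
```

Each row then becomes two numbers between 0 and 1, which any ordinary binary classifier can take as input.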

So far, I’ve tried following the TensorFlow Hub tutorial “Text Classification with Movie Reviews”, which uses Google’s nnlm model (hosted on Kaggle), but the dataset in that tutorial has a single text cell per row (and not 4 as in my example).

Unfortunately, I get the following error:

import tensorflow as tf
import tensorflow_hub as hub

model = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(model, input_shape=[], dtype=tf.string, trainable=True)
hub_layer(X_train[:3])
File ~/.local/lib/python3.10/site-packages/tensorflow_hub/keras_layer.py:229, in KerasLayer.call(self, inputs, training)
    223 # ...but we may also have to pass a Python boolean for `training`, which
    224 # is the logical "and" of this layer's trainability and what the surrounding
    225 # model is doing (analogous to tf.keras.layers.BatchNormalization in TF2).
    226 # For the latter, we have to look in two places: the `training` argument,
    227 # or else Keras' global `learning_phase`, which might actually be a tensor.
    228 if not self._has_training_argument:
--> 229   result = f()
    230 else:
    231   if self.trainable:

ValueError: Exception encountered when calling layer 'keras_layer_7' (type KerasLayer).

Did I get something completely wrong? Am I using the wrong model? If so, which model would be usable here?

Many thanks!

@PascalBlokur,

Could you please share a sample of your dataset so we can get a better idea of the problem?

Thank you!

Sure.

I’ve added a comment column to explain each outcome.

candidate name | candidate position | reference names | reference positions | outcome | comment
--- | --- | --- | --- | --- | ---
Paul Smith | Senior Chief Executive | Paul Henry Smith, Noemie Schultz | Chief Executive | 0 | not enough experience
Paul H. Smith | Chief Executive | Paul Henry Smith | Chief Executive | 1 | good candidate
Paul John Smith | Chief Executive | Paul Henry Smith, Noemie Schultz | Chief Executive | 0 | unknown application
Noemie Deborah Schultz | Senior Executive | Paul John Smith, Noemie Schultz | Senior Chief Executive | 1 | she’d be great

@PascalBlokur,

You can one-hot encode the columns and then try a tree-based algorithm such as Random Forest or XGBoost.
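
A minimal sketch of that suggestion, assuming scikit-learn is installed. The four rows are copied from the sample posted earlier in the thread; the pipeline layout and the hyperparameters (n_estimators, random_state) are illustrative choices, not tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Columns: candidate name, candidate position, reference names, reference position.
X = [
    ["Paul Smith", "Senior Chief Executive", "Paul Henry Smith, Noemie Schultz", "Chief Executive"],
    ["Paul H. Smith", "Chief Executive", "Paul Henry Smith", "Chief Executive"],
    ["Paul John Smith", "Chief Executive", "Paul Henry Smith, Noemie Schultz", "Chief Executive"],
    ["Noemie Deborah Schultz", "Senior Executive", "Paul John Smith, Noemie Schultz", "Senior Chief Executive"],
]
y = [0, 1, 0, 1]  # outcome column

model = make_pipeline(
    # Each distinct string in a column becomes its own 0/1 indicator feature.
    OneHotEncoder(handle_unknown="ignore"),
    # Illustrative settings only.
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)
```

Setting handle_unknown="ignore" makes a string that was not seen during training encode as all zeros instead of raising an error at prediction time.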

Thank you!