I am trying to test a simple model using a SentencePieceTokenizer layer
over a (Hugging Face) dataset, but I can't seem to get the shape of
the dataset's target to agree with the model's output. All code is
available here: rikHak/tst_240311.py at master · rbelew/rikHak · GitHub
First, I get the dataset from HF and convert it to the TensorFlow
version that keras.Model.fit() expects, using:
trainDS = LH_dataset_HF['train'].to_tf_dataset(
    columns=["text"],
    label_cols=["answer"],
    batch_size=batch_size,
    shuffle=False,
)
I can demonstrate that the data is loaded and the SentencePieceTokenizer is
working as expected:
trainTF shape=(6, 3) answer shape=(6,)
all answers=[b'Yes' b'Yes' b'Yes' b'No' b'No' b'No']
echo1: b'My roommate and I were feeling unwell in our basement apartment for a long ...
My model begins with a keras_nlp.tokenizers.SentencePieceTokenizer
layer, has one embedding layer, and then makes a prediction:
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer) │ (None) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sentence_piece_tokenizer │ (None, 32) │ 0 │
│ (SentencePieceTokenizer) │ │ │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embed (Embedding) │ (None, 32, 100) │ 1,000,000 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ predictions (Dense) │ (None, 32, 1) │ 101 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 1,000,101 (3.82 MB)
Trainable params: 1,000,101 (3.82 MB)
Non-trainable params: 0 (0.00 B)
But when I call model.fit(trainDS), I get:
ValueError: Arguments `target` and `output` must have the same rank (ndim). Received: target.shape=(None,), output.shape=(None, 32, 1)
Questions
- Why does target.shape=(None,)?
- Is the model lacking a layer mapping the predictions to the answer
strings? And/or should the answer column be mapped to integers
instead of strings?
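To make the mismatch concrete, here is my (possibly wrong) reading of the shapes involved; the layer names are from the model summary above, and the pooling-layer idea at the end is just a guess on my part:

```python
# My reading of the ValueError: the Dense head is applied per token, so
# the model output keeps the sequence axis that the targets never had.
seq_len = 32                        # SentencePieceTokenizer sequence_length
output_shape = (None, seq_len, 1)   # shape of `predictions` (rank 3)
target_shape = (None,)              # one label per example (rank 1)

print(len(output_shape), len(target_shape))  # 3 1 -- ranks differ

# Guess: a pooling layer (e.g. keras.layers.GlobalAveragePooling1D)
# between `embed` and `predictions` would collapse the seq_len axis,
# giving a (None, 1) output that matches a per-example target.
```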
Package versions
torch=2.1.0.post100
torchtext=0.16.1
tensorflow=2.15.0
tensorflow_text=2.15.0
keras=3.0.5
keras_nlp=0.7.0