Hi,
I’m trying to build a model, part of which needs to be a BertEncoder
(models/official/nlp/modeling/networks/bert_encoder.py at master · tensorflow/models · GitHub). However, for some reason I haven’t been able to figure out, I can’t find the correct format in which to pass the inputs to the model. Whether I use a dict or a list, execution fails with an InvalidArgumentError raised by the position_embedding layer.
To see what’s happening, I’ve also built the encoder up to the word_embedding layer by subclassing tf.keras.Model as in the source code, but there I get a TypeError instead.
For preprocessing I’ve used a Transformers tokenizer, since I’m working with the TurkuNLP/bert-base-finnish-cased model (TurkuNLP/bert-base-finnish-cased-v1 · Hugging Face) and it was readily loadable there. The tokenizer output is of type transformers.tokenization_utils_base.BatchEncoding, which I’ve converted to a plain dict so that I won’t need to install the transformers library in every Colab notebook I’ll be working with.
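For reference, the conversion itself is just a key rename from the Hugging Face names to the ones BertEncoder expects; a minimal sketch of the helper I use (the function name is mine, the values pass through unchanged):

```python
def to_bert_inputs(encoded):
    """Rename Hugging Face tokenizer keys to the names BertEncoder expects.

    `encoded` can be a transformers BatchEncoding or any mapping with the
    standard Hugging Face keys: input_ids, attention_mask, token_type_ids.
    """
    return {
        'input_word_ids': encoded['input_ids'],
        'input_mask': encoded['attention_mask'],
        'input_type_ids': encoded['token_type_ids'],
    }
```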
Any advice would be highly appreciated, as I’m getting quite frustrated here. Code for the relevant parts below:
import tensorflow as tf
import tensorflow_hub as hub  # used for hub.KerasLayer below
import tensorflow_models as tfm

encoder = tfm.nlp.networks.BertEncoder(vocab_size=50105, type_vocab_size=2)
X_train_inputs
> {'input_word_ids': <tf.Tensor: shape=(8000, 512), dtype=int32, numpy=
> array([[ 102, 18381, 519, ..., 0, 0, 0],
> [ 102, 956, 38898, ..., 0, 0, 0],
> ...,
> [ 102, 11779, 2404, ..., 0, 0, 0],
> [ 102, 38571, 8273, ..., 0, 0, 0]], dtype=int32)>,
> 'input_mask': <tf.Tensor: shape=(8000, 512), dtype=int32, numpy=
> array([[1, 1, 1, ..., 0, 0, 0],
> [1, 1, 1, ..., 0, 0, 0],
> ...,
> [1, 1, 1, ..., 0, 0, 0],
> [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>,
> 'input_type_ids': <tf.Tensor: shape=(8000, 512), dtype=int32, numpy=
> array([[0, 0, 0, ..., 0, 0, 0],
> [0, 0, 0, ..., 0, 0, 0],
> ...,
> [0, 0, 0, ..., 0, 0, 0],
> [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>}
input_sample = {'input_word_ids' : X_train_inputs['input_word_ids'][0],
'input_mask' : X_train_inputs['input_mask'][0],
'input_type_ids' : X_train_inputs['input_type_ids'][0]}
input_sample
> {'input_word_ids': <tf.Tensor: shape=(512,), dtype=int32, numpy=
> array([ 102, 18381, 519, 4404, 2026, 11284, 25142, 119, 2959,
> ...,
> 0, 0, 0, 0, 0, 0, 0, 0],
> dtype=int32)>,
> 'input_mask': <tf.Tensor: shape=(512,), dtype=int32, numpy=
> array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> ...,
> 0, 0, 0, 0, 0, 0], dtype=int32)>,
> 'input_type_ids': <tf.Tensor: shape=(512,), dtype=int32, numpy=
> array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> ...
> 0, 0, 0, 0, 0, 0], dtype=int32)>}
encoder_layer = hub.KerasLayer(encoder)
encoder_layer(input_sample)
> InvalidArgumentError: Exception encountered when calling layer "position_embedding" (type PositionEmbedding).
>
> Input to reshape is a tensor with 393216 values, but the requested shape has 768 [Op:Reshape]
>
> Call arguments received by layer "position_embedding" (type PositionEmbedding):
> • inputs=tf.Tensor(shape=(512, 768), dtype=float32)
input_list = [input_sample['input_word_ids'], input_sample['input_mask'], input_sample['input_type_ids']]
type(input_list)
> list
encoder_layer(input_list)
> InvalidArgumentError: Exception encountered when calling layer "position_embedding" (type PositionEmbedding).
>
> Input to reshape is a tensor with 393216 values, but the requested shape has 768 [Op:Reshape]
>
> Call arguments received by layer "position_embedding" (type PositionEmbedding):
> • inputs=tf.Tensor(shape=(512, 768), dtype=float32)
The TypeErrors I get with the subclassed word_embeddings:
> TypeError: 'dict' object cannot be interpreted as an integer
> TypeError: Dimension value must be integer or None or have an __index__ method, got value 'ListWrapper([<tf.Tensor: shape=(512,), dtype=int32, numpy=
> array...