BertEncoder inputs?

Hi,

I’m trying to build a model, part of which needs to be a BertEncoder (models/official/nlp/modeling/networks/bert_encoder.py at master · tensorflow/models · GitHub). However, I can’t find the correct format for passing the inputs to the model. Whether I pass a dict or a list, execution fails with an InvalidArgumentError raised by the position_embedding layer.

To see what’s happening, I’ve also built the encoder up to the word_embedding layer by subclassing tf.keras.Model as in the source code, but there I get a TypeError.

For preprocessing I’ve used a Transformers tokenizer, because I’m working with the TurkuNLP/bert-base-finnish-cased model (TurkuNLP/bert-base-finnish-cased-v1 · Hugging Face) and the tokenizer was readily loadable there. The tokenizer output is of type transformers.tokenization_utils_base.BatchEncoding, which I’ve converted to a dict so that I don’t need to install the transformers library in every Colab notebook I work with.
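
A minimal sketch of that conversion, assuming the usual Hugging Face tokenizer output keys (input_ids, attention_mask, token_type_ids) and the input names shown in the printout further down. The helper name to_encoder_inputs is hypothetical, and the plain lists stand in for real tensors.

```python
# Map Hugging Face tokenizer keys to the input names used by the encoder.
KEY_MAP = {
    "input_ids": "input_word_ids",
    "attention_mask": "input_mask",
    "token_type_ids": "input_type_ids",
}


def to_encoder_inputs(encoding):
    """Rename the keys of a tokenizer output (any mapping) and return a plain dict."""
    return {new: encoding[old] for old, new in KEY_MAP.items()}


# Stand-in for a BatchEncoding holding one short, padded sequence.
encoding = {
    "input_ids": [[102, 18381, 519, 0]],
    "token_type_ids": [[0, 0, 0, 0]],
    "attention_mask": [[1, 1, 1, 0]],
}
inputs = to_encoder_inputs(encoding)
print(sorted(inputs))
```

Because the result is an ordinary dict, it can be saved and reloaded without importing transformers.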

Any advice would be highly appreciated, as I’m getting quite frustrated here :)

Code for the relevant parts below:

import tensorflow as tf
import tensorflow_hub as hub  # needed for hub.KerasLayer below
import tensorflow_models as tfm

encoder = tfm.nlp.networks.BertEncoder(vocab_size=50105, type_vocab_size=2)
X_train_inputs

> {'input_word_ids': <tf.Tensor: shape=(8000, 512), dtype=int32, numpy=
>  array([[  102, 18381,   519, ...,     0,     0,     0],
>         [  102,   956, 38898, ...,     0,     0,     0],
>         ...,
>         [  102, 11779,  2404, ...,     0,     0,     0],
>         [  102, 38571,  8273, ...,     0,     0,     0]], dtype=int32)>,
>  'input_mask': <tf.Tensor: shape=(8000, 512), dtype=int32, numpy=
>  array([[1, 1, 1, ..., 0, 0, 0],
>         [1, 1, 1, ..., 0, 0, 0],
>         ...,
>         [1, 1, 1, ..., 0, 0, 0],
>         [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>,
>  'input_type_ids': <tf.Tensor: shape=(8000, 512), dtype=int32, numpy=
>  array([[0, 0, 0, ..., 0, 0, 0],
>         [0, 0, 0, ..., 0, 0, 0],
>         ...,
>         [0, 0, 0, ..., 0, 0, 0],
>         [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>}

input_sample = {'input_word_ids' : X_train_inputs['input_word_ids'][0],
                'input_mask' : X_train_inputs['input_mask'][0],
                'input_type_ids' : X_train_inputs['input_type_ids'][0]}
input_sample

> {'input_word_ids': <tf.Tensor: shape=(512,), dtype=int32, numpy=
>  array([  102, 18381,   519,  4404,  2026, 11284, 25142,   119,  2959,
>             ...,
>             0,     0,     0,     0,     0,     0,     0,     0],
>        dtype=int32)>,
>  'input_mask': <tf.Tensor: shape=(512,), dtype=int32, numpy=
>  array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
>        ...,
>         0, 0, 0, 0, 0, 0], dtype=int32)>,
>  'input_type_ids': <tf.Tensor: shape=(512,), dtype=int32, numpy=
>  array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>        ...
>         0, 0, 0, 0, 0, 0], dtype=int32)>}

encoder_layer = hub.KerasLayer(encoder)
encoder_layer(input_sample)

> InvalidArgumentError: Exception encountered when calling layer "position_embedding" (type PositionEmbedding).
> 
> Input to reshape is a tensor with 393216 values, but the requested shape has 768 [Op:Reshape]
> 
> Call arguments received by layer "position_embedding" (type PositionEmbedding):
>   • inputs=tf.Tensor(shape=(512, 768), dtype=float32)

input_list = [input_sample['input_word_ids'], input_sample['input_mask'], input_sample['input_type_ids']]
type(input_list)

> list

encoder_layer(input_list)

> InvalidArgumentError: Exception encountered when calling layer "position_embedding" (type PositionEmbedding).
> 
> Input to reshape is a tensor with 393216 values, but the requested shape has 768 [Op:Reshape]
> 
> Call arguments received by layer "position_embedding" (type PositionEmbedding):
>   • inputs=tf.Tensor(shape=(512, 768), dtype=float32)

The TypeErrors with the subclassed word_embeddings:


> TypeError: 'dict' object cannot be interpreted as an integer
> TypeError: Dimension value must be integer or None or have an __index__ method, got value 'ListWrapper([<tf.Tensor: shape=(512,), dtype=int32, numpy=
> array...

@mattdangerw Do we have something related already in Keras-nlp?

I finally got this sorted out once I stumbled on the BertPackInputs layer. I’m still not quite sure what exactly the difference was between that layer’s outputs and what I previously tried to feed the model. But the encoder is now running and I’m happy with that. Thanks for your advice anyway :)
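
For anyone hitting the same error: judging only from the shapes in the traceback, the likely difference is a missing batch dimension. input_sample holds tensors of shape (512,), whereas BertPackInputs, as far as I know, returns batched tensors of shape (batch_size, seq_length). The numbers in the error fit this reading: 393216 = 512 × 768, where 768 is presumably the encoder's hidden size. Below is a minimal sketch of adding the batch axis by hand. The helper name add_batch_dim and the zero-filled stand-in tensors are assumptions for illustration, and the final encoder call is left as a comment because it was not run here.

```python
import tensorflow as tf


def add_batch_dim(sample):
    """Return a copy of `sample` with a leading batch axis of size 1 on every tensor."""
    return {name: tf.expand_dims(tensor, axis=0) for name, tensor in sample.items()}


# Stand-in for one unbatched example (zeros instead of real token ids).
input_sample = {
    "input_word_ids": tf.zeros([512], dtype=tf.int32),
    "input_mask": tf.zeros([512], dtype=tf.int32),
    "input_type_ids": tf.zeros([512], dtype=tf.int32),
}

batched_sample = add_batch_dim(input_sample)
for name, tensor in batched_sample.items():
    print(name, tensor.shape)

# With the encoder from the question, the call would then be (untested here):
# outputs = encoder(batched_sample)
```

Slicing with a range instead of a single index, for example X_train_inputs['input_word_ids'][:1], also keeps the batch axis.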
