Hi,
I’m trying to build a model, part of which needs to be a BertEncoder
(models/official/nlp/modeling/networks/bert_encoder.py at master · tensorflow/models · GitHub). However, for some reason I haven’t been able to figure out, I can’t find the correct format in which to pass the inputs to the model. Whether I use a dict or a list, execution fails with an InvalidArgumentError raised by the position_embedding layer.
To see what’s happening, I’ve also built the encoder up to the word_embedding layer by subclassing tf.keras.Model as in the source code, but there I get a TypeError instead.
For preprocessing I’ve used a Transformers tokenizer, since I’m working with the TurkuNLP/bert-base-finnish-cased model (TurkuNLP/bert-base-finnish-cased-v1 · Hugging Face) and it was readily loadable there. The tokenizer output is of type transformers.tokenization_utils_base.BatchEncoding, which I’ve converted to a plain dict so that I won’t need to install the transformers library in every Colab notebook I’ll be working with.
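For reference, the conversion itself is just a key rename from the Hugging Face names to the ones BertEncoder expects; a minimal sketch of the helper I use (the function name is mine, the values pass through unchanged):

```python
def to_bert_inputs(encoded):
    """Rename Hugging Face tokenizer keys to the names BertEncoder expects.

    `encoded` can be a transformers BatchEncoding or any mapping with the
    standard Hugging Face keys: input_ids, attention_mask, token_type_ids.
    """
    return {
        'input_word_ids': encoded['input_ids'],
        'input_mask': encoded['attention_mask'],
        'input_type_ids': encoded['token_type_ids'],
    }
```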
Any advice would be highly appreciated, as I’m getting quite frustrated here. Code for the relevant parts below:
import tensorflow as tf
import tensorflow_hub as hub  # used for hub.KerasLayer below
import tensorflow_models as tfm

encoder = tfm.nlp.networks.BertEncoder(vocab_size=50105, type_vocab_size=2)
X_train_inputs
> {'input_word_ids': <tf.Tensor: shape=(8000, 512), dtype=int32, numpy=
> array([[ 102, 18381, 519, ..., 0, 0, 0],
> [ 102, 956, 38898, ..., 0, 0, 0],
> ...,
> [ 102, 11779, 2404, ..., 0, 0, 0],
> [ 102, 38571, 8273, ..., 0, 0, 0]], dtype=int32)>,
> 'input_mask': <tf.Tensor: shape=(8000, 512), dtype=int32, numpy=
> array([[1, 1, 1, ..., 0, 0, 0],
> [1, 1, 1, ..., 0, 0, 0],
> ...,
> [1, 1, 1, ..., 0, 0, 0],
> [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>,
> 'input_type_ids': <tf.Tensor: shape=(8000, 512), dtype=int32, numpy=
> array([[0, 0, 0, ..., 0, 0, 0],
> [0, 0, 0, ..., 0, 0, 0],
> ...,
> [0, 0, 0, ..., 0, 0, 0],
> [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>}
input_sample = {'input_word_ids' : X_train_inputs['input_word_ids'][0],
'input_mask' : X_train_inputs['input_mask'][0],
'input_type_ids' : X_train_inputs['input_type_ids'][0]}
input_sample
> {'input_word_ids': <tf.Tensor: shape=(512,), dtype=int32, numpy=
> array([ 102, 18381, 519, 4404, 2026, 11284, 25142, 119, 2959,
> ...,
> 0, 0, 0, 0, 0, 0, 0, 0],
> dtype=int32)>,
> 'input_mask': <tf.Tensor: shape=(512,), dtype=int32, numpy=
> array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> ...,
> 0, 0, 0, 0, 0, 0], dtype=int32)>,
> 'input_type_ids': <tf.Tensor: shape=(512,), dtype=int32, numpy=
> array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> ...
> 0, 0, 0, 0, 0, 0], dtype=int32)>}
encoder_layer = hub.KerasLayer(encoder)
encoder_layer(input_sample)
> InvalidArgumentError: Exception encountered when calling layer "position_embedding" (type PositionEmbedding).
>
> Input to reshape is a tensor with 393216 values, but the requested shape has 768 [Op:Reshape]
>
> Call arguments received by layer "position_embedding" (type PositionEmbedding):
> • inputs=tf.Tensor(shape=(512, 768), dtype=float32)
input_list = [input_sample['input_word_ids'], input_sample['input_mask'], input_sample['input_type_ids']]
type(input_list)
> list
encoder_layer(input_list)
> InvalidArgumentError: Exception encountered when calling layer "position_embedding" (type PositionEmbedding).
>
> Input to reshape is a tensor with 393216 values, but the requested shape has 768 [Op:Reshape]
>
> Call arguments received by layer "position_embedding" (type PositionEmbedding):
> • inputs=tf.Tensor(shape=(512, 768), dtype=float32)
The TypeErrors I get with the subclassed word_embeddings:
> TypeError: 'dict' object cannot be interpreted as an integer
> TypeError: Dimension value must be integer or None or have an __index__ method, got value 'ListWrapper([<tf.Tensor: shape=(512,), dtype=int32, numpy=
> array...