Hi there!
My idea is to see which words my network pays the most attention to. The problem is the following (check the code below):
import tensorflow_hub as hub

# Get the preprocessor from TF Hub
tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
# Tokenize the text
text_test = ['Where are you going?']
text_preprocessed = bert_preprocess_model(text_test)
The text_preprocessed variable has three keys: input_word_ids, input_mask and input_type_ids. The values in input_word_ids are integers (which is fine), but the documentation available for this preprocessor gives no way to map those integers back to their token representations.
Just for clarity: if the code outputs something like this:
print(text_preprocessed["input_word_ids"][0, :12])
>> [ 101 2073 2024 2017 2183 1029 102 0 0 0 0 0]
then I should get back something like this:
['[CLS]', 'where', 'are', 'you', 'going', '?', '[SEP]']
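To make the direction concrete: tensorflow_text's BertTokenizer does have a detokenize method that goes from ids back to words. Here is a tiny self-contained toy showing what I mean; the vocabulary below is made up for the example, it is NOT the real BERT vocab. What I am missing is how to do this with the vocabulary the TF Hub preprocessor actually uses:
import tensorflow_text as tf_text

# Toy vocabulary, made up just for this illustration (one token per line,
# line number == token id). NOT BERT's real vocab.
toy_vocab = ["[PAD]", "[CLS]", "[SEP]", "where", "are", "you", "going", "?"]
with open("toy_vocab.txt", "w") as f:
    f.write("\n".join(toy_vocab))

tokenizer = tf_text.BertTokenizer("toy_vocab.txt", lower_case=True)

# tokenize() gives ids; merge_dims collapses the wordpiece axis
ids = tokenizer.tokenize(["where are you going?"]).merge_dims(1, 2)
print(ids)                        # e.g. [[3, 4, 5, 6, 7]]
print(tokenizer.detokenize(ids))  # [[b'where', b'are', b'you', b'going', b'?']]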
The only thing I have managed to get back are the special tokens, using this code:
preprocessor = hub.load(tfhub_handle_preprocess)
preprocessor.tokenize.get_special_tokens_dict()
>> {'start_of_sequence_id': <tf.Tensor: shape=(), dtype=int32, numpy=101>,
'mask_id': <tf.Tensor: shape=(), dtype=int32, numpy=103>,
'end_of_segment_id': <tf.Tensor: shape=(), dtype=int32, numpy=102>,
'padding_id': <tf.Tensor: shape=(), dtype=int32, numpy=0>,
'vocab_size': <tf.Tensor: shape=(), dtype=int32, numpy=30522>}
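What I am tempted to try is reading the vocabulary straight out of the downloaded SavedModel. This is just a sketch, and it rests on an assumption I have not seen documented: that the preprocess model ships its WordPiece vocabulary as assets/vocab.txt, with one token per line so that the line number is the token id. Is something like this the intended way, or is there a supported API for it?
import os
import tensorflow_hub as hub

# hub.resolve() downloads the SavedModel (or finds it in the local cache)
# and returns the directory it lives in.
local_path = hub.resolve(tfhub_handle_preprocess)

# Assumption: the WordPiece vocab is stored as assets/vocab.txt,
# one token per line, line number == token id.
vocab_path = os.path.join(local_path, "assets", "vocab.txt")
with open(vocab_path, encoding="utf-8") as f:
    id_to_token = [line.rstrip("\n") for line in f]

# Sanity check against the vocab_size reported above (30522)
print(len(id_to_token))

ids = text_preprocessed["input_word_ids"][0, :12].numpy()
print([id_to_token[i] for i in ids])
# Hoped-for output, given the ids above:
# ['[CLS]', 'where', 'are', 'you', 'going', '?', '[SEP]', '[PAD]', ...]
If that file is guaranteed to be there, it would also let me verify that rows 101, 102 and 0 really are [CLS], [SEP] and [PAD], matching the special tokens dict above.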
Thank you, people.