Hi there!
My idea is to see which words my network pays the most attention to. The problem is the following (check the code below):
import tensorflow_hub as hub

# Get the preprocessor from TF Hub
tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
# Tokenize the text
text_test = ['Where are you going?']
text_preprocessed = bert_preprocess_model(text_test)
The text_preprocessed variable has three keys: input_word_ids, input_mask and input_type_ids. The values in input_word_ids are integers (which is fine), but the documentation available for this preprocessor gives no way to map those integers back to their token representations.
Just for clarity: if the code outputs something like this:
print(text_preprocessed["input_word_ids"][0, :12])
>> [ 101 2073 2024 2017 2183 1029 102 0 0 0 0 0]
then I should get back something like this:
['[CLS]', 'where', 'are', 'you', 'going', '?', '[SEP]']
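To make the direction concrete: tensorflow_text's BertTokenizer does have a detokenize method that goes from ids back to words. Here is a tiny self-contained toy showing what I mean; the vocabulary below is made up for the example, it is NOT the real BERT vocab. What I am missing is how to do this with the vocabulary the TF Hub preprocessor actually uses:
import tensorflow_text as tf_text

# Toy vocabulary, made up just for this illustration (one token per line,
# line number == token id). NOT BERT's real vocab.
toy_vocab = ["[PAD]", "[CLS]", "[SEP]", "where", "are", "you", "going", "?"]
with open("toy_vocab.txt", "w") as f:
    f.write("\n".join(toy_vocab))

tokenizer = tf_text.BertTokenizer("toy_vocab.txt", lower_case=True)

# tokenize() gives ids; merge_dims collapses the wordpiece axis
ids = tokenizer.tokenize(["where are you going?"]).merge_dims(1, 2)
print(ids)                        # e.g. [[3, 4, 5, 6, 7]]
print(tokenizer.detokenize(ids))  # [[b'where', b'are', b'you', b'going', b'?']]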
The only thing I have managed to get back are the special tokens, using this code:
preprocessor = hub.load(tfhub_handle_preprocess)
preprocessor.tokenize.get_special_tokens_dict()
>> {'start_of_sequence_id': <tf.Tensor: shape=(), dtype=int32, numpy=101>,
'mask_id': <tf.Tensor: shape=(), dtype=int32, numpy=103>,
'end_of_segment_id': <tf.Tensor: shape=(), dtype=int32, numpy=102>,
'padding_id': <tf.Tensor: shape=(), dtype=int32, numpy=0>,
'vocab_size': <tf.Tensor: shape=(), dtype=int32, numpy=30522>}
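What I am tempted to try is reading the vocabulary straight out of the downloaded SavedModel. This is just a sketch, and it rests on an assumption I have not seen documented: that the preprocess model ships its WordPiece vocabulary as assets/vocab.txt, with one token per line so that the line number is the token id. Is something like this the intended way, or is there a supported API for it?
import os
import tensorflow_hub as hub

# hub.resolve() downloads the SavedModel (or finds it in the local cache)
# and returns the directory it lives in.
local_path = hub.resolve(tfhub_handle_preprocess)

# Assumption: the WordPiece vocab is stored as assets/vocab.txt,
# one token per line, line number == token id.
vocab_path = os.path.join(local_path, "assets", "vocab.txt")
with open(vocab_path, encoding="utf-8") as f:
    id_to_token = [line.rstrip("\n") for line in f]

# Sanity check against the vocab_size reported above (30522)
print(len(id_to_token))

ids = text_preprocessed["input_word_ids"][0, :12].numpy()
print([id_to_token[i] for i in ids])
# Hoped-for output, given the ids above:
# ['[CLS]', 'where', 'are', 'you', 'going', '?', '[SEP]', '[PAD]', ...]
If that file is guaranteed to be there, it would also let me verify that rows 101, 102 and 0 really are [CLS], [SEP] and [PAD], matching the special tokens dict above.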
Thank you, people.