Hi,
In previous versions of TF, we could use tokenizer = Tokenizer()
and then call tokenizer.fit_on_texts(input)
where input was a list of texts (in my case, a pandas DataFrame column containing a list of texts). Unfortunately this has been deprecated.
Is there a way to replicate this behaviour with TextVectorization?
Additionally, how can we split a string on upper-case letters, for instance ‘ListOfHorrorMovies’?
I understand I need to use the standardize argument of TextVectorization.
Thanks
Hi @Bondi_French, you can use the tf.keras.layers.TextVectorization layer to replicate the same behavior. For more details, please go through the code example below.
import re

# Initialise a list of CamelCase strings
sentences = [
    'ILoveMyDog',
    'ILoveMyCat',
    'YouLoveMyDog!',
    'DoYouThinkMyDogIsAmazing?'
]

# Split each sentence on upper-case letters using re
res_list = []
for sentence in sentences:
    res_list.append(re.findall('[A-Z][^A-Z]*', sentence))

# Join the pieces back into space-separated sentences and print the result
processed_sentences = []
for i in res_list:
    processed_sentences.append(" ".join(i))
print(processed_sentences)

# Output:
['I Love My Dog', 'I Love My Cat', 'You Love My Dog!', 'Do You Think My Dog Is Amazing?']
import tensorflow as tf

text_dataset = tf.data.Dataset.from_tensor_slices(processed_sentences)

max_features = 5000  # Maximum vocab size.
max_len = 10

# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)

# Now that the vocab layer has been created, call `adapt` on the
# text-only dataset to create the vocabulary.
vectorize_layer.adapt(text_dataset)
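# If you want something comparable to the old Tokenizer's word_index,
# you can inspect the vocabulary learned by `adapt` (a small sketch:
# index 0 is reserved for the padding token '' and index 1 for the
# OOV token '[UNK]'; the exact ordering depends on token frequency).
vocab = vectorize_layer.get_vocabulary()
print(vocab)
word_index = dict(zip(vocab, range(len(vocab))))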
# Create the model that uses the vectorization layer
model = tf.keras.models.Sequential()
# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing
# vocab indices.
model.add(vectorize_layer)
# Now, the model can map strings to integers, and you can add an
# embedding layer to map these integers to learned embeddings.
input_data = [
    'i really love my dog',
    'my dog loves my manatee'
]
model.predict(input_data)
# Output:
1/1 [==============================] - 0s 297ms/step
array([[6, 1, 3, 2, 4, 0, 0, 0, 0, 0],
[2, 4, 1, 2, 1, 0, 0, 0, 0, 0]])
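For the second part of the question (splitting CamelCase strings such as ‘ListOfHorrorMovies’), the split can also be pushed into the layer itself through the standardize argument of TextVectorization, which accepts a callable mapping a string tensor to a string tensor. The sketch below is only one way to do it; the regex, the camel_case_standardize name and the punctuation stripping are illustrative assumptions, not the only option:

def camel_case_standardize(input_text):
    # Insert a space before every upper-case letter, e.g.
    # 'ListOfHorrorMovies' -> ' List Of Horror Movies'
    spaced = tf.strings.regex_replace(input_text, r'([A-Z])', r' \1')
    # Lowercase and drop basic punctuation, roughly mimicking the default
    # 'lower_and_strip_punctuation' standardization
    lowered = tf.strings.lower(spaced)
    return tf.strings.regex_replace(lowered, r'[!?.,]', '')

camel_vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    standardize=camel_case_standardize,
    output_mode='int',
    output_sequence_length=max_len)

# The raw CamelCase strings can now be adapted and fed directly,
# without the separate `re` pre-processing step used above.
camel_vectorize_layer.adapt(tf.data.Dataset.from_tensor_slices(sentences))
print(camel_vectorize_layer(tf.constant(['IReallyLoveMyDog'])))

One thing to keep in mind: if you plan to save a model containing a layer with a custom standardize callable, the callable has to be available (and registered as serializable) when the model is reloaded.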
Thank You.