Hi,
In previous versions of TF, we could use tokenizer = Tokenizer()
and then call tokenizer.fit_on_texts(input)
where input was a list of texts (in my case, a pandas DataFrame column containing a list of texts). Unfortunately this has been deprecated.
Is there a way to replicate this behaviour with TextVectorization?
Additionally, how can we split a string on upper-case letters, for instance ‘ListOfHorrorMovies’?
I understand I need to use the standardize argument of TextVectorization.
Thanks
Hi @Bondi_French, you can use the tf.keras.layers.TextVectorization layer to replicate the same behavior. For more details, please go through the code example below.
import re

# Initialise a list of CamelCase strings
sentences = [
    'ILoveMyDog',
    'ILoveMyCat',
    'YouLoveMyDog!',
    'DoYouThinkMyDogIsAmazing?'
]

# Split each sentence on upper-case letters using re
res_list = []
for sentence in sentences:
    res_list.append(re.findall('[A-Z][^A-Z]*', sentence))

# Join the pieces back into space-separated sentences and print the result
processed_sentences = []
for i in res_list:
    processed_sentences.append(" ".join(i))
print(processed_sentences)

# Output:
['I Love My Dog', 'I Love My Cat', 'You Love My Dog!', 'Do You Think My Dog Is Amazing?']
import tensorflow as tf

text_dataset = tf.data.Dataset.from_tensor_slices(processed_sentences)

max_features = 5000  # Maximum vocab size.
max_len = 10

# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)

# Now that the vocab layer has been created, call `adapt` on the
# text-only dataset to create the vocabulary.
vectorize_layer.adapt(text_dataset)
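# If you want something comparable to the old Tokenizer's word_index,
# you can inspect the vocabulary learned by `adapt` (a small sketch:
# index 0 is reserved for the padding token '' and index 1 for the
# OOV token '[UNK]'; the exact ordering depends on token frequency).
vocab = vectorize_layer.get_vocabulary()
print(vocab)
word_index = dict(zip(vocab, range(len(vocab))))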
# Create the model that uses the vectorization layer
model = tf.keras.models.Sequential()
# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing
# vocab indices.
model.add(vectorize_layer)
# Now, the model can map strings to integers, and you can add an
# embedding layer to map these integers to learned embeddings.
input_data = [
    'i really love my dog',
    'my dog loves my manatee'
]
model.predict(input_data)
# Output:
1/1 [==============================] - 0s 297ms/step
array([[6, 1, 3, 2, 4, 0, 0, 0, 0, 0],
[2, 4, 1, 2, 1, 0, 0, 0, 0, 0]])
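For the second part of the question (splitting CamelCase strings such as ‘ListOfHorrorMovies’), the split can also be pushed into the layer itself through the standardize argument of TextVectorization, which accepts a callable mapping a string tensor to a string tensor. The sketch below is only one way to do it; the regex, the camel_case_standardize name and the punctuation stripping are illustrative assumptions, not the only option:

def camel_case_standardize(input_text):
    # Insert a space before every upper-case letter, e.g.
    # 'ListOfHorrorMovies' -> ' List Of Horror Movies'
    spaced = tf.strings.regex_replace(input_text, r'([A-Z])', r' \1')
    # Lowercase and drop basic punctuation, roughly mimicking the default
    # 'lower_and_strip_punctuation' standardization
    lowered = tf.strings.lower(spaced)
    return tf.strings.regex_replace(lowered, r'[!?.,]', '')

camel_vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    standardize=camel_case_standardize,
    output_mode='int',
    output_sequence_length=max_len)

# The raw CamelCase strings can now be adapted and fed directly,
# without the separate `re` pre-processing step used above.
camel_vectorize_layer.adapt(tf.data.Dataset.from_tensor_slices(sentences))
print(camel_vectorize_layer(tf.constant(['IReallyLoveMyDog'])))

One thing to keep in mind: if you plan to save a model containing a layer with a custom standardize callable, the callable has to be available (and registered as serializable) when the model is reloaded.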
Thank You.