I am trying to vectorize text (Wikipedia) loaded with tfds.load. I am trying to do something like this
This NLP example uses the IMDB reviews data, and I was able to follow it successfully, but I cannot get it to work for the Wikipedia dataset. Apparently there is some inherent difference between the two types of datasets.
I have tried the following:
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
# Load Wikipedia dataset from tfds
dataset, info = tfds.load("wikipedia/20230601.ab", with_info=True, split=tfds.Split.TRAIN)
print(type(dataset))
for i in dataset:
    print(i['text'].numpy().decode('utf-8'))
# Create a TextVectorization layer to convert text to vectors
vectorize_layer = TextVectorization(
    max_tokens=100,
    output_mode='int',
    output_sequence_length=50
)
# Adapt the vectorization layer to the dataset
# Each element is a single dict, so the mapped function takes one argument
vectorize_layer.adapt(dataset.map(lambda x: x['text']))
model = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=len(vectorize_layer.get_vocabulary()), output_dim=64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
This much runs without a problem. But when I fit the model:
model.fit(dataset, epochs=5)
I get this error:
TypeError: Expected string passed to parameter 'input' of op 'StringLower', got {'text': <tf.Tensor 'IteratorGetNext:0' shape=() dtype=string>, 'title': <tf.Tensor 'IteratorGetNext:1' shape=() dtype=string>} of type 'dict' instead. Error: Expected string, got <tf.Tensor 'IteratorGetNext:0' shape=() dtype=string> of type 'Tensor' instead.
Hi @srivassid, while reproducing the issue by executing the given code, I am getting the error at the above line. I have gone through the dataset and can see that it contains no labels; only dict_keys(['text', 'title']) are present. Could you please let us know how you have defined the labels?
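The dictionary structure mentioned above can be checked without iterating over the data by looking at the dataset's element_spec. The following is only a minimal sketch: the small in-memory dataset is a hypothetical stand-in that mimics the 'text' and 'title' features, not the real Wikipedia split.

```python
import tensorflow as tf

# Hypothetical stand-in for the Wikipedia split: every element is a dict
# with 'text' and 'title' string features and no label.
dataset = tf.data.Dataset.from_tensor_slices({
    "text": ["first article body", "second article body"],
    "title": ["First", "Second"],
})

# element_spec describes the structure of one element without reading data.
spec = dataset.element_spec
print(sorted(spec.keys()))  # feature names only; there is no label entry
print(spec["text"].dtype)   # each feature is a string tensor

# For a dict-structured dataset, map() passes ONE argument (the dict),
# so the text can be pulled out with a single-argument function.
text_only = dataset.map(lambda example: example["text"])
first = next(iter(text_only)).numpy().decode("utf-8")
print(first)
```

Because model.fit expects elements shaped like (inputs, targets), a dataset whose elements are plain dicts of strings ends up being fed whole into the first layer, which is consistent with the StringLower error reported in the question.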
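import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers import TextVectorization
# Load all splits of the Wikipedia dataset (returns a dict keyed by split name)
dataset, info = tfds.load("wikipedia/20230601.ab", with_info=True)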
# Prepare the text data
texts = [example['text'].numpy().decode('utf-8') for example in dataset['train']]
labels = [0] * len(texts) # Dummy labels for illustration purposes
# Create a TextVectorization layer
vectorize_layer = TextVectorization(
    max_tokens=50000,  # You can adjust this value based on your requirements
    output_mode='tf_idf',  # documented spelling uses an underscore
)
# Adapt the layer to the text data
vectorize_layer.adapt(texts)
# Vectorize the text data
vectorized_texts = vectorize_layer(texts)
labels = tf.convert_to_tensor(labels, dtype=tf.float32)
# Build a simple neural network model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(vectorize_layer.vocabulary_size(),)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(vectorized_texts, labels, epochs=5)
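The code above collects every article into a Python list, which may not scale to larger Wikipedia dumps. A streaming variant that keeps everything in tf.data could look roughly like the sketch below. It makes several assumptions: the in-memory dataset is a hypothetical stand-in for the 'train' split, the labels are dummy zeros as in the code above, and the layer sizes are arbitrary small values chosen for illustration.

```python
import tensorflow as tf

# Hypothetical stand-in for tfds.load(...)["train"]: dicts of 'text' and 'title'.
raw = tf.data.Dataset.from_tensor_slices({
    "text": ["the cat sat on the mat", "dogs chase cats",
             "a quiet river", "the river runs"],
    "title": ["Cat", "Dog", "River", "Run"],
})

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=100, output_mode="int", output_sequence_length=8)

# adapt() only needs the text, so drop the title and batch the strings.
text_batches = raw.map(lambda example: example["text"]).batch(2)
vectorize_layer.adapt(text_batches)

def to_pair(text_batch):
    # Vectorize inside the pipeline and attach dummy labels (all zeros).
    # Replace the zeros with real labels for an actual task.
    labels = tf.zeros([tf.shape(text_batch)[0], 1])
    return vectorize_layer(text_batch), labels

train_ds = text_batches.map(to_pair)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vectorize_layer.vocabulary_size(), 16, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(train_ds, epochs=1, verbose=0)
```

Vectorizing in the dataset pipeline rather than inside the model means the model only ever sees integer tensors paired with labels, so fit receives the (inputs, targets) structure it expects. With the real dataset, raw would be replaced by the loaded 'train' split.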