TextVectorization significantly slower than sklearn's CountVectorizer

Hello,
We are trying to replace sklearn's CountVectorizer with the Keras TextVectorization layer. We experimented with a text dataset of 400K sentences, using unigrams and bigrams, and used adapt to build the vocabulary.

We wrote a small Keras model with a vectorization layer, as shown in the documentation, and compiled the model.

Calling predict is significantly slower (by 500 times on a desktop, no GPUs) than sklearn's transform function. Any suggestions?
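
For reference, a minimal sketch of the setup described above (the corpus, layer arguments, and model are illustrative placeholders, not our actual code):

import tensorflow as tf

# Placeholder corpus; the real dataset has ~400K sentences.
corpus = tf.constant(['the cat sat on the mat', 'the dog ran in the park'])

# Unigrams and bigrams, per-document counts (matching CountVectorizer).
vectorizer = tf.keras.layers.TextVectorization(ngrams=2, output_mode='count')
vectorizer.adapt(corpus)  # build the vocabulary

# Small Keras model wrapping the layer, as in the documentation.
inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
model = tf.keras.Model(inputs, vectorizer(inputs))
model.compile()

# The counterpart of sklearn's transform; this is the slow call.
counts = model.predict(tf.reshape(corpus, (-1, 1)))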

Thank you!

I would guess this is an issue with the output representation. By default, TextVectorization produces a dense Tensor; this is simpler for small examples and guarantees that the layer's output can be used with any other Keras layer. But when using output_mode='count' (or 'multi_hot' or 'tf_idf') with a large vocabulary (and the vocabulary is probably quite large when using bigrams), this dense output gets very inefficient very quickly.
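
To make the cost concrete, a toy illustration (the vocabulary is tiny here, but with bigrams over 400K sentences it can easily reach hundreds of thousands of terms):

import tensorflow as tf

vectorizer = tf.keras.layers.TextVectorization(output_mode='count')
vectorizer.adapt(tf.constant(['the cat sat', 'the dog ran']))

out = vectorizer(tf.constant([['the cat']]))
print(out.shape)  # (1, vocab_size): one dense row that is mostly zeros
# At batch_size=32 and vocab_size=200,000, each batch is a
# 32 x 200,000 float32 tensor, roughly 25 MB of mostly zeros.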

We recently added an option for sparse output from the TextVectorization layer. This is currently only available in tf-nightly, and will be in a stable release in tensorflow 2.7.

tf.keras.layers.TextVectorization(output_mode='count', sparse=True)

The sparse output from a layer constructed like this can then be fed into a tf.keras.layers.Dense layer, and scales up much more effectively.
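
For example, a sketch of that combination (the layer width is arbitrary, and this assumes tf-nightly or tensorflow >= 2.7):

import tensorflow as tf

vectorizer = tf.keras.layers.TextVectorization(
    ngrams=2, output_mode='count', sparse=True)
vectorizer.adapt(tf.constant(['the cat sat', 'the dog ran']))

inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = vectorizer(inputs)                  # tf.SparseTensor of shape (batch, vocab_size)
outputs = tf.keras.layers.Dense(16)(x)  # Dense layers accept sparse inputs
model = tf.keras.Model(inputs, outputs)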


Thanks, Mathew. Setting sparse=True brings performance on par with sklearn.
