Is there a way to tokenize text from a tf.string tensor with AutoTokenizer from transformers? That would let us use transformers inside existing TensorFlow models, and it should be a lot faster.
It would also open up many possibilities, since we could run multiple models in parallel and combine their outputs with concat.
Let’s say I have this piece of code:
import tensorflow as tf
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model
from transformers import AutoTokenizer, TFAutoModel

def get_model():
    text_input = Input(shape=(), dtype=tf.string, name='text')
    MODEL = "ping/pong"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    transformer_layer = TFAutoModel.from_pretrained(MODEL)
    # The error below is raised on this line
    preprocessed_text = tokenizer(text_input)
    outputs = transformer_layer(preprocessed_text)
    output_sequence = outputs['sequence_output']
    x = Flatten()(output_sequence)
    x = Dense(NUM_CLASS, activation='sigmoid')(x)
    model = Model(inputs=[text_input], outputs=[x])
    return model
But this gives me an error saying:
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_27/788693747.py in <module>
      1 optimizer = Adam()
----> 2 model = get_model()
      3 model.compile(loss=CategoricalCrossentropy(from_logits=True),optimizer=optimizer,metrics=[Accuracy(), ],)
      4 model.summary()

/tmp/ipykernel_27/330097806.py in get_model()
      6
      7     text_input = Input(shape=(), dtype=tf.string, name='text')
----> 8     preprocessed_text = tokenizer(text_input)
      9     outputs = transformer_layer(preprocessed_text)
     10

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2466         if not _is_valid_text_input(text):
   2467             raise ValueError(
-> 2468                 "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
   2469                 "or `List[List[str]]` (batch of pretokenized examples)."
   2470             )

ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
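
To narrow down what the traceback is pointing at, here is my rough understanding of the input check, written as a minimal plain-Python sketch. It is only a simplified stand-in based on the wording of the error message, not the library's actual code: `looks_like_text_input` and `FakeSymbolicTensor` are hypothetical names I made up for illustration, and the real `_is_valid_text_input` may differ in detail.

```python
def looks_like_text_input(text):
    """Simplified stand-in for the check named in the traceback (hypothetical)."""
    if isinstance(text, str):
        return True  # single example
    if isinstance(text, (list, tuple)):
        if len(text) == 0:
            return True  # empty batch
        first = text[0]
        if isinstance(first, str):
            return True  # batch, or a single pretokenized example
        if isinstance(first, (list, tuple)):
            # batch of pretokenized examples
            return len(first) == 0 or isinstance(first[0], str)
    return False


class FakeSymbolicTensor:
    """Hypothetical placeholder for what Input(dtype=tf.string) returns:
    a node in the model graph, not actual Python text."""
    dtype = "string"


print(looks_like_text_input("hello world"))         # True
print(looks_like_text_input(["hello", "world"]))    # True
print(looks_like_text_input([["hello", "world"]]))  # True
print(looks_like_text_input(FakeSymbolicTensor()))  # False
```

If that reading is right, the call fails before any tokenization happens, because the object returned by Input carries no Python strings at model-building time, whatever its dtype says.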