Improving Dataflow Pipelines for Text Data Processing

Sayak_Paul · March 3, 2022, 10:23am

This is something we (@anon1529149 and I) worked on at Carted: improving text data processing at scale with Apache Beam and Cloud Dataflow.

Blog post:

Code:

We use some tools from the TensorFlow ecosystem, such as a BERT model from TensorFlow Hub and TFRecords for serializing the preprocessed data. I hope this will be beneficial for the community: with these techniques, we were able to reduce the total wall-clock time from more than 3 days to under 3 hours.

We further optimized the BERT model used in the blog post with ONNX (since we run on CPUs), and the whole pipeline now takes around 1 hr 45 mins.

lgusm · March 3, 2022, 11:48am

This is super cool!!! Congrats!

Question: why does the last step make the model better? What changed in the model? Does it replace ops with ones optimised for CPU?

Sayak_Paul · March 3, 2022, 12:04pm

Do you mean the ONNX conversion step? If so, it is because the ONNX conversion performs graph optimizations such as layer fusion and replacing layers that produce constant values with precomputed constants. This simplifies the model graph, which reduces latency.
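
To make the "replaces layers producing constant values" part concrete, here is a toy sketch of that idea (usually called constant folding) on a hand-made graph. This is not ONNX code and not what we ran in the pipeline; the `Node` class, `fold_constants`, and `prune` are hypothetical names made up for illustration, and the real optimizers work on far richer graphs.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical graph node, for illustration only (not an ONNX type)."""
    op: str                        # "input", "const", "add", or "mul"
    inputs: Tuple[str, ...] = ()   # names of the nodes feeding this one
    value: float = 0.0             # only meaningful for "const" nodes


def fold_constants(graph: Dict[str, Node]) -> Dict[str, Node]:
    """Replace add/mul nodes whose inputs are all constants with a const node."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    folded = dict(graph)
    changed = True
    while changed:  # repeat until no more nodes can be folded
        changed = False
        for name, node in list(folded.items()):
            if node.op in ops and all(folded[i].op == "const" for i in node.inputs):
                a, b = (folded[i].value for i in node.inputs)
                folded[name] = Node("const", value=ops[node.op](a, b))
                changed = True
    return folded


def prune(graph: Dict[str, Node], output: str) -> Dict[str, Node]:
    """Drop nodes that the output no longer depends on."""
    keep = set()
    stack = [output]
    while stack:
        name = stack.pop()
        if name not in keep:
            keep.add(name)
            stack.extend(graph[name].inputs)
    return {name: node for name, node in graph.items() if name in keep}


# Toy graph for y = x * (2 + 3); the (2 + 3) part never depends on the input.
graph = {
    "x": Node("input"),
    "two": Node("const", value=2.0),
    "three": Node("const", value=3.0),
    "sum": Node("add", inputs=("two", "three")),
    "y": Node("mul", inputs=("x", "sum")),
}

simplified = prune(fold_constants(graph), "y")
print(sorted(simplified))  # the "two" and "three" nodes are gone
```

The `sum` node is computed once ahead of time instead of on every inference call, so the graph that actually runs is smaller. Layer fusion is a separate optimization (merging adjacent ops into one kernel) and is not shown here.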

lgusm · March 3, 2022, 2:23pm

yes, it was the ONNX conversion step, thanks!
