This is something we (@anon1529149 and I) worked on at Carted: improving text data processing at scale with Apache Beam and Cloud Dataflow.
Blog post:
Code:
We use several tools from the TensorFlow ecosystem, such as a BERT model from TensorFlow Hub and TFRecords for serializing the preprocessed data. I hope this will be useful to the community: with these techniques, we reduced the total wall-clock time from more than 3 days to under 3 hours.
We have since further optimized the BERT model used in the blog post with ONNX (since we run on CPUs), and the whole pipeline now takes around 1 hr 45 min.
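
For anyone new to Beam, here is a rough sketch of the general shape such a pipeline can take. It is not the code from the blog post: `embed_batch`, `process_batch`, and `run` are hypothetical names, the "embedding" is a toy placeholder standing in for the BERT model, and the batch sizes are arbitrary. Batching elements before calling the model is an assumption on my part about how one would typically amortize inference cost.

```python
from typing import Iterable, List


def embed_batch(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a BERT encoder: one toy vector per text."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


def process_batch(texts: List[str]) -> Iterable[dict]:
    """Pair each text with its (toy) embedding, one output record per input."""
    for text, vector in zip(texts, embed_batch(texts)):
        yield {"text": text, "embedding": vector}


def run(texts: List[str]) -> None:
    # Imported lazily so the helpers above can be tried without Beam installed.
    import apache_beam as beam

    # Uses the default local runner; Cloud Dataflow would be selected
    # through pipeline options instead.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.Create(texts)
            # Group elements into lists so the model is called once per batch.
            | "Batch" >> beam.BatchElements(min_batch_size=2, max_batch_size=32)
            | "Embed" >> beam.FlatMap(process_batch)
            | "Print" >> beam.Map(print)
        )
```

Calling `run(["hello world", "beam"])` would execute this locally. In a real job the final step would write the records out (for example as TFRecords) rather than print them.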