What is the best model architecture in TensorFlow for extracting text from any image (OCR)?

Hi TensorFlow community,

I’m working on a deep learning project where I want to extract all visible text from any kind of image, including medicine cartons, product labels, and scanned documents.
Objective:
Build a TensorFlow/Keras model that outputs all the text present in an input image.
I’ve tried:

  • Implemented a CRNN model (CNN + BiLSTM + CTC loss) using TensorFlow and Keras.
  • Preprocessed images by resizing them to a fixed height and padding the width so the aspect ratio is preserved.
  • Used a custom character set for CTC decoding.
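
To make the last two steps concrete, here is a minimal plain-Python sketch of the resize-and-pad geometry and of greedy (best-path) CTC decoding with a custom character set. It is an illustration under assumptions, not the actual training code: the target height of 32, the padded width of 128, the character set, and the helper names `resized_size` and `ctc_greedy_decode` are all hypothetical, and the frame indices are assumed to be the per-time-step argmax of the model output.

```python
# Assumed input geometry; real values depend on the dataset.
TARGET_HEIGHT = 32
MAX_WIDTH = 128

# Hypothetical character set. The index len(CHARSET) is reserved for the CTC blank.
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789 "
BLANK = len(CHARSET)


def resized_size(orig_h, orig_w, target_h=TARGET_HEIGHT, max_w=MAX_WIDTH):
    """Return (new_w, pad_right) for a resize to target_h that keeps the aspect ratio.

    If the scaled width exceeds max_w it is clamped, so very wide images
    are squeezed instead of padded.
    """
    scale = target_h / orig_h
    new_w = max(1, min(max_w, round(orig_w * scale)))
    return new_w, max_w - new_w


def ctc_greedy_decode(frame_ids, charset=CHARSET, blank=BLANK):
    """Collapse consecutive repeated labels, then drop blanks."""
    chars = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            chars.append(charset[idx])
        prev = idx
    return "".join(chars)


# A 64x200 image scaled to height 32 becomes 100 px wide, leaving 28 px of padding.
new_w, pad_right = resized_size(64, 200)

# Frames "t t _ o o _ o" decode to "too": the blank separates the two o's.
t, o = CHARSET.index("t"), CHARSET.index("o")
decoded = ctc_greedy_decode([t, t, BLANK, o, o, BLANK, o])
```

The blank between the two `o` frames is what lets CTC emit a doubled letter. Without it, the repeated label would collapse into a single character.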

Are there better model architectures (such as ViT or encoder-decoder transformers) that work well for this task in TensorFlow?
I’d also appreciate:

  • Tips on:
    • Preprocessing input images
    • Handling different font styles and small text
    • Preparing the training dataset
  • Open-source Keras model references (if available)

Thanks in advance for your guidance!