What is the best model architecture in TensorFlow for extracting text from any image (OCR)?

Hi TensorFlow community,

I’m working on a deep learning project where I want to extract all visible text from any kind of image, including medicine cartons, product labels, and scanned documents.

Objective:
Build a TensorFlow/Keras model that outputs all the text present in an input image.

I’ve tried:

  • Implemented a CRNN model (CNN + BiLSTM + CTC loss) using TensorFlow and Keras.
  • Preprocessed images by resizing them to a fixed height and padding the width, so the aspect ratio is preserved.
  • Used a custom character set for CTC decoding.
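
A minimal sketch of the preprocessing and decoding steps listed above, written in plain NumPy so it runs without a trained model. The character set, the 32×128 target size, the white padding value, and the function names `resize_and_pad` and `ctc_greedy_decode` are illustrative assumptions, not the exact pipeline used in this project. The resize uses nearest-neighbour sampling for simplicity; a real pipeline would probably use a proper interpolating resize.

```python
import numpy as np

# Hypothetical character set. Indices 0..len(CHARSET)-1 map to characters;
# the extra last index is reserved for the CTC blank (an assumed convention).
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789 "
BLANK = len(CHARSET)


def resize_and_pad(img, target_h=32, target_w=128, pad_value=255):
    """Resize a grayscale (H, W) image to target_h, keeping the aspect
    ratio, then pad on the right up to target_w."""
    h, w = img.shape
    # Width after scaling to the target height. It is capped at target_w,
    # so very wide images get squeezed rather than cropped.
    new_w = min(target_w, max(1, round(w * target_h / h)))
    # Nearest-neighbour sampling: pick a source row/column for each output pixel.
    rows = np.minimum((np.arange(target_h) * h / target_h).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) * w / new_w).astype(int), w - 1)
    resized = img[rows][:, cols]
    # Fill the canvas with the padding value, then paste the resized image.
    out = np.full((target_h, target_w), pad_value, dtype=img.dtype)
    out[:, :new_w] = resized
    return out


def ctc_greedy_decode(logits):
    """Greedy CTC decoding for one sample.

    logits: array of shape (time_steps, len(CHARSET) + 1).
    Takes the best class per time step, collapses consecutive repeats,
    and drops blanks.
    """
    best = np.argmax(logits, axis=1)
    chars = []
    prev = BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            chars.append(CHARSET[idx])
        prev = idx
    return "".join(chars)
```

The blank between two identical characters is what lets CTC emit double letters: the path `l, blank, l` decodes to `ll`, while `l, l` collapses to a single `l`.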

Are there any better model architectures (like ViT or encoder-decoder transformers) that work well in TensorFlow? I would also appreciate:

  • Tips on:
    • Preprocessing input images
    • Handling different font styles and small text
    • Preparing the training dataset
  • Open-source Keras model references (if available)

Thanks in advance for your guidance!

Hi @vani_anandan, to get started with building an OCR model, refer to these resources: Document1, Document2. Thanks!