Hi there,
I’m pretty new at this, so I’m not sure whether this is a bug or I’m doing something wrong.
I’m trying to run BERT in Google Colab on a TPU; however, I’m getting an error message, which can be seen here.
TensorFlow version: 2.8.0
The code I’m using for loading the TPU is largely based on Google’s original T5 pre-training code, taken from here:
print("Installing dependencies...")
%tensorflow_version 2.x
import functools
import os
import time
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import tensorflow.compat.v1 as tf
import tensorflow_datasets as tfds
BASE_DIR = "gs://bucket-xx" #@param { type: "string" }
if not BASE_DIR or BASE_DIR == "gs://":
raise ValueError("You must enter a BASE_DIR.")
DATA_DIR = os.path.join(BASE_DIR, "data/text.csv")
MODELS_DIR = os.path.join(BASE_DIR, "models/bert")
ON_CLOUD = True
if ON_CLOUD:
print("Setting up GCS access...")
import tensorflow_gcs_config
from google.colab import auth
# Set credentials for GCS reading/writing from Colab and TPU.
TPU_TOPOLOGY = "v2-8"
try:
tpu = tf.distribute.cluster_resolver.TPUClusterResolver() # TPU detection
TPU_ADDRESS = tpu.get_master()
print('Running on TPU:', TPU_ADDRESS)
except ValueError:
raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
auth.authenticate_user()
tf.enable_eager_execution()
tf.config.experimental_connect_to_host(TPU_ADDRESS)
tensorflow_gcs_config.configure_gcs_from_colab_auth()
tf.disable_v2_behavior()
# Improve logging.
from contextlib import contextmanager
import logging as py_logging
if ON_CLOUD:
tf.get_logger().propagate = False
py_logging.root.setLevel('INFO')
@contextmanager
def tf_verbosity_level(level):
og_level = tf.logging.get_verbosity()
tf.logging.set_verbosity(level)
yield
tf.logging.set_verbosity(og_level)
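In case it’s relevant: since I’m on TensorFlow 2.8.0, I wasn’t sure whether the compat.v1-style setup above is still the right approach. For comparison, my understanding from the TensorFlow docs is that the TF2-native TPU setup looks roughly like this (just a sketch, not the code I’m actually running):

import tensorflow as tf

# TF2-native TPU setup (sketch; tpu="" asks the resolver to auto-detect
# the Colab TPU).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
print("TPU devices:", tf.config.list_logical_devices("TPU"))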
This is the code that I’m using to train BERT:
!python /content/scripts/run_mlm.py \
  --model_name_or_path bert-base-cased \
  --tpu_num_cores 8 \
  --validation_split_percentage 20 \
  --line_by_line \
  --learning_rate 2e-5 \
  --per_device_train_batch_size 128 \
  --per_device_eval_batch_size 256 \
  --num_train_epochs 4 \
  --output_dir {MODELS_DIR} \
  --train_file /content/text.csv
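One thing to note: MODELS_DIR is a Python variable, so I’m relying on Colab’s {var} interpolation in ! commands to expand it to the actual bucket path. A quick sanity check (just for illustration):

# Colab replaces {MODELS_DIR} in a ! command with the Python variable's
# value, so run_mlm.py should receive the real gs:// path rather than
# the literal string "MODELS_DIR".
!echo {MODELS_DIR}
# expected output: gs://bucket-xx/models/bert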
The run_mlm.py script is taken from the original transformers repo and can be seen here.
I found a very similar issue, which can be seen here.
Any help is much appreciated. Thanks!