Stuck running distributed training with MultiWorkerMirroredStrategy

Hi guys, I am currently trying to set up a distributed training cluster using 2 Linux GPU machines.
My runtime is the latest TensorFlow Jupyter GPU Docker image (TF 2.7.0) on both machines. The training code is the TensorFlow Object Detection “basic code” from models/research/object_detection/model_main_tf2.py in the tensorflow/models GitHub repository (Models and examples built with TensorFlow).

(Training example here: Training Custom Object Detector — TensorFlow 2 Object Detection API tutorial documentation).

Training works well on each machine on its own, but if I increase “num_workers” (which causes MultiWorkerMirroredStrategy to be used) and the “batch_size” to two (the TF_CONFIG environment variable is set at this point), I can see the two workers start connecting, but then one worker fails with this exception:

2021-12-17 15:27:41.389682: W tensorflow/core/framework/dataset.cc:744] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 683, in __next__
    return self.get_next()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 740, in get_next
    return self._get_next_no_partial_batch_handling(name)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 772, in _get_next_no_partial_batch_handling
    replicas.extend(self._iterators[i].get_next_as_list(new_name))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 2021, in get_next_as_list
    return self._format_data_list_with_options(self._iterator.get_next())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 584, in get_next
    result.append(self._device_iterators[i].get_next())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 853, in get_next
    return self._next_internal()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 783, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2845, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 7107, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "model_main_tf2.py", line 119, in
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 109, in main
    model_lib_v2.train_loop(
  File "/usr/local/lib/python3.8/dist-packages/object_detection/model_lib_v2.py", line 605, in train_loop
    load_fine_tune_checkpoint(
  File "/usr/local/lib/python3.8/dist-packages/object_detection/model_lib_v2.py", line 400, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/usr/local/lib/python3.8/dist-packages/object_detection/model_lib_v2.py", line 160, in _ensure_model_is_built
    features, labels = iter(input_dataset).next()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 679, in next
    return self.__next__()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 685, in __next__
    raise StopIteration
StopIteration

Any ideas?

Greetings, Michael


Hello @Michael_Kilian

Thank you for using TensorFlow.

Could you please check that TF_CONFIG is set correctly on both machines? The chief worker should have

"task": {"type": "worker", "index": 0}

and the other worker should have

"task": {"type": "worker", "index": 1}

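As a rough sketch of what a complete TF_CONFIG could look like on each machine: the "cluster" entry has to be identical on both workers and only the "task" index differs. The addresses in `WORKERS` and the helper `make_tf_config` are hypothetical and only there for illustration; substitute the real host:port of each machine.

```python
import json
import os

# Hypothetical addresses; replace with the real host:port of each machine.
WORKERS = ["10.0.0.1:12345", "10.0.0.2:12345"]


def make_tf_config(workers, index):
    """Build the TF_CONFIG JSON string for the worker at position `index`."""
    if not 0 <= index < len(workers):
        raise ValueError(f"index {index} is outside the worker list")
    return json.dumps({
        # The cluster spec must be the same on every machine.
        "cluster": {"worker": list(workers)},
        # Only the task index changes from machine to machine.
        "task": {"type": "worker", "index": index},
    })


# On the first machine (worker 0 takes the chief role when the cluster
# spec defines no separate "chief" task):
os.environ["TF_CONFIG"] = make_tf_config(WORKERS, 0)

# On the second machine, use index 1 instead:
# os.environ["TF_CONFIG"] = make_tf_config(WORKERS, 1)
```

TF_CONFIG needs to be in the environment before the MultiWorkerMirroredStrategy object is created, so set it at the very top of the script or export it in the shell that launches training.
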
Please also check the data pipeline, since the error is an OutOfRangeError (End of sequence), which means the input iterator on that worker ran out of data. See the documentation for DistributedDataset (tf.distribute.DistributedDataset  |  TensorFlow v2.16.1)
and consider migrating the code to the latest TensorFlow version for better compatibility.
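
One possible way to get End of sequence on only one worker is that the input files are split across workers and there are fewer files than workers. Whether that is what happens in this setup is an assumption, not something the traceback confirms. The plain-Python sketch below only mimics the selection rule of `tf.data.Dataset.shard(num_shards, index)` (keep every num_shards-th element, starting at index); the file names and the `empty_workers` helper are hypothetical.

```python
def shard(items, num_shards, index):
    """Mimic tf.data.Dataset.shard: keep every num_shards-th item, starting at index."""
    return items[index::num_shards]


def empty_workers(input_files, num_workers):
    """Return the indices of workers whose file shard would be empty."""
    return [
        i for i in range(num_workers)
        if not shard(input_files, num_workers, i)
    ]


# Hypothetical file lists, for illustration only.
one_file = ["train.record"]
two_files = ["train-00000-of-00002.record", "train-00001-of-00002.record"]

print(empty_workers(one_file, 2))   # [1]: worker 1 would have no input at all
print(empty_workers(two_files, 2))  # []: every worker gets at least one file
```

If this turns out to be the cause, writing the training data as at least as many record files as there are workers would be one thing to try.
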
Thank you.