Object detection model building fails on Mac M2 with GPU usage, with an unexpected error

I am building a model on a custom dataset with mask_rcnn_inception_resnet as the base model. I have managed to complete a training run on Ubuntu (CPU only). Now I am trying to make it work on a MacBook M2.

The test runs recommended by various sources all pass, for example the one from:

https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/install.html

From within TensorFlow/models/research/

python object_detection/builders/model_builder_tf2_test.py

But when I run my actual model training script, I get the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_18_device_/job:localhost/replica:0/task:0/device:GPU:0}} indices[0] = 0 is not in [0, 0)
	 [[{{node GatherV2_7}}]]
	 [[MultiDeviceIteratorGetNextFromShard]]
	 [[RemoteCall]] [Op:IteratorGetNext] name:
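
A note on reading this message: `[0, 0)` is an empty range, so the gather op was asked for index 0 of something with zero elements. The sketch below is a pure-Python mimic of that bounds check, not TensorFlow's implementation; `gather` and `describe_gather_failure` are hypothetical names used only for illustration, and which tensor is empty in the real input pipeline is not something the message alone reveals.

```python
def gather(params, indices):
    """Pure-Python stand-in for a gather op with a TensorFlow-style bounds check."""
    size = len(params)
    out = []
    for pos, idx in enumerate(indices):
        # Valid indices form the half-open range [0, size).
        if not 0 <= idx < size:
            raise IndexError(f"indices[{pos}] = {idx} is not in [0, {size})")
        out.append(params[idx])
    return out


def describe_gather_failure(params, indices):
    """Return the error text if the gather fails, or None if it succeeds."""
    try:
        gather(params, indices)
    except IndexError as err:
        return str(err)
    return None
```

Calling `describe_gather_failure([], [0])` returns the same wording as the error above, while a non-empty list with in-range indices returns `None`.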

The full console log is below:

python3 model_main_tf2.py --model_dir=models/ark_mask_rcnn_inception_resnet_v2 --pipeline_config_path=models/ark_mask_rcnn_inception_resnet_v2/pipeline.config
2023-09-10 23:11:55.486121: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M2 Pro
2023-09-10 23:11:55.486143: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2023-09-10 23:11:55.486147: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2023-09-10 23:11:55.486172: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-10 23:11:55.486190: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-10 23:11:55.487664: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-10 23:11:55.487673: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0910 23:11:55.487923 8568659456 mirrored_strategy.py:419] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0910 23:11:55.496275 8568659456 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0910 23:11:55.496325 8568659456 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0910 23:11:55.509262 8568659456 deprecation.py:364] From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['annotations/train.record']
I0910 23:11:55.512531 8568659456 dataset_builder.py:162] Reading unweighted datasets: ['annotations/train.record']
INFO:tensorflow:Reading record datasets for input file: ['annotations/train.record']
I0910 23:11:55.512593 8568659456 dataset_builder.py:79] Reading record datasets for input file: ['annotations/train.record']
INFO:tensorflow:Number of filenames to read: 1
I0910 23:11:55.512616 8568659456 dataset_builder.py:80] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0910 23:11:55.512634 8568659456 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.deterministic`.
W0910 23:11:55.515808 8568659456 deprecation.py:364] From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.deterministic`.
WARNING:tensorflow:From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
W0910 23:11:55.524945 8568659456 deprecation.py:364] From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
WARNING:tensorflow:From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:459: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0910 23:11:56.154265 8568659456 deprecation.py:569] From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:459: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0910 23:11:58.014597 8568659456 deprecation.py:364] From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0910 23:11:58.898831 8568659456 deprecation.py:364] From /Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
2023-09-10 23:12:00.177526: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-09-10 23:12:00.181435: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
Traceback (most recent call last):
  File "/Users/_dga/ml-git/tf-ark/Tensorflow/workspace/training_demo/model_main_tf2.py", line 126, in <module>
    tf.compat.v1.app.run()
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/platform/app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/Users/_dga/ml-git/tf-ark/Tensorflow/workspace/training_demo/model_main_tf2.py", line 117, in main
    model_lib_v2.train_loop(
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 605, in train_loop
    load_fine_tune_checkpoint(
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 401, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 161, in _ensure_model_is_built
    features, labels = iter(input_dataset).next()
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 260, in next
    return self.__next__()
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 264, in __next__
    return self.get_next()
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 325, in get_next
    return self._get_next_no_partial_batch_handling(name)
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 361, in _get_next_no_partial_batch_handling
    replicas.extend(self._iterators[i].get_next_as_list(new_name))
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 1427, in get_next_as_list
    return self._format_data_list_with_options(self._iterator.get_next())
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 553, in get_next
    result.append(self._device_iterators[i].get_next())
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 867, in get_next
    return self._next_internal()
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 777, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3028, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/Users/_dga/anaconda3/envs/tf-ark/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 6656, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_18_device_/job:localhost/replica:0/task:0/device:GPU:0}} indices[0] = 0 is not in [0, 0)
	 [[{{node GatherV2_7}}]]
	 [[MultiDeviceIteratorGetNextFromShard]]
	 [[RemoteCall]] [Op:IteratorGetNext] name: 
(tf-ark)  _dga@ :> 

In the absence of any response to this question, I had to do my own investigation. My working assumption is described in the only answer here → python 3.x - Training custom data set model using mask_rcnn_inception from tensorflow model zoo on Macbook pro M2 - Stack Overflow

for future reference ^^

This assumption also seems to fall flat; it could be a compatibility issue between the model and TF2 or Python. Multiple other people have reported bugs matching this error, but there is no resolution yet.

https://github.com/tensorflow/models/issues/9067

more issues matching it.

@gautam,

Unfortunately, we do not support research models and suggest that you use the official object detection models.

On an M1 MacBook Pro, we ran the same code and it works as expected.

Please see the gist for running on M1.

Thank you!

@chunduriv I am trying to run the above gist on an M2, but I am running into errors.

As you mentioned, I can't install the tf-models-official package for 2.13, but I am able to install it from the models repo, which gives me version 2.5 …

Is there an API change, or am I missing a package? I am using the same ipynb file from the gist, so I believe I have installed all the required packages.
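
To make the version question easier to answer, a small script can list what is actually installed in the environment. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the distribution names in the list are assumptions that may need adjusting to match the output of `pip list`.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match `pip list`.
    names = ["tensorflow", "tensorflow-macos", "tensorflow-metal",
             "tf-models-official", "object-detection"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output next to the error would show whether the TensorFlow and tf-models-official versions are the ones the gist expects.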