TensorFlow Model Garden with different models

Hi,
I'm trying to follow this Colab notebook with a different model: Google Colab

However, I ran into some errors.

I'm trying to use cascadercnn_spinenet_coco as the model, with the following code:

exp_config = exp_factory.get_exp_config('cascadercnn_spinenet_coco')
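
(For context, this assumes the tutorial's usual imports, roughly:)

import tensorflow as tf
import tensorflow_models as tfm
from official.core import exp_factory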

I then change the model configuration with the following lines:
batch_size = 16
num_classes = 1

HEIGHT, WIDTH = 512, 512
IMG_SIZE = [HEIGHT, WIDTH, 3]

# Backbone config.
exp_config.runtime.num_gpus = 1
exp_config.task.freeze_backbone = True
exp_config.task.annotation_file = ''

# Model config.
exp_config.task.model.input_size = IMG_SIZE
exp_config.task.model.num_classes = num_classes + 1
exp_config.task.model.detection_generator.max_classes_per_detection = exp_config.task.model.num_classes

# Training data config.
exp_config.task.train_data.input_path = train_data_input_path
exp_config.task.train_data.dtype = 'float32'
exp_config.task.train_data.global_batch_size = batch_size
exp_config.task.train_data.parser.aug_scale_max = 1.0
exp_config.task.train_data.parser.aug_scale_min = 1.0

# Validation data config.
exp_config.task.validation_data.input_path = valid_data_input_path
exp_config.task.validation_data.dtype = 'float32'
exp_config.task.validation_data.global_batch_size = batch_size

# Trainer config.
train_steps = 50000
exp_config.trainer.steps_per_loop = 100  # steps_per_loop = num_of_training_examples // train_batch_size
exp_config.runtime.num_gpus = 1
exp_config.trainer.summary_interval = 100
exp_config.trainer.checkpoint_interval = 100
exp_config.trainer.validation_interval = 100
exp_config.trainer.validation_steps = 100  # validation_steps = num_of_validation_examples // eval_batch_size
exp_config.trainer.train_steps = train_steps
exp_config.trainer.optimizer_config.warmup.linear.warmup_steps = 100
exp_config.trainer.optimizer_config.learning_rate.type = 'cosine'
exp_config.trainer.optimizer_config.learning_rate.cosine.decay_steps = train_steps
exp_config.trainer.optimizer_config.learning_rate.cosine.initial_learning_rate = 0.1
exp_config.trainer.optimizer_config.warmup.linear.warmup_learning_rate = 0.05
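
The full configuration is then printed for verification. A dump like the one below is presumably produced with something along these lines (the pretty-printing cell itself isn't shown here):

import pprint

pp = pprint.PrettyPrinter(indent=2)
pp.pprint(exp_config.as_dict())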

Configured Model:

{ 'runtime': { 'all_reduce_alg': None,
               'batchnorm_spatial_persistent': False,
               'dataset_num_private_threads': None,
               'default_shard_dim': -1,
               'distribution_strategy': 'mirrored',
               'enable_xla': False,
               'gpu_thread_mode': None,
               'loss_scale': None,
               'mixed_precision_dtype': 'bfloat16',
               'num_cores_per_replica': 1,
               'num_gpus': 1,
               'num_packs': 1,
               'per_gpu_thread_count': 0,
               'run_eagerly': False,
               'task_index': -1,
               'tpu': None,
               'tpu_enable_xla_dynamic_padder': None,
               'use_tpu_mp_strategy': False,
               'worker_hosts': None},
  'task': { 'allow_image_summary': False,
            'allowed_mask_class_ids': None,
            'annotation_file': '',
            'differential_privacy_config': None,
            'freeze_backbone': True,
            'init_checkpoint': None,
            'init_checkpoint_modules': 'all',
            'losses': { 'class_weights': None,
                        'frcnn_box_weight': 1.0,
                        'frcnn_class_loss_top_k_percent': 1.0,
                        'frcnn_class_use_binary_cross_entropy': False,
                        'frcnn_class_weight': 1.0,
                        'frcnn_huber_loss_delta': 1.0,
                        'l2_weight_decay': 4e-05,
                        'loss_weight': 1.0,
                        'mask_weight': 1.0,
                        'rpn_box_weight': 1.0,
                        'rpn_huber_loss_delta': 0.1111111111111111,
                        'rpn_score_weight': 1.0},
            'model': { 'anchor': { 'anchor_size': 3,
                                   'aspect_ratios': [0.5, 1.0, 2.0],
                                   'num_scales': 1},
                       'backbone': { 'spinenet': { 'max_level': 7,
                                                   'min_level': 3,
                                                   'model_id': '49',
                                                   'stochastic_depth_drop_rate': 0.0},
                                     'type': 'spinenet'},
                       'decoder': {'identity': {}, 'type': 'identity'},
                       'detection_generator': { 'apply_nms': True,
                                                'max_classes_per_detection': 2,
                                                'max_num_detections': 100,
                                                'nms_iou_threshold': 0.5,
                                                'nms_version': 'v2',
                                                'pre_nms_score_threshold': 0.05,
                                                'pre_nms_top_k': 5000,
                                                'soft_nms_sigma': None,
                                                'use_cpu_nms': False,
                                                'use_sigmoid_probability': False},
                       'detection_head': { 'cascade_class_ensemble': True,
                                           'class_agnostic_bbox_pred': True,
                                           'fc_dims': 1024,
                                           'num_convs': 4,
                                           'num_fcs': 1,
                                           'num_filters': 256,
                                           'use_separable_conv': False},
                       'include_mask': True,
                       'input_size': [512, 512, 3],
                       'mask_head': { 'class_agnostic': False,
                                      'num_convs': 4,
                                      'num_filters': 256,
                                      'upsample_factor': 2,
                                      'use_separable_conv': False},
                       'mask_roi_aligner': {'crop_size': 14, 'sample_offset': 0.5},
                       'mask_sampler': {'num_sampled_masks': 128},
                       'max_level': 7,
                       'min_level': 3,
                       'norm_activation': { 'activation': 'swish',
                                            'norm_epsilon': 0.001,
                                            'norm_momentum': 0.99,
                                            'use_sync_bn': True},
                       'num_classes': 2,
                       'outer_boxes_scale': 1.0,
                       'roi_aligner': {'crop_size': 7, 'sample_offset': 0.5},
                       'roi_generator': { 'nms_iou_threshold': 0.7,
                                          'num_proposals': 1000,
                                          'pre_nms_min_size_threshold': 0.0,
                                          'pre_nms_score_threshold': 0.0,
                                          'pre_nms_top_k': 2000,
                                          'test_nms_iou_threshold': 0.7,
                                          'test_num_proposals': 1000,
                                          'test_pre_nms_min_size_threshold': 0.0,
                                          'test_pre_nms_score_threshold': 0.0,
                                          'test_pre_nms_top_k': 1000,
                                          'use_batched_nms': False},
                       'roi_sampler': { 'background_iou_high_threshold': 0.5,
                                        'background_iou_low_threshold': 0.0,
                                        'cascade_iou_thresholds': [0.6, 0.7],
                                        'foreground_fraction': 0.25,
                                        'foreground_iou_threshold': 0.5,
                                        'mix_gt_boxes': True,
                                        'num_sampled_rois': 512},
                       'rpn_head': { 'num_convs': 1,
                                     'num_filters': 256,
                                     'use_separable_conv': False}},
            'name': None,
            'per_category_metrics': False,
            'train_data': { 'apply_tf_data_service_before_batching': False,
                            'autotune_algorithm': None,
                            'block_length': 1,
                            'cache': False,
                            'cycle_length': None,
                            'decoder': { 'simple_decoder': { 'attribute_names': [],
                                                             'mask_binarize_threshold': None,
                                                             'regenerate_source_id': False},
                                         'type': 'simple_decoder'},
                            'deterministic': None,
                            'drop_remainder': True,
                            'dtype': 'float32',
                            'enable_shared_tf_data_service_between_parallel_trainers': False,
                            'enable_tf_data_service': False,
                            'file_type': 'tfrecord',
                            'global_batch_size': 16,
                            'input_path': './pothole_coco_tfrecords/train-00000-of-00001.tfrecord',
                            'is_training': True,
                            'num_examples': -1,
                            'parser': { 'aug_rand_hflip': True,
                                        'aug_rand_vflip': False,
                                        'aug_scale_max': 1.0,
                                        'aug_scale_min': 1.0,
                                        'aug_type': None,
                                        'mask_crop_size': 112,
                                        'match_threshold': 0.5,
                                        'max_num_instances': 100,
                                        'num_channels': 3,
                                        'pad': True,
                                        'rpn_batch_size_per_im': 256,
                                        'rpn_fg_fraction': 0.5,
                                        'rpn_match_threshold': 0.7,
                                        'rpn_unmatched_threshold': 0.3,
                                        'skip_crowd_during_training': True,
                                        'unmatched_threshold': 0.5},
                            'prefetch_buffer_size': None,
                            'seed': None,
                            'sharding': True,
                            'shuffle_buffer_size': 10000,
                            'tf_data_service_address': None,
                            'tf_data_service_job_name': None,
                            'tfds_as_supervised': False,
                            'tfds_data_dir': '',
                            'tfds_name': '',
                            'tfds_skip_decoding_feature': '',
                            'tfds_split': '',
                            'trainer_id': None,
                            'weights': None},
            'use_approx_instance_metrics': False,
            'use_coco_metrics': True,
            'use_wod_metrics': False,
            'validation_data': { 'apply_tf_data_service_before_batching': False,
                                 'autotune_algorithm': None,
                                 'block_length': 1,
                                 'cache': False,
                                 'cycle_length': None,
                                 'decoder': { 'simple_decoder': { 'attribute_names': [],
                                                                  'mask_binarize_threshold': None,
                                                                  'regenerate_source_id': False},
                                              'type': 'simple_decoder'},
                                 'deterministic': None,
                                 'drop_remainder': False,
                                 'dtype': 'float32',
                                 'enable_shared_tf_data_service_between_parallel_trainers': False,
                                 'enable_tf_data_service': False,
                                 'file_type': 'tfrecord',
                                 'global_batch_size': 16,
                                 'input_path': './pothole_coco_tfrecords/valid-00000-of-00001.tfrecord',
                                 'is_training': False,
                                 'num_examples': -1,
                                 'parser': { 'aug_rand_hflip': False,
                                             'aug_rand_vflip': False,
                                             'aug_scale_max': 1.0,
                                             'aug_scale_min': 1.0,
                                             'aug_type': None,
                                             'mask_crop_size': 112,
                                             'match_threshold': 0.5,
                                             'max_num_instances': 100,
                                             'num_channels': 3,
                                             'pad': True,
                                             'rpn_batch_size_per_im': 256,
                                             'rpn_fg_fraction': 0.5,
                                             'rpn_match_threshold': 0.7,
                                             'rpn_unmatched_threshold': 0.3,
                                             'skip_crowd_during_training': True,
                                             'unmatched_threshold': 0.5},
                                 'prefetch_buffer_size': None,
                                 'seed': None,
                                 'sharding': True,
                                 'shuffle_buffer_size': 10000,
                                 'tf_data_service_address': None,
                                 'tf_data_service_job_name': None,
                                 'tfds_as_supervised': False,
                                 'tfds_data_dir': '',
                                 'tfds_name': '',
                                 'tfds_skip_decoding_feature': '',
                                 'tfds_split': '',
                                 'trainer_id': None,
                                 'weights': None}},
  'trainer': { 'allow_tpu_summary': False,
               'best_checkpoint_eval_metric': '',
               'best_checkpoint_export_subdir': '',
               'best_checkpoint_metric_comp': 'higher',
               'checkpoint_interval': 100,
               'continuous_eval_timeout': 3600,
               'eval_tf_function': True,
               'eval_tf_while_loop': False,
               'loss_upper_bound': 1000000.0,
               'max_to_keep': 5,
               'optimizer_config': { 'ema': None,
                                     'learning_rate': { 'cosine': { 'alpha': 0.0,
                                                                    'decay_steps': 50000,
                                                                    'initial_learning_rate': 0.1,
                                                                    'name': 'CosineDecay',
                                                                    'offset': 0},
                                                        'type': 'cosine'},
                                     'optimizer': { 'sgd': { 'clipnorm': None,
                                                             'clipvalue': None,
                                                             'decay': 0.0,
                                                             'global_clipnorm': None,
                                                             'momentum': 0.9,
                                                             'name': 'SGD',
                                                             'nesterov': False},
                                                    'type': 'sgd'},
                                     'warmup': { 'linear': { 'name': 'linear',
                                                             'warmup_learning_rate': 0.05,
                                                             'warmup_steps': 100},
                                                 'type': 'linear'}},
               'preemption_on_demand_checkpoint': True,
               'recovery_begin_steps': 0,
               'recovery_max_trials': 0,
               'steps_per_loop': 100,
               'summary_interval': 100,
               'train_steps': 50000,
               'train_tf_function': True,
               'train_tf_while_loop': True,
               'validation_interval': 100,
               'validation_steps': 100,
               'validation_summary_subdir': 'validation'}}

The errors I run into:

When I run the following code:

for images, labels in task.build_inputs(exp_config.task.train_data).take(1):
    print()
    print(f'images.shape: {str(images.shape):16} images.dtype: {images.dtype!r}')
    print(f'labels.keys: {labels.keys()}')

Output error:

InvalidArgumentError                      Traceback (most recent call last)
in <cell line: 1>()
----> 1 for images, labels in task.build_inputs(exp_config.task.train_data).take(1):
      2     print()
      3     print(f'images.shape: {str(images.shape):16} images.dtype: {images.dtype!r}')
      4     print(f'labels.keys: {labels.keys()}')

3 frames
/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
   5881 def raise_from_not_ok_status(e, name) -> NoReturn:
   5882   e.message += (" name: " + str(name if name is not None else ""))
-> 5883   raise core._status_to_exception(e) from None  # pylint: disable=protected-access
   5884
   5885

InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_21_device_/job:localhost/replica:0/task:0/device:CPU:0}} indices[0] = 0 is not in [0, 0)
[[{{node GatherV2_2}}]] [Op:IteratorGetNext] name:
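
As far as I can tell, the [0, 0) here denotes an empty valid-index range: some op is gathering index 0 from a tensor whose relevant dimension has size 0. A minimal standalone snippet (not from the notebook) reproduces the same error class:

import tensorflow as tf

# Gathering index 0 from an empty tensor raises the same message:
# "indices[0] = 0 is not in [0, 0)".
empty = tf.zeros([0], dtype=tf.float32)
tf.gather(empty, [0])  # InvalidArgumentError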

When I run the training with the following code:

model, eval_logs = tfm.core.train_lib.run_experiment(
    distribution_strategy=distribution_strategy,
    task=task,
    mode='train_and_eval',
    params=exp_config,
    model_dir=model_dir,
    run_post_eval=True)
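
(Here distribution_strategy, task, and model_dir come from earlier notebook cells; presumably, following the tutorial, something like this, where model_dir is a hypothetical output path:)

import tensorflow as tf
import tensorflow_models as tfm

model_dir = './trained_model'  # hypothetical output directory
distribution_strategy = tf.distribute.MirroredStrategy()
with distribution_strategy.scope():
    task = tfm.core.task_factory.get_task(exp_config.task, logging_dir=model_dir)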

Output error:

WARNING:absl:SpineNet output level 2 out of range [min_level, max_level] = [3, 7] will not be used for further processing.

InvalidArgumentError                      Traceback (most recent call last)
in <cell line: 1>()
----> 1 model, eval_logs = tfm.core.train_lib.run_experiment(
      2     distribution_strategy=distribution_strategy,
      3     task=task,
      4     mode='train_and_eval',
      5     params=exp_config,

17 frames
/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
   5881 def raise_from_not_ok_status(e, name) -> NoReturn:
   5882   e.message += (" name: " + str(name if name is not None else ""))
-> 5883   raise core._status_to_exception(e) from None  # pylint: disable=protected-access
   5884
   5885

InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:CPU:0}} slice index 0 of dimension 1 out of bounds. [Op:StridedSlice] name: strided_slice/
In call to configurable 'Trainer' (<class 'official.core.base_trainer.Trainer'>)
In call to configurable 'create_trainer' (<function create_trainer at 0x7e67d0270f70>)
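
This looks like the same class of failure: "slice index 0 of dimension 1 out of bounds" means some tensor has a zero-sized dimension 1 by the time it is sliced. A minimal illustration (again, not the notebook's code):

import tensorflow as tf

# Slicing index 0 along a zero-sized dimension raises
# "slice index 0 of dimension 1 out of bounds".
t = tf.zeros([2, 0])
t[:, 0]  # InvalidArgumentError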

I think the two errors have the same root cause, which I suspect is related to the batch size, but I haven't figured out how to solve it. Thank you for your interest.

@Taha_Er you should carefully review the code related to the batch size, both in the model configuration and in the data-loading pipeline. Make sure the specified batch size is consistent across all relevant components. Additionally, consider printing or logging the batch size at different points in the code to verify its consistency. If the issue persists, check compatibility with your TensorFlow version and review the data-loading process for any unexpected changes.
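
For instance, a minimal sketch using the names from your post (element_spec reports the batched structure without iterating the dataset, so it does not trigger the GatherV2 error):

import tensorflow as tf

# Declared structure of the input pipeline, without iterating it;
# the leading axis of each spec is the batch dimension.
ds = task.build_inputs(exp_config.task.train_data)
print(ds.element_spec)

# Also verify that the training TFRecord is not empty.
num_records = sum(1 for _ in tf.data.TFRecordDataset(train_data_input_path))
print(f'num records: {num_records}')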

Sorry for the late answer; I am still facing the same problem.

@Taha_Er

1. Print Batch Size: Print the batch size at different code points to ensure consistency.

   print(f'Batch Size: {batch_size}')

2. Inspect Data Shapes: Log the shapes of the input data during loading to identify inconsistencies. (Note that in this task labels is a dict, so print its keys rather than a shape; see also the TFRecord-inspection sketch after this list.)

   print(f'Images shape: {images.shape}, Labels keys: {list(labels.keys())}')

3. Verify TensorFlow Version: Check and print the TensorFlow version for compatibility.

   import tensorflow as tf
   print(tf.__version__)

4. Check Model Compatibility: Review the model documentation and source code for compatibility. Confirm input requirements match the data shape.

5. Review Model Architecture: Check the Cascade R-CNN model configuration for correct parameters. Ensure alignment with the input data.

6. Test with a Smaller Dataset: Train the model on a smaller dataset to isolate the issue.

7. Check GitHub Issues or Forums: Search for similar issues or solutions related to the model version.

8. Consider Code Execution Order: Verify cells are executed in the correct order.

If issues persist, provide more details on the error location and messages for targeted assistance.
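
As promised in item 2, here is a minimal sketch for inspecting what the TFRecords actually contain (the path is taken from your config dump; the exact feature keys depend on how the records were generated):

import tensorflow as tf

# Parse one raw record and list its feature keys and sizes, to check
# that annotations (boxes, classes, masks) are actually present.
raw_ds = tf.data.TFRecordDataset('./pothole_coco_tfrecords/train-00000-of-00001.tfrecord')
for raw_record in raw_ds.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    for key, feature in example.features.feature.items():
        kind = feature.WhichOneof('kind')
        size = len(getattr(feature, kind).value)
        print(f'{key}: {kind}[{size}]')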