MobileNet SSD input image size

I am trying to implement an object detection algorithm with MobileNet SSD v1 FPN 640x640 using the TensorFlow Object Detection API. My input images are of size 1024x25, and I am getting errors regarding the image size.

I would like to know what constraints there are on the input image size when using MobileNet SSD, and whether there are alternative ways to implement the object detection algorithm on my images…

Hi @Mary_Joy, in the config file, could you please try changing the height and width in the image_resizer part? Thank You.


Thank you for your reply.
I tried changing the dimensions to 1024x25, but got an error saying the minimum height should be 33.

Then I changed the dimensions to 1024x33 and got the following error:

ValueError: Dimensions must be equal, but are 4 and 3 for '{{node ssd_mobile_net_v2_fpn_keras_feature_extractor/FeatureMaps/top_down/add}} = AddV2[T=DT_FLOAT](ssd_mobile_net_v2_fpn_keras_feature_extractor/FeatureMaps/top_down/nearest_neighbor_upsampling/nearest_neighbor_upsampling/Reshape_1, ssd_mobile_net_v2_fpn_keras_feature_extractor/FeatureMaps/top_down/projection_2/BiasAdd)' with input shapes: [16,4,64,128], [16,3,64,128].
        
        
        Call arguments received:
          • image_features=[("'layer_7'", 'tf.Tensor(shape=(16, 5, 128, 32), dtype=float32)'), ("'layer_14'", 'tf.Tensor(shape=(16, 3, 64, 96), dtype=float32)'), ("'layer_19'", 'tf.Tensor(shape=(16, 2, 32, 1280), dtype=float32)')]
    
    
    Call arguments received:
      • inputs=tf.Tensor(shape=(16, 33, 1024, 3), dtype=float32)
      • kwargs={'training': 'False'}

Given below is my pipeline.config file

# SSD with Mobilenet v2 FPN-lite (go/fpn-lite) feature extractor, shared box
# predictor and focal loss (a mobile version of Retinanet).
# Retinanet: see Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from Imagenet classification checkpoint
# Train on TPU-8
#
# Achieves 22.2 mAP on COCO17 Val

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 2
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 33
        width: 1024
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 128
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true,
            decay: 0.997,
            epsilon: 0.001,
          }
        }
        num_layers_before_predictor: 4
        share_prediction_tower: true
        use_depthwise: true
        kernel_size: 3
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2_fpn_keras'
      use_depthwise: true
      fpn {
        min_level: 3
        max_level: 7
        additional_layer_depth: 128
      }
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          random_normal_initializer {
            stddev: 0.01
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint: "/content/models/mymodel/ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "detection"
  batch_size: 16
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 100
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .08
          total_steps: 50000
          warmup_learning_rate: .026666
          warmup_steps: 1000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  label_map_path: "/content/labelmap.pbtxt"
  tf_record_input_reader {
    input_path: "/content/train.tfrecord"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}

eval_input_reader: {
  label_map_path: "/content/labelmap.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "/content/val.tfrecord"
  }
}

Is it possible to use these image dimensions for training this model, or do I have to make further changes to the config file?

Hi @Mary_Joy, I think that as the input is passed through the layers of the model, its spatial shape keeps being reduced. Could you please try with images that have some more height? Thank You.
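To make that concrete, the shape mismatch in the traceback can be reproduced with a quick back-of-the-envelope calculation. This is only an illustrative sketch, not code from the Object Detection API: with a height of 33, the backbone maps at strides 8/16/32 have heights 5, 3 and 2, and the FPN top-down pathway then tries to add a 2x-upsampled height of 4 to a height of 3.

```python
import math

# Heights of the backbone feature maps the FPN combines (layer_7, layer_14
# and layer_19 in the traceback above), produced at strides 8, 16 and 32
# with 'same' padding, i.e. ceil(height / stride).
def backbone_heights(input_height, strides=(8, 16, 32)):
    return [math.ceil(input_height / s) for s in strides]

for h in (33, 64):
    h8, h16, h32 = backbone_heights(h)
    # The FPN top-down pathway upsamples each coarser map by 2x and adds it
    # to the next finer map, so each height must be exactly double the next.
    compatible = (2 * h32 == h16) and (2 * h16 == h8)
    print(f'height {h}: stride-8/16/32 heights = {h8}, {h16}, {h32} '
          f'-> FPN top-down add works: {compatible}')
```

For a height of 33 this gives 5, 3, 2: upsampling 2 by a factor of 2 gives 4, which cannot be added to 3, exactly the "4 and 3" in the error. A height of 64 gives 8, 4, 2, which halves cleanly at every step.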


@Kiran_Sai_Ramineni Thank you for your reply.
It works with 1024x64.
But is it possible to use 1024x33 by modifying the layer architecture? What changes would I need to make for this?

Hi @Mary_Joy, yes, by changing the model architecture it is possible to pass 1024x33.

You have to go through the model layer by layer and find at which layer the dimension is being reduced. Thank You.


Can you provide some insight into where this model architecture is defined in the TensorFlow Object Detection API? Thank you.

Hi @Mary_Joy, you can use model.summary() to get the model architecture along with the input and output shapes of the layers present in the model. Thank You.
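If it helps, below is a rough sketch of building the model from the config and printing the shape of every feature map the FPN produces, which is a quick way to see where the height gets reduced. It assumes the TF2 Object Detection API is installed; the pipeline.config path is hypothetical, so replace it with your own.

```python
import tensorflow as tf
from object_detection.utils import config_util
from object_detection.builders import model_builder

# Hypothetical path; point this at your own pipeline.config.
configs = config_util.get_configs_from_pipeline_file('/content/models/mymodel/pipeline.config')
detection_model = model_builder.build(model_config=configs['model'], is_training=False)

# A dummy batch; preprocess() applies the fixed_shape_resizer from the config,
# so the actual input size passed here does not matter.
dummy = tf.zeros([1, 64, 1024, 3], dtype=tf.float32)
image, true_shapes = detection_model.preprocess(dummy)
prediction_dict = detection_model.predict(image, true_shapes)

# Spatial shape of each FPN feature map (one entry per level).
for i, fmap in enumerate(prediction_dict['feature_maps']):
    print(f'feature map {i}: {fmap.shape}')
```

With the resizer set to 33x1024 the predict() call should raise the same ValueError as above, so this is also a convenient way to test candidate image sizes before starting a full training run.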