How to prepare object detection data for tf.data?

I’m working on an object detection project and use a tf.data.Dataset input pipeline to load local data. Object detection needs not only the image but also its annotations, and because the number of annotations differs from image to image, the pipeline is harder to build. I tried several approaches, but none of them worked. Here are my attempts; I’ve run out of ideas and would really appreciate your help!

Parse XML

My local data is in Pascal VOC format. First, I used .from_tensor_slices() on the annotation file paths, parsed each file to get the image path, and finally batched the results with .ragged_batch(). But inside .map(load), the path string is automatically converted into a symbolic tensor, Tensor("args_0:0", shape=(), dtype=string), which cannot be passed to libraries such as the XML parser ElementTree. So I used tf.py_function() to convert it back into a Python string.
Then I ran into what turned out to be a TensorFlow bug: "Dataset.ragged_batch does not produce correct specs with tf.py_function and tf.numpy_function" (Issue #60710, tensorflow/tensorflow on GitHub).

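import xml.etree.ElementTree as ET

import tensorflow as tf

# classNames (list of label strings) and imageFolder (str) are defined elsewhere
annotation_files = [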
    '082f7a7f-IMG_0512.xml',
    '4f4c7f54-IMG_0511.xml',
    '5381454b-IMG_0510.xml',
    '05517884-IMG_0514.xml'
]

def load(annotationFile):
    # load annotation (boxes, class ids)
    def _loadAnnotation(annotationFile):
        thisBoxes = []
        thisClassIDs = []
        annotationFile = annotationFile.numpy().decode("utf-8")
        root = ET.parse(annotationFile).getroot()
        for obj in root.findall("object"):  # "obj" avoids shadowing the built-in "object"
            # load bounding boxes
            bndbox = obj.find("bndbox")
            xmin = int(bndbox.find("xmin").text)
            ymin = int(bndbox.find("ymin").text)
            xmax = int(bndbox.find("xmax").text)
            ymax = int(bndbox.find("ymax").text)
            thisBoxes.append([xmin, ymin, xmax, ymax])
            # load class IDs
            className = obj.find("name").text
            classID = classNames.index(className)
            thisClassIDs.append(classID)
        # image file path
        imageFile = imageFolder + "/" + root.find('filename').text
        return (imageFile, tf.cast(thisBoxes, dtype=tf.float32), tf.cast(thisClassIDs, dtype=tf.float32))
    
    imageFile, thisBoxes, thisClassIDs = tf.py_function(_loadAnnotation, [annotationFile], [tf.string, tf.float32, tf.float32])
    
    # load image
    image = tf.io.read_file(imageFile)
    image = tf.image.decode_jpeg(image, channels=3)

    # package annotation (boxes, class ids) to dictionary
    bounding_boxes = {
        "boxes": tf.cast(thisBoxes, dtype=tf.float32),
        "classes": tf.cast(thisClassIDs, dtype=tf.float32)
    }

    return {"images": tf.cast(image, dtype=tf.float32), "bounding_boxes": bounding_boxes}

dataset = tf.data.Dataset.from_tensor_slices(annotation_files)
dataset = dataset.map(load)
dataset = dataset.ragged_batch(4)
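
The XML parsing in _loadAnnotation does not itself depend on TensorFlow, so it can be pulled out into a plain Python function that takes an ordinary str path and can be tested on its own; only a thin wrapper then needs tf.py_function. A minimal sketch, assuming the standard Pascal VOC layout; parse_voc and its class_names parameter are hypothetical names, not part of the code above:

```python
import xml.etree.ElementTree as ET


def parse_voc(annotation_path, class_names):
    """Parse one Pascal VOC XML file given as a plain Python str path.

    Returns (image filename, list of [xmin, ymin, xmax, ymax], list of class ids).
    parse_voc and class_names are hypothetical names used for illustration.
    """
    root = ET.parse(annotation_path).getroot()
    boxes, class_ids = [], []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        # Read the four corner coordinates in a fixed order
        boxes.append(
            [int(bndbox.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
        )
        # Map the class name to its index in the label list
        class_ids.append(class_names.index(obj.find("name").text))
    return root.find("filename").text, boxes, class_ids
```

Inside tf.py_function, the wrapper would then call something like parse_voc(annotationFile.numpy().decode("utf-8"), classNames) and convert the returned lists to tensors.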

pickle

Next, I tried packaging each record into a single file with pickle so that I would not need to parse XML with ET. Unfortunately, pickle also needs a Python string for the file path, so it hits the same problem as the first attempt and does not work.

TFRecord

After that, I tried storing the data in TFRecord files and loading them with tf.data.TFRecordDataset(). But the problem appears when writing the TFRecord: it raises TypeError: Value must be iterable. While searching I found this discussion. It seems I must flatten each tensor before writing it and restore its N-dimensional shape when reading. But the number of bounding boxes per image is unknown, which is why I wanted ragged_batch in the first place, so I could not see how to flatten them.

def serializeTFRecord(data):
    image = data["images"]
    classes = data["bounding_boxes"]["classes"]
    boxes = data["bounding_boxes"]["boxes"]

    feature = {
        "images": tf.train.Feature(float_list=tf.train.FloatList(value=image)),
        "bounding_boxes": {
            "classes": tf.train.Feature(float_list=tf.train.FloatList(value=classes)),
            "boxes": tf.train.Feature(float_list=tf.train.FloatList(value=boxes))
        }
    }

    exampleProto = tf.train.Example(features=tf.train.Features(feature=feature))
    return exampleProto.SerializeToString()
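
A variable number of boxes does not necessarily rule out flattening: an [N, 4] tensor can be written as a flat list of 4N floats and restored with a reshape to [-1, 4] when reading. The sketch below illustrates that idea under some assumptions: the image is stored as encoded bytes rather than floats, the feature dictionary is flat (tf.train.Features holds a flat map from name to Feature, so nested dictionaries cannot go in), and the helper names serialize_example and parse_example are hypothetical.

```python
import tensorflow as tf


def serialize_example(image_bytes, boxes, classes):
    """Serialize one record; boxes is [N, 4] and classes is [N], N may vary."""
    # Flatten to 1-D Python lists, because FloatList only accepts flat iterables
    flat_boxes = tf.reshape(tf.cast(boxes, tf.float32), [-1]).numpy().tolist()
    flat_classes = tf.reshape(tf.cast(classes, tf.float32), [-1]).numpy().tolist()
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "boxes": tf.train.Feature(float_list=tf.train.FloatList(value=flat_boxes)),
        "classes": tf.train.Feature(float_list=tf.train.FloatList(value=flat_classes)),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def parse_example(serialized):
    """Inverse of serialize_example; restores boxes to shape [N, 4]."""
    spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        # VarLenFeature handles a different number of values per record
        "boxes": tf.io.VarLenFeature(tf.float32),
        "classes": tf.io.VarLenFeature(tf.float32),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    boxes = tf.reshape(tf.sparse.to_dense(parsed["boxes"]), [-1, 4])
    classes = tf.sparse.to_dense(parsed["classes"])
    return parsed["image"], boxes, classes
```

When parse_example is used in tf.data.TFRecordDataset(...).map(...), the reshape should leave boxes with the static shape [None, 4], so the rank is known before any ragged batching.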

After a lot of trial and error, I eventually found a way to work around this py_function bug. According to this GitHub discussion, the return value of py_function loses its shape and rank information. The easiest fix is to restore that information manually with tensor.set_shape([None, None]), where the number of None entries equals the rank (the number of axes) of the tensor. Below is a small demo showing that it works.

import tensorflow as tf

def processing(data):
    def _processing(data):
        arr = [range(data), range(data)]
        arr = tf.cast(arr, tf.float32)
        print(f"Inside py_function arr shape: {arr.shape}")
        return arr
    
    arr = tf.py_function(_processing, [data], tf.float32)
    print(f"Outside py_function arr shape: {arr.shape}")
    arr.set_shape([None, None])
    return arr

values = [1, 2, 3, 1]  # renamed from "list" to avoid shadowing the built-in

ds = tf.data.Dataset.from_tensor_slices(values)
ds = ds.map(processing)
ds = ds.ragged_batch(4)

for data in ds:
    print("==========")
    print(data)
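
The same set_shape idea can be carried over to a loader that returns boxes and class IDs. The sketch below is an illustration rather than the full Pascal VOC loader: instead of reading XML files, the hypothetical helper _fake_annotation generates n boxes from an integer, so that the example stays self-contained. The dictionary keys follow the ones used earlier in this post.

```python
import tensorflow as tf


def load(n_boxes):
    def _fake_annotation(n):
        # Hypothetical stand-in for the XML parsing: build n boxes and n class ids
        n = int(n.numpy())
        boxes = [[0.0, 0.0, 10.0 * (i + 1), 10.0 * (i + 1)] for i in range(n)]
        classes = [float(i % 2) for i in range(n)]
        # The reshape keeps the shape [0, 4] even when there are no boxes
        boxes = tf.reshape(tf.constant(boxes, dtype=tf.float32), [-1, 4])
        return boxes, tf.constant(classes, dtype=tf.float32)

    boxes, classes = tf.py_function(
        _fake_annotation, [n_boxes], [tf.float32, tf.float32]
    )
    # Restore the rank that py_function drops: boxes is [N, 4], classes is [N]
    boxes.set_shape([None, 4])
    classes.set_shape([None])
    return {"boxes": boxes, "classes": classes}


ds = tf.data.Dataset.from_tensor_slices([1, 3, 2, 1])
ds = ds.map(load)
ds = ds.ragged_batch(4)

batch = next(iter(ds))
print(batch["boxes"].shape)
print(batch["classes"].row_lengths())
```

Setting the last axis of boxes to 4 instead of None is a design choice: only the number of boxes varies per image, so only that axis needs to stay ragged.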