Help with tfds.dataset_builders.store_as_tfds_dataset

Hi,

I created an image encoder. I want to save the latent vectors in a tfds.Dataset so I can reuse them later. The basic idea is:

  1. image → latent vector → save on disk
  2. load latent vector → use vector in other models

I don’t know whether writing a dataset builder class will work, so I want to try tfds.dataset_builders.store_as_tfds_dataset first.

Here’s the code example:

# preprocess_image outputs an image of shape (128, 128, 3)

(imagenet_ds,) = tfds.load("imagenette/160px-v2", split=["all"])
image_ds = imagenet_ds.map(lambda x: preprocess_image(x["image"])).batch(
    256, drop_remainder=False
)

def ds_generator():
    for i in image_ds.as_numpy_iterator():
        x, *_ = encoder_apply(encoder_state, i, mask_ratio=0.0, rngs=rngs)
        yield tf.convert_to_tensor(x, dtype=tf.bfloat16)


image_lantent_ds = tf.data.Dataset.from_generator(
    ds_generator,
    output_signature=(tf.TensorSpec(shape=(None, 256, 128), dtype=tf.bfloat16)),
)

image_builder = tfds.dataset_builders.store_as_tfds_dataset(
    name="image lantent",
    version="0.0.1",
    features=features.Sequence(
        {"latent": features.Tensor(shape=(256, 128), dtype=tf.bfloat16)}, length=256
    ),
    description="imagenet/v2 MAE latent vectors",
    config=None,
    release_notes={"0.0.1": "Uses mae-imagenette-918_20240606-0236 MAE checkpoint"},
    data_dir=image_ds_dir,
    split_datasets={"lantent": image_lantent_ds},
)

When I run this code, I get the following error:

TypeError: Failed to encode example:
[...a giant array...]
unhashable type: 'numpy.ndarray'

I think my features parameter is wrong, but I could not find many examples online. The documentation is very vague.

Hi @davidshen, this error occurs when something tries to hash a NumPy array, which is an unhashable object. For example, the keys of a Python dictionary must be hashable, so using an array as a key raises this error:

import numpy as np

arr = np.array([[1, 2, 3],
                [4, 2, 5]])

# Using the array as a dictionary key requires hashing it, which fails.
my_dict = {arr: 1, 'key1': 2}

TypeError: unhashable type: 'numpy.ndarray'
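In your code, the likely trigger is that each element of the dataset is a bare array (a whole batch of latents), while the features spec describes a dict with a "latent" key. Below is a minimal sketch, not a verified fix, of yielding one dict per image and describing it with a FeaturesDict. The function names latent_examples and store_latents, the dataset name "image_latent", and the "train" split are hypothetical, and random float32 values stand in for your encoder output. I have not checked whether tfds.features.Tensor accepts bfloat16, so the sketch assumes float32.

```python
import numpy as np

LATENT_SHAPE = (256, 128)  # per-image latent shape, taken from the question


def latent_examples(num_examples=4):
    """Yield one dict per image; random values stand in for the encoder output."""
    rng = np.random.default_rng(0)
    for _ in range(num_examples):
        # Each example is a dict whose keys match the FeaturesDict below.
        yield {"latent": rng.standard_normal(LATENT_SHAPE).astype(np.float32)}


def store_latents(data_dir):
    """Write the examples under data_dir as a TFDS dataset and return the builder."""
    # Imported lazily so latent_examples() can be inspected without TensorFlow.
    import tensorflow as tf
    import tensorflow_datasets as tfds

    latent_ds = tf.data.Dataset.from_generator(
        latent_examples,
        output_signature={
            "latent": tf.TensorSpec(shape=LATENT_SHAPE, dtype=tf.float32)
        },
    )
    return tfds.dataset_builders.store_as_tfds_dataset(
        name="image_latent",  # hypothetical name, written without spaces
        version="0.0.1",
        features=tfds.features.FeaturesDict(
            {"latent": tfds.features.Tensor(shape=LATENT_SHAPE, dtype=tf.float32)}
        ),
        description="MAE latent vectors (placeholder data)",
        data_dir=data_dir,
        split_datasets={"train": latent_ds},
    )
```

With your real encoder, the generator would loop over the rows of each encoded batch and yield {"latent": row} for every row, so that one example corresponds to one image. Calling store_latents(some_directory) should then write the dataset to disk.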

Thank you.