Hi,
I created an image encoder, and I want to save the latent vectors as a tfds dataset
so I can reuse them later. The basic idea is:
- image → latent vector → save to disk
- load latent vector → use the vector in other models
I don't know whether writing a full DatasetBuilder class is necessary, so I want to try
tfds.dataset_builders.store_as_tfds_dataset first.
Here is the code:
# preprocess_image outputs images of shape (128, 128, 3)
(imagenet_ds,) = tfds.load("imagenette/160px-v2", split=["all"])
image_ds = imagenet_ds.map(lambda x: preprocess_image(x["image"])).batch(
    256, drop_remainder=False
)

def ds_generator():
    for i in image_ds.as_numpy_iterator():
        x, *_ = encoder_apply(encoder_state, i, mask_ratio=0.0, rngs=rngs)
        yield tf.convert_to_tensor(x, dtype=tf.bfloat16)
image_latent_ds = tf.data.Dataset.from_generator(
    ds_generator,
    output_signature=tf.TensorSpec(shape=(None, 256, 128), dtype=tf.bfloat16),
)

image_builder = tfds.dataset_builders.store_as_tfds_dataset(
    name="image_latent",
    version="0.0.1",
    features=features.Sequence(
        {"latent": features.Tensor(shape=(256, 128), dtype=tf.bfloat16)}, length=256
    ),
    description="imagenette/160px-v2 MAE latent vectors",
    config=None,
    release_notes={"0.0.1": "Uses mae-imagenette-918_20240606-0236 MAE checkpoint"},
    data_dir=image_ds_dir,
    split_datasets={"latent": image_latent_ds},
)
When I run this code, I get the following error:
TypeError: Failed to encode example:
[...a giant array...]
unhashable type: 'numpy.ndarray'
I think my features parameter is wrong, but I could not find many examples online, and the
documentation is vague about what shape the features spec and the dataset elements should have.