Variable-sized batches from tf.data.Dataset?

Hi,

Suppose I have a tf.data.Dataset of rank-1 tensors of varying lengths, say up to 100. I would like to create a new dataset in which each element is a 1-D stack (concatenation) of a varying number of consecutive elements from the first dataset, with total length less than 200.

I know I can do this using tf.data.Dataset.from_generator, but is there a better way?
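For context, the from_generator version I have in mind looks roughly like this (the data, the `packed` helper, and the greedy packing policy are all just illustrative):

```python
import numpy as np
import tensorflow as tf

# Illustrative stand-in data: rank-1 arrays of length up to 100.
rng = np.random.default_rng(0)
arrays = [rng.standard_normal(rng.integers(1, 101)).astype(np.float32)
          for _ in range(20)]

def packed():
    # Greedily pack consecutive arrays until the next one would reach 200.
    buf, total = [], 0
    for a in arrays:
        if buf and total + len(a) >= 200:
            yield np.concatenate(buf)
            buf, total = [], 0
        buf.append(a)
        total += len(a)
    if buf:  # flush the final partial stack
        yield np.concatenate(buf)

ds = tf.data.Dataset.from_generator(
    packed, output_signature=tf.TensorSpec([None], tf.float32))
```

This works, but the packing logic runs in Python, so it can't benefit from tf.data's parallelism.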

Thanks in advance,

Henry

Hi @hrbigelow ,

You can achieve this using tf.data.Dataset.window() followed by flat_map().

  • Use window() to create sliding windows of consecutive elements.
  • Set window()'s size parameter to an upper bound on how many elements could fit in one stack, and use shift=1.
  • Apply flat_map() to each window.
  • Inside flat_map(), concatenate the window's tensors into a single 1-D tensor, then filter() out results whose total length is 200 or more.
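A minimal sketch of the steps above, assuming the source elements are dense 1-D float32 tensors and a window size of 3 (both illustrative choices; note that with shift=1 the resulting stacks overlap):

```python
import tensorflow as tf

# Toy stand-in for the source dataset: 1-D float32 tensors of varying length.
lengths = [50, 60, 30, 90, 40, 70]
rows = tf.ragged.constant([[float(i)] * n for i, n in enumerate(lengths)])
ds = tf.data.Dataset.from_tensor_slices(rows)  # yields dense 1-D tensors

WINDOW = 3   # assumed upper bound on elements per stack (illustrative)
LIMIT = 200  # maximum total length of a stack

def concat_window(window):
    # Concatenate the window's consecutive elements into one 1-D tensor.
    flat = window.reduce(
        tf.constant([], tf.float32),
        lambda acc, x: tf.concat([acc, x], axis=0),
    )
    return tf.data.Dataset.from_tensors(flat)

stacks = (
    ds.window(WINDOW, shift=1)               # runs of consecutive elements
      .flat_map(concat_window)               # each window -> one 1-D tensor
      .filter(lambda t: tf.size(t) < LIMIT)  # keep stacks shorter than 200
)

print([int(tf.size(t)) for t in stacks])
```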

This approach leverages TensorFlow's built-in dataset operations, which can offer better performance and parallelism than a Python generator.

Thank you.