I have an input pipeline whose input data needs to be updated regularly, so I use TFRecordDataset and assumed that overwriting the TFRecord file would be enough to update the pipeline. However, the pipeline appears to cache the dataset automatically, even though I never call the cache() method. Can anyone help me figure out what is making my pipeline cache the dataset?
Below is my pipeline:
import os
import tensorflow as tf

# This runs inside a class method (hence `self`).
ds = tf.data.TFRecordDataset(os.path.join(self.data_path, file_name))
ds = ds.map(self.decode_fn(is_train),
            num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Turn off auto-sharding.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.OFF)
train_dataflow = ds.with_options(options)

# Repeat indefinitely, batch, augment each batch, and prefetch.
train_ds = train_dataflow.repeat().batch(
    self.batch_size, drop_remainder=True
).map(
    autoaug_batch_process_map_fn,
    num_parallel_calls=tf.data.experimental.AUTOTUNE
).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

train_input_iterator = (
    self.strategy.experimental_distribute_dataset(
        train_ds).make_initializable_iterator())
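
To narrow down whether TFRecordDataset itself holds on to old data, a minimal standalone check could look like the sketch below. It assumes TensorFlow 2.x with eager execution, and the temporary file path and the helpers `write_records` and `read_once` are hypothetical names introduced only for this check; they are not part of the pipeline above.

```python
import os
import tempfile

import tensorflow as tf


def write_records(path, values):
    # Overwrite the TFRecord file with one record per byte string.
    with tf.io.TFRecordWriter(path) as writer:
        for value in values:
            writer.write(value)


def read_once(dataset):
    # Make one full pass over the dataset and collect the raw record bytes.
    return [record.numpy() for record in dataset]


# Hypothetical temporary file, used only for this check.
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")

write_records(path, [b"old-1", b"old-2"])
ds = tf.data.TFRecordDataset(path)
first_pass = read_once(ds)

# Overwrite the file, then iterate the same dataset object again.
write_records(path, [b"new-1", b"new-2", b"new-3"])
second_pass = read_once(ds)

print(first_pass)
print(second_pass)
```

In this sketch each pass creates a fresh iterator, so the file is opened again and the second pass reflects the overwritten contents. That differs from the pipeline above, where a single iterator is created once over repeat() and prefetch(), so this check only rules out caching inside TFRecordDataset itself.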