Dataset Service extremely slow with Dynamic sharding policy

Hi,

I have a very simple example, but setting the sharding policy to DYNAMIC vs. OFF makes a huge difference in data-loading speed.

import tensorflow as tf

# dispatcher_addr points at an already-running tf.data service dispatcher.
dataset = tf.data.Dataset.range(100)
dataset_id = tf.data.experimental.service.register_dataset(
    dataset=dataset, service=dispatcher_addr)
dataset = tf.data.experimental.service.from_dataset_id(
    processing_mode=tf.data.experimental.service.ShardingPolicy.OFF,
    service=dispatcher_addr,
    dataset_id=dataset_id)

for x in dataset:
    print(x)

is almost instantaneous. But if I change the sharding policy to DYNAMIC (to ensure at-most-once visitation), it is horrendously slow, even on this simple dataset.

Any idea what I'm doing wrong?

Hi @Alykhan_Tejani,

Welcome to the TensorFlow Forum!

Reasons for the performance difference:

  • Sharding Overhead: ShardingPolicy.OFF treats the entire dataset as a single stream, so each worker simply produces its own copy of the data. ShardingPolicy.DYNAMIC instead splits the source into shards that workers must request from the dispatcher, which adds coordination overhead for every split (see the sketch after this list).
  • At-Most-Once Guarantee: With dynamic sharding, the service guarantees that each element is processed by at most one worker. This requires extra coordination and bookkeeping compared to OFF mode, which can reduce throughput.
  • Dataset Size and Processing Time: With a small dataset like yours (100 elements), the sharding overhead outweighs the benefits. For larger datasets, where parallel processing can significantly speed up execution, the dynamic policy becomes more advantageous.
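
To see where the time goes, you can reproduce the comparison end to end. Below is a minimal sketch, assuming an in-process dispatcher and a single in-process worker (in production these run as separate server processes); the consume helper is hypothetical, added here just for timing.

import time
import tensorflow as tf

# In-process dispatcher and worker for a single-machine experiment.
dispatcher = tf.data.experimental.service.DispatchServer()
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher.target.split("://")[1]))

def consume(policy):
    dataset = tf.data.Dataset.range(100)
    dataset_id = tf.data.experimental.service.register_dataset(
        service=dispatcher.target, dataset=dataset)
    dataset = tf.data.experimental.service.from_dataset_id(
        processing_mode=policy,
        service=dispatcher.target,
        dataset_id=dataset_id,
        element_spec=dataset.element_spec)
    start = time.perf_counter()
    for _ in dataset:
        pass
    return time.perf_counter() - start

# With OFF, the worker streams the dataset directly. With DYNAMIC, the
# worker must request each split from the dispatcher, so on a tiny
# dataset the per-split round trips dominate the runtime.
print("OFF:    ", consume(tf.data.experimental.service.ShardingPolicy.OFF))
print("DYNAMIC:", consume(tf.data.experimental.service.ShardingPolicy.DYNAMIC))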

For a small dataset like yours, ShardingPolicy.OFF wins because its overhead is minimal, even if the per-element processing is somewhat expensive. On the other hand, if your processing is expensive and your dataset is large, ShardingPolicy.DYNAMIC can bring significant speedups through parallel processing across workers, as sketched below.
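
The sketch below (a hypothetical continuation of the snippet above, reusing its dispatcher) makes each element expensive and adds a second worker, so DYNAMIC sharding has real work to split; the expensive function is just a stand-in for costly preprocessing.

# Hypothetical continuation: reuses `dispatcher` from the sketch above.
# A second worker gives DYNAMIC sharding somewhere to spread the load.
worker2 = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher.target.split("://")[1]))

def expensive(x):
    # Stand-in for costly per-element preprocessing.
    m = tf.ones([1024, 1024])
    return tf.reduce_sum(tf.linalg.matmul(m, m)) + tf.cast(x, tf.float32)

dataset = tf.data.Dataset.range(100).map(
    expensive, num_parallel_calls=tf.data.AUTOTUNE)
dataset_id = tf.data.experimental.service.register_dataset(
    service=dispatcher.target, dataset=dataset)
dataset = tf.data.experimental.service.from_dataset_id(
    processing_mode=tf.data.experimental.service.ShardingPolicy.DYNAMIC,
    service=dispatcher.target,
    dataset_id=dataset_id,
    element_spec=dataset.element_spec)
for _ in dataset:
    pass  # With expensive elements, the split coordination is amortized.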

You can refer to this documentation for more details.

Thank you!

Thanks for your response. In general, isn't a dynamic policy preferred, since you would like an epoch to mean one pass over all of the data?

Basically, the DYNAMIC policy is preferable when the dataset is large, so that splitting the data across workers for parallel processing pays off.

Hope it helps.