Dataset Service extremely slow with Dynamic sharding policy

Hi,

I have a very simple example, but setting the sharding policy to DYNAMIC vs. OFF makes a huge difference in data-loading speed.

import tensorflow as tf

# dispatcher_addr points at an already-running tf.data service dispatcher.
dataset = tf.data.Dataset.range(100)
dataset_id = tf.data.experimental.service.register_dataset(
    dataset=dataset, service=dispatcher_addr)
dataset = tf.data.experimental.service.from_dataset_id(
    processing_mode=tf.data.experimental.service.ShardingPolicy.OFF,
    service=dispatcher_addr,
    dataset_id=dataset_id)

for x in dataset:
    print(x)

is almost instantaneous. But if I change the sharding policy to DYNAMIC (to ensure at-most-once visitation), it is horrendously slow, even on this simple dataset.

Any idea what I'm doing wrong?

Hi @Alykhan_Tejani,

Welcome to the TensorFlow Forum!

Reasons for the performance difference:

  • Sharding Overhead: ShardingPolicy.OFF treats the entire dataset as a single stream, so each worker simply produces its own copy of the data. ShardingPolicy.DYNAMIC instead splits the source into shards that workers must request from the dispatcher, which adds coordination overhead for every split (see the sketch after this list).
  • At-Most-Once Guarantee: With dynamic sharding, the service guarantees that each element is processed by at most one worker. This requires extra coordination and bookkeeping compared to OFF mode, which can reduce throughput.
  • Dataset Size and Processing Time: With a small dataset like yours (100 elements), the sharding overhead outweighs the benefits. For larger datasets, where parallel processing can significantly speed up execution, the dynamic policy becomes more advantageous.
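
To see where the time goes, you can reproduce the comparison end to end. Below is a minimal sketch, assuming an in-process dispatcher and a single in-process worker (in production these run as separate server processes); the consume helper is hypothetical, added here just for timing.

import time
import tensorflow as tf

# In-process dispatcher and worker for a single-machine experiment.
dispatcher = tf.data.experimental.service.DispatchServer()
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher.target.split("://")[1]))

def consume(policy):
    dataset = tf.data.Dataset.range(100)
    dataset_id = tf.data.experimental.service.register_dataset(
        service=dispatcher.target, dataset=dataset)
    dataset = tf.data.experimental.service.from_dataset_id(
        processing_mode=policy,
        service=dispatcher.target,
        dataset_id=dataset_id,
        element_spec=dataset.element_spec)
    start = time.perf_counter()
    for _ in dataset:
        pass
    return time.perf_counter() - start

# With OFF, the worker streams the dataset directly. With DYNAMIC, the
# worker must request each split from the dispatcher, so on a tiny
# dataset the per-split round trips dominate the runtime.
print("OFF:    ", consume(tf.data.experimental.service.ShardingPolicy.OFF))
print("DYNAMIC:", consume(tf.data.experimental.service.ShardingPolicy.DYNAMIC))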

For a small dataset like yours, ShardingPolicy.OFF wins because its overhead is minimal, even if the per-element processing is somewhat expensive. On the other hand, if your processing is expensive and your dataset is large, ShardingPolicy.DYNAMIC can bring significant speedups through parallel processing across workers, as sketched below.
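
The sketch below (a hypothetical continuation of the snippet above, reusing its dispatcher) makes each element expensive and adds a second worker, so DYNAMIC sharding has real work to split; the expensive function is just a stand-in for costly preprocessing.

# Hypothetical continuation: reuses `dispatcher` from the sketch above.
# A second worker gives DYNAMIC sharding somewhere to spread the load.
worker2 = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher.target.split("://")[1]))

def expensive(x):
    # Stand-in for costly per-element preprocessing.
    m = tf.ones([1024, 1024])
    return tf.reduce_sum(tf.linalg.matmul(m, m)) + tf.cast(x, tf.float32)

dataset = tf.data.Dataset.range(100).map(
    expensive, num_parallel_calls=tf.data.AUTOTUNE)
dataset_id = tf.data.experimental.service.register_dataset(
    service=dispatcher.target, dataset=dataset)
dataset = tf.data.experimental.service.from_dataset_id(
    processing_mode=tf.data.experimental.service.ShardingPolicy.DYNAMIC,
    service=dispatcher.target,
    dataset_id=dataset_id,
    element_spec=dataset.element_spec)
for _ in dataset:
    pass  # With expensive elements, the split coordination is amortized.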

You can refer to this documentation for more details.

Thank you!

Thanks for your response. In general, isn't a dynamic policy preferred, since you would like an epoch to mean one pass over all of the data?

Basically, the DYNAMIC policy is preferable when the dataset is large, so that splitting the data across workers for parallel processing pays off.

Hope it helps.