Impact of distribution strategy on keras SavedModel variables size on disk

When I save a tf.keras model created and compiled under a tf.distribute.MirroredStrategy scope on a Vertex AI Workbench instance with 4 T4 GPUs attached, the resulting SavedModel variables are roughly 3x the size on disk (1.3GB) of those from a model compiled on the same instance with the default strategy and then saved (430MB).
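For reference, a minimal sketch of what I'm doing. The model architecture and the save paths here are placeholders; my real model has ~430MB of variables:

```python
import tensorflow as tf

def build_model():
    # Placeholder architecture; the actual model is much larger.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu", input_shape=(2048,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Default strategy: variables end up ~430MB on disk for my model.
model = build_model()
model.save("saved_default")

# MirroredStrategy across the 4 T4 GPUs: variables are ~3x larger (~1.3GB).
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    mirrored_model = build_model()
mirrored_model.save("saved_mirrored")
```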

If I reload the 1.3GB saved model with the default strategy, then resave it, the variables remain at 1.3GB, instead of shrinking to 430MB.
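Concretely, something like this (using the placeholder paths from my sketch above):

```python
import tensorflow as tf

# Reload the mirrored SavedModel under the default strategy (no scope)...
reloaded = tf.keras.models.load_model("saved_mirrored")

# ...then resave it. The variables directory stays at ~1.3GB
# instead of shrinking to the ~430MB of the default-strategy save.
reloaded.save("saved_resaved")
```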

I don’t even have to train the compiled model to see these differences.

I’ve read the guides and tutorials about model saving and loading, and I’m still struggling to understand why this happens. Can anyone shed some light? Is this known behavior?

Hello @jasonbrancazio,
Thank you for using TensorFlow.

It seems the issue is on the model-saving side: MirroredStrategy creates a replica of the model's variables on each of the GPUs and keeps them in sync during training, and when the model is saved those replicated variables are written out with it. That is why the SavedModel is roughly 3x larger than one saved from a model built with the default (single-device) strategy. As for loading and resaving: the saving and loading guide documents that after restoring the model you can continue training it, even without needing to call Model.compile, since it was already compiled before saving. The restored model is therefore the same model as before, which is also why resaving it reproduces the same 1.3GB of variables rather than shrinking them back to 430MB.
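If it helps to confirm, you can list what is actually stored in each SavedModel's checkpoint and compare the two. The path below assumes the placeholder name from the question's sketch; `variables/variables` is the checkpoint prefix that the Keras SavedModel export writes inside the model directory:

```python
import tensorflow as tf

# List every variable stored in the SavedModel's checkpoint, with its shape.
# Comparing the listings for the mirrored and default saves shows where
# the extra size comes from.
for name, shape in tf.train.list_variables("saved_mirrored/variables/variables"):
    print(name, shape)
```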