Model checkpointing best practices when using `train_step()`

Sayak_Paul · May 30, 2021, 6:42am

Subclassing tf.keras.Model and overrding its train_step() function give us the kind of flexibility we need to control our training loops. It allows for easily plugging in our favorite callbacks, and almost do all kinds of stuff that are readily available during model.fit().

I am wondering how to use the ModelCheckpoint callback in this instance. Consider the following use-case (taken from here):

class SimSiam(tf.keras.Model):
    def __init__(self, encoder, predictor):
        super(SimSiam, self).__init__()
        self.encoder = encoder
        self.predictor = predictor
        self.loss_tracker = tf.keras.metrics.Mean(name="loss")

How do I set up a ModelCheckpoint callback for this one?

Bhack · May 30, 2021, 11:15am

Can you use something like

Training checkpoints | TensorFlow Core ?

Sayak_Paul · May 30, 2021, 11:29am

Of course. But that somehow defeats the purpose of progressive disclosure of complexity IMO. I wanted to able to focus on my training loop and delegate rest of the things to the framework whenever possible.

And the link does not elaboratively suggest a workaround for the use-case I mentioned.

Bhack · May 30, 2021, 1:28pm

This is more that what you are looking for as It Is relative also to write a custom callback in a custom train loop:

But also in your case if you manually populate the CallbackList

I think that you still need to trigger the epoch event in your custom training loop.

Sayak_Paul · May 30, 2021, 1:55pm

I guess for that I might need to discard the train_step() override which I don’t want to do. I will study the links you shared and get back.

Mark_Daoust · June 1, 2021, 1:17pm

Could you clarify the question? Why doesn’t it work to just pass the callback to .fit like you normally would?

Yes. If you’re writing your own training loop, you need to drive the callbacks section using callback list here:

But if you’er using .train_step and .fit, all the callbacks should be driven as normal, no?

Sayak_Paul · June 1, 2021, 1:52pm

Callback list seems to be a good option. I will try it out.

If your subclassed model (where I am overriding train_step()) contains two or more models and if you are passing ModelCheckpoint callback while calling .fit() on the subclassed model the callback would get confused.

Mark_Daoust · June 1, 2021, 3:23pm

That shouldn’t be like that. Models are supposed to be nestable.

…

The problem here is that the callback is defaulting to saving the model in HDF5 format (which apparently requires that to call .fit to set the input shape, and we don’t call fit on the nested mopdels.).

Set save_weights_only=True to save in the tensorflow checkpoint format and then it works.

Sayak_Paul · June 1, 2021, 3:41pm

Okay. Let me test-drive this on the following since it has two networks present in the subclassed model SimSiam:

But I think it should be ambiguous for the callback to determine which network (out of the two) it should serialize, though.

Sayak_Paul · June 1, 2021, 3:52pm

My hunch was totally wrong it seems. It seems to work right off the bat with save_weights_only=True:

Mark_Daoust · June 1, 2021, 4:01pm

But I think it should be ambiguous for the callback to determine which network (out of the two) it should serialize, though.

It saves a checkpoint of the whole SimSiam Model, that captures both of the nested models.

Sayak_Paul · June 1, 2021, 4:14pm

That’s what I noticed. Sorry about the botheration.

Topic		Replies	Views
Spurious metric tracking in ModelCheckpoint General Discussion api , keras , callbacks , help_request	4	611	June 13, 2021
Checkpoint not saved when using ModelCheckpoint with save_freq = 5000 General Discussion api , keras , callbacks , help_request	3	5744	April 25, 2022
ModelCheckpoint fails to format filename if save_freq is used Keras api , callbacks , help_request	1	1882	November 17, 2023
Subclassed model not saved by ModelCheckpoint callback Keras callbacks , model , tensorflow	1	73	June 27, 2025
How to design tf.keras callback to save model predictions for each batch and each epoch General Discussion api , keras , callbacks	1	956	January 13, 2023

Model checkpointing best practices when using `train_step()`

Related topics