How to properly save and load model weights

Hi,
I’ve been working on the MIT Intro To Deep Learning class and I’ve run into a bug in Lab 1
Part 2.

The supplied code has you define a checkpoint directory in section 2.5

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "my_ckpt")

Later in that section it has you save some weights to it

...
  if iter % 100 == 0:
    model.save_weights(checkpoint_prefix)

# Save the trained model and the weights
model.save_weights(checkpoint_prefix)
experiment.flush()

That fails unless I add ‘.weights.h5’ to the end of “my_ckpt” above.

But once I do that and get to step 2.6, it fails on
model.load_weights(checkpoint_prefix)
with a request to implement ‘def build_from_config()’

It seems odd that the MIT Deep Learning course would have a step that requires me to go edit the libraries. That makes me think I’m following the wrong trail altogether.
Is there something else I might be missing?

Solutions guide, for reference: introtodeeplearning/lab1/solutions/Part2_Music_Generation_Solution.ipynb at master · aamini/introtodeeplearning · GitHub

Can you paste the exact error?

Also, the TF version.

From my experience, this problem is caused by the TF version. As TF is updated, the default saving format changes. Moreover, the .save_weights() method and the .save() method behave differently: .save() saves the entire model (architecture + weights), while .save_weights() saves just the weights (useful for applying them to a different architecture). If you use the .save_weights() method, you need to re-instantiate the network before loading. The provided code looks right; have you changed the original code at all?

You can refer to the official TF docs for details: Save and load models
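To make the .save()/.save_weights() split concrete, here is a stdlib-only sketch (not the real Keras API; TinyModel and its methods are invented stand-ins): the weights file carries no architecture information, so the same architecture has to be re-instantiated before loading.

```python
import json, tempfile, os

# Minimal stand-in for the Keras save_weights()/load_weights() split:
# weights are stored separately from the architecture, so the same
# architecture must be re-instantiated before loading. Illustrative
# sketch only -- not the real Keras API.
class TinyModel:
    def __init__(self, layer_sizes):
        self.layer_sizes = list(layer_sizes)          # the "architecture"
        self.weights = {f"layer_{i}": [0.0] * n       # the "parameters"
                        for i, n in enumerate(layer_sizes)}

    def save_weights(self, path):
        # Only the weights go to disk -- no architecture information.
        with open(path, "w") as f:
            json.dump(self.weights, f)

    def load_weights(self, path):
        with open(path) as f:
            stored = json.load(f)
        # Loading fails unless the architecture already matches,
        # mirroring why Keras asks you to rebuild the model first.
        for name, values in stored.items():
            if name not in self.weights or len(self.weights[name]) != len(values):
                raise ValueError(f"architecture mismatch at {name}")
            self.weights[name] = values

# "Train" (mutate weights), save, then restore into a fresh instance.
m = TinyModel([4, 2])
m.weights["layer_0"] = [1.0, 2.0, 3.0, 4.0]
path = os.path.join(tempfile.mkdtemp(), "my_ckpt.weights.json")
m.save_weights(path)

fresh = TinyModel([4, 2])          # re-instantiate the same architecture
fresh.load_weights(path)
print(fresh.weights["layer_0"])    # the saved values are restored
```

The point being: a weights-only checkpoint is meaningless without the code that rebuilds the matching model, which is why the lab calls build_model(...) again before load_weights().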

print(tf.keras.__version__)
returns
3.3.3

The full error message is below.

There does seem to be an issue with the versions, but I’m confused about how. I’m going on the assumption that it’s possible to get the solutions MIT provided for the labs to run correctly without having to rewrite their libraries (or add a build_from_config()).

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[41], line 6
      2 model = build_model(vocab_size, params["embedding_dim"], params["rnn_units"], batch_size=1)
      4 # Restore the model weights for the last checkpoint after training
      5 model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
----> 6 model.load_weights(tf.train.latest_checkpoint(weights_location))
      7 model.load_weights(weights_location)
      8 model.build(tf.TensorShape([1, None]))

File ~/introtodeeplearning/.venv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    119 filtered_tb = _process_traceback_frames(e.__traceback__)
    120 # To get the full stack trace, call:
    121 # `keras.config.disable_traceback_filtering()`
--> 122 raise e.with_traceback(filtered_tb) from None
    123 finally:
    124     del filtered_tb

File ~/introtodeeplearning/.venv/lib/python3.12/site-packages/keras/src/saving/saving_api.py:262, in load_weights(model, filepath, skip_mismatch, **kwargs)
    260     legacy_h5_format.load_weights_from_hdf5_group(f, model)
    261 else:
--> 262     raise ValueError(
    263         f"File format not supported: filepath={filepath}. "
    264         "Keras 3 only supports V3 .keras and .weights.h5 "
    265         "files, or legacy V1/V2 .h5 files."
    266     )

ValueError: File format not supported: filepath=None. Keras 3 only supports V3 .keras and .weights.h5 files, or legacy V1/V2 .h5 files.

Possible explanation

The current notebook is probably running TensorFlow 2.16. Why? Because it shows Keras 3.3.3, which is the Keras version installed alongside TensorFlow 2.16.

When you install TensorFlow 2.16, the Keras version installed under the hood is Keras 3.

Keras 3 is a bit different from Keras 2; in particular, it is more strict.

Possible solution

Instead, try installing TensorFlow 2.0, as the notebook indicates:

pip install tensorflow==2.0

Keras should then be downgraded to Keras 2.x as well.


PS: Keras 3 is quite strict about subclassing: it expects serialization decorators on custom layers, and sometimes even @tf.function decorators. It also requires specific methods to be present, like build_from_config(), apparently. Ideally it also wants you to use the .keras format, a zip archive that bundles the model config and the weights.
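For what it’s worth, the two method names in the error roughly form a pair: get_build_config() records at save time whatever build() needs, and build_from_config() recreates the layer’s state at load time, before the weights arrive. Here is a plain-Python sketch of that contract (SketchLayer is invented for illustration; this is not the actual Keras base class):

```python
# Sketch of the build-config contract that Keras 3's error refers to:
# get_build_config() captures what build() needs at save time, and
# build_from_config() recreates the layer's variables at load time,
# so there is something for the stored weights to be loaded into.
# Plain-Python illustration only -- not the real Keras implementation.
class SketchLayer:
    def __init__(self, units):
        self.units = units
        self.built = False
        self.kernel = None

    def build(self, input_dim):
        # Create the layer's variables for a known input shape.
        self.kernel = [[0.0] * self.units for _ in range(input_dim)]
        self.built = True

    def get_build_config(self):
        # Called at save time: record what build() will need.
        return {"input_dim": len(self.kernel)} if self.built else {}

    def build_from_config(self, config):
        # Called at load time: recreate variables before weights arrive.
        if "input_dim" in config:
            self.build(config["input_dim"])

layer = SketchLayer(units=3)
layer.build(input_dim=5)
config = layer.get_build_config()       # {"input_dim": 5}

restored = SketchLayer(units=3)
restored.build_from_config(config)      # state exists before loading weights
print(restored.built)                   # True
```

This is also why "built=False" shows up in the failure list: loading stops when a layer has no variables yet and no recorded way to create them.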

Thank you.
That makes sense now.

I tried to run

pip install tensorflow==2.0
and got:

ERROR: Could not find a version that satisfies the requirement tensorflow==2.0 (from versions: 2.16.0rc0, 2.16.1)
ERROR: No matching distribution found for tensorflow==2.0

So then I tried:
/usr/bin/python3.11 -m pip install tensorflow==2.0
and got

Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement tensorflow==2.0 (from versions: 2.12.0rc0, 2.12.0rc1, 2.12.0, 2.12.1, 2.13.0rc0, 2.13.0rc1, 2.13.0rc2, 2.13.0, 2.13.1, 2.14.0rc0, 2.14.0rc1, 2.14.0, 2.14.1, 2.15.0rc0, 2.15.0rc1, 2.15.0, 2.15.0.post1, 2.15.1, 2.16.0rc0, 2.16.1)
ERROR: No matching distribution found for tensorflow==2.0

and then down the line to the lowest supported version (as per Install TensorFlow with pip):

/usr/bin/python3.10 -m pip install tensorflow==2.0
Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement tensorflow==2.0 (from versions: 2.8.0rc0, 2.8.0rc1, 2.8.0, 2.8.1, 2.8.2, 2.8.3, 2.8.4, 2.9.0rc0, 2.9.0rc1, 2.9.0rc2, 2.9.0, 2.9.1, 2.9.2, 2.9.3, 2.10.0rc0, 2.10.0rc1, 2.10.0rc2, 2.10.0rc3, 2.10.0, 2.10.1, 2.11.0rc0, 2.11.0rc1, 2.11.0rc2, 2.11.0, 2.11.1, 2.12.0rc0, 2.12.0rc1, 2.12.0, 2.12.1, 2.13.0rc0, 2.13.0rc1, 2.13.0rc2, 2.13.0, 2.13.1, 2.14.0rc0, 2.14.0rc1, 2.14.0, 2.14.1, 2.15.0rc0, 2.15.0rc1, 2.15.0, 2.15.0.post1, 2.15.1, 2.16.0rc0, 2.16.1)
ERROR: No matching distribution found for tensorflow==2.0
/usr/bin/python3.9 -m pip install tensorflow==2.0
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3/dist-packages/pip/__main__.py", line 22, in <module>
    from pip._internal.cli.main import main as _main
  File "/usr/lib/python3/dist-packages/pip/_internal/cli/main.py", line 10, in <module>
    from pip._internal.cli.autocompletion import autocomplete
  File "/usr/lib/python3/dist-packages/pip/_internal/cli/autocompletion.py", line 10, in <module>
    from pip._internal.cli.main_parser import create_main_parser
  File "/usr/lib/python3/dist-packages/pip/_internal/cli/main_parser.py", line 9, in <module>
    from pip._internal.build_env import get_runnable_pip
  File "/usr/lib/python3/dist-packages/pip/_internal/build_env.py", line 19, in <module>
    from pip._internal.cli.spinners import open_spinner
  File "/usr/lib/python3/dist-packages/pip/_internal/cli/spinners.py", line 9, in <module>
    from pip._internal.utils.logging import get_indentation
  File "/usr/lib/python3/dist-packages/pip/_internal/utils/logging.py", line 29, in <module>
    from pip._internal.utils.misc import ensure_dir
  File "/usr/lib/python3/dist-packages/pip/_internal/utils/misc.py", line 44, in <module>
    from pip._internal.locations import get_major_minor_version
  File "/usr/lib/python3/dist-packages/pip/_internal/locations/__init__.py", line 66, in <module>
    from . import _distutils
  File "/usr/lib/python3/dist-packages/pip/_internal/locations/_distutils.py", line 20, in <module>
    from distutils.cmd import Command as DistutilsCommand
ModuleNotFoundError: No module named 'distutils.cmd'

It’s kind of encouraging that I’m getting different errors, since that suggests that specifying other Python versions might be taking me in the right direction. Is there some other way I can tell it where to get TensorFlow 2.0?

PS: I’ve been reading that Keras 3 is pretty strict. That seems like a good thing, particularly for bigger and more complex projects. For starters, though, I’d like to just go through something that’s (mostly) known to work.

I think you can try tensorflow==2.14, or maybe tensorflow==2.14.* (pip does accept wildcards with ==). The important thing is to end up with a Keras similar to the one your reference repo used at the time.

A “tested with” table of TensorFlow versions against Python versions is here and may be of use (but Python 3.9-3.11 should be fine for TF 2.14): Build from source  |  TensorFlow
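Since TF 2.16 is the release that switched the bundled Keras to Keras 3, a quick stdlib check of a version string tells you which saving behavior to expect (the 2.16 cutoff is the assumption here; bundles_keras3 is a hypothetical helper name):

```python
# Rough check: TensorFlow 2.16 is the first release bundling Keras 3,
# so anything older should still give the Keras 2 saving behavior the
# course notebooks were written against. Stdlib-only comparison; the
# 2.16 cutoff is the assumption baked into this sketch.
def bundles_keras3(tf_version: str) -> bool:
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    return (major, minor) >= (2, 16)

print(bundles_keras3("2.14.1"))   # False -> Keras 2 behavior expected
print(bundles_keras3("2.16.1"))   # True  -> Keras 3 strictness applies
```

So any of the 2.14.x or 2.15.x wheels your pip listed above should avoid the Keras 3 strictness entirely.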

Official TensorFlow 2.16 + Keras 3.0 Documentation via JARaaS Hybrid RAG - 6/17/2024
Note: Sources at the end of the response.

Based on the information you provided and the sources from the relevant RAG text, here’s a comprehensive guide to address the issue regarding saving and loading model weights in TensorFlow for MIT’s Intro to Deep Learning Lab 1, Part 2:

Troubleshooting Guide

  1. Checkpoint Directory and Prefix:
    Ensure you correctly define the checkpoint directory and prefix:

    checkpoint_dir = './training_checkpoints'
    checkpoint_prefix = os.path.join(checkpoint_dir, "my_ckpt")
    
  2. Saving Weights:
    When saving weights, TensorFlow typically uses the .ckpt extension for checkpoints. The correct approach for saving weights without specifying an extension is:

    model.save_weights(checkpoint_prefix)
    

    However, if you wish to save in the HDF5 format (with .h5 extension), modify the save method as follows:

    model.save_weights(f"{checkpoint_prefix}.h5")
    
  3. Loading Weights:
    When loading weights, ensure that you reference the correct file format that was used to save. For the default TensorFlow checkpoint format, you would use:

    model.load_weights(checkpoint_prefix)
    

    If saved with the .h5 extension, adjust the load method accordingly:

    model.load_weights(f"{checkpoint_prefix}.h5")
    
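One caveat on the extensions above: under Keras 3 specifically, save_weights() insists the path end in .weights.h5 (an error message later in this thread confirms this). A small helper can normalize a checkpoint prefix before saving or loading; weights_path is a hypothetical name, not a Keras API:

```python
import os

# Keras 3 requires weights filepaths to end in ".weights.h5".
# Hypothetical helper to normalize a checkpoint prefix before calling
# model.save_weights()/model.load_weights() -- not part of Keras itself.
def weights_path(prefix: str) -> str:
    suffix = ".weights.h5"
    return prefix if prefix.endswith(suffix) else prefix + suffix

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "my_ckpt")
print(weights_path(checkpoint_prefix))
# ./training_checkpoints/my_ckpt.weights.h5
```

Under older TF/Keras 2 versions, the bare prefix (TF checkpoint format) or a plain .h5 path still works as described above.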

Addressing the ‘build_from_config()’ Issue

The requirement to implement build_from_config() generally indicates that the model architecture needs to be rebuilt or the checkpoint format isn’t correctly recognized. You should ensure that the model’s architecture is defined before calling model.load_weights().

Example Code Snippet

Here’s a correct way to save and load weights, ensuring the correct format and handling the model appropriately:

# Define the directory and prefix
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "my_ckpt")

# Ensure directory exists
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)

# Save weights during training
if iter % 100 == 0:
    model.save_weights(checkpoint_prefix)

# Save the trained model and weights at the end of training
model.save_weights(checkpoint_prefix)

# Loading the weights
model.load_weights(checkpoint_prefix)

# Ensure the model architecture is defined before loading weights
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Continue training or evaluating
model.fit(train_data, train_labels, epochs=10)

Sources:

  • Save and Load Weights: save_and_load.ipynb, save_and_restore_models.ipynb (internal documents)
  • Fault Tolerance and Checkpoints: fault_tolerance.ipynb (internal document)

This should address the issues you’re encountering with the checkpoint system in your deep learning lab.

Thank you for posting such detailed instructions @Tim_Wolfe

I just created a new virtualenv and gave that a try.
I ended up having to use:
model.save_weights(f"{checkpoint_prefix}.weights.h5")
Both
model.save_weights(checkpoint_prefix)
and
model.save_weights(f"{checkpoint_prefix}.h5")
returned:
ValueError: The filename must end in .weights.h5. Received: filepath=./training_checkpoints/my_ckpt

After I made that change, everything worked up until the load_weights() line.
I tried to add:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
right before
model.load_weights(f"{checkpoint_prefix}.weights.h5")
The compile step didn’t return any errors but the load_weights still returned:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[33], line 9
      4 # Restore the model weights for the last checkpoint after training
      5 #model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
      6 #model.load_weights(tf.train.latest_checkpoint(weights_location))
      7 #model.load_weights(weights_location)
      8 model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
----> 9 model.load_weights(f"{checkpoint_prefix}.weights.h5")
     10 model.build(tf.TensorShape([1, None]))
     12 model.summary()

File ~/introtodeeplearning/.venv_latest/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    119 filtered_tb = _process_traceback_frames(e.__traceback__)
    120 # To get the full stack trace, call:
    121 # `keras.config.disable_traceback_filtering()`
--> 122 raise e.with_traceback(filtered_tb) from None
    123 finally:
    124     del filtered_tb

File ~/introtodeeplearning/.venv_latest/lib/python3.12/site-packages/keras/src/saving/saving_lib.py:295, in _raise_loading_failure(error_msgs, warn_only)
    293     warnings.warn(msg)
    294 else:
--> 295     raise ValueError(msg)

...

2. You need to implement the `def build_from_config(self, config)` method on layer 'embedding_11', to specify how to rebuild it during loading. In this case, you might also want to implement the method that generates the build config at saving time, `def get_build_config(self)`. The method `build_from_config()` is meant to create the state of the layer (i.e. its variables) upon deserialization. List of objects that could not be loaded: [<Embedding name=embedding_11, built=False>, <LSTMCell name=lstm_cell, built=False>, <Dense name=dense_11, built=False>]

I tried checking Model training APIs to see if I’m missing something after the compile step but came up empty.