I’m writing a machine learning library, and one capability I’m trying to support is representing an entire model as a single zip file. When loading a TensorFlow model and preparing it for training, I need to unpack the weights into a temp directory, then load them from there into the Keras model. After this, the temp directory is removed.
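For illustration, the load pattern described above can be sketched with only the standard library. This is a simplified stand-in, not the library's actual code: `load_from_zip` and `deferred_loader` are hypothetical names, and `deferred_loader` merely imitates a restore call that records the checkpoint path and reads from it later.

```python
import os
import tempfile
import zipfile


def deferred_loader(ckpt_path):
    """Hypothetical stand-in for a restore call that reads some values lazily."""
    def read_later():
        # The file is only opened when this closure is eventually called.
        with open(ckpt_path, "rb") as f:
            return f.read()
    return read_later


def load_from_zip(zip_path, member, load_fn):
    """Unpack zip_path into a temp dir, call load_fn on one member, then clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp_dir)
        ckpt_path = os.path.join(tmp_dir, member)
        pending = load_fn(ckpt_path)
    # The temp directory has been deleted by this point.
    return ckpt_path, pending
```

With a loader that reads everything eagerly this pattern is harmless. With a loader that defers any read, calling `pending()` after `load_from_zip` returns raises `FileNotFoundError`, because the extracted files are already gone.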
I’m finding that when it’s time for training, components like the optimizer try to read their weights from the now-missing temp directory. This is surprising, because I thought Keras would load everything when I called `model.load_weights` or `tf.train.Checkpoint(mdl).restore(checkpoint_path).expect_partial()`. Apparently, these `Op:RestoreV2` ops are being placed in the graph, and when they execute, they try to read from disk. Here’s an example of what I’m getting when `fit` is called:
Epoch 5/10000
2022-05-20 16:56:04.538743: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_tensor.cc:182 : NOT_FOUND: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /tmp/tmpicur7bqg/checkpoints/ckpt-1
Exception encountered in context thread! pid: 494291
Traceback (most recent call last):
File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 33, in run
super().run()
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 513, in __call__
self.final_call(f, ctx_ret_q, tune_report_q, checkpoint_req_q, checkpoint_ret_q, *args, **kwargs)
File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 499, in final_call
res = f(*args, **kwargs, tune_report_q=tune_report_q, checkpoint_req_q=checkpoint_req_q, checkpoint_ret_q=checkpoint_ret_q)
File "/home/mkrafcz2/HAL_Projects/DRYML/dense_layer_lib_2.py", line 266, in train_mnist_object
model.train(train_ds, train_spec=train_state, train_callbacks=callbacks, verbose=2, batch_size=32*batch_multiplier)
File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 258, in wrapped_func
res = f(*args, **kwargs)
File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/models/dry_pipe.py", line 35, in train
step.train(last_val, *args, train_spec=train_spec, train_callbacks=train_callbacks, **kwargs)
File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 258, in wrapped_func
res = f(*args, **kwargs)
File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/models/tf/tf_base.py", line 403, in train
self.train_fn(self, data, *args, train_spec=train_spec, train_callbacks=train_callbacks, **kwargs)
File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 258, in wrapped_func
res = f(*args, **kwargs)
File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/models/tf/tf_base.py", line 321, in __call__
trainable.model.mdl.fit(
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 1129, in autograph_handler
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.NotFoundError: in user code:
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function *
return step_function(self, iterator)
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step **
outputs = model.train_step(data)
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/engine/training.py", line 816, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 532, in minimize
return self.apply_gradients(grads_and_vars, name=name)
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 639, in apply_gradients
self._create_all_weights(var_list)
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 828, in _create_all_weights
_ = self.iterations
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 835, in __getattribute__
return super(OptimizerV2, self).__getattribute__(name)
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 990, in iterations
self._iterations = self.add_weight(
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 1192, in add_weight
variable = self._add_variable_with_custom_getter(
File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/engine/base_layer_utils.py", line 117, in make_variable
return tf.compat.v1.Variable(
NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /tmp/tmpicur7bqg/checkpoints/ckpt-1 [Op:RestoreV2]
Is there any way to force TensorFlow to actually read these values when I load the model? Or am I stuck having to come up with some scheme to persist these weight directories temporarily?
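For concreteness, the fallback of persisting the directory could look something like the sketch below. It is one possible shape for such a scheme, not an established API: `UnpackedArchive` is a hypothetical helper that extracts the zip once and keeps the files on disk until it is explicitly closed, so the directory can outlive both the load call and training.

```python
import shutil
import tempfile
import zipfile


class UnpackedArchive:
    """Hypothetical helper: keeps an extracted zip on disk until close() is called."""

    def __init__(self, zip_path):
        self.path = tempfile.mkdtemp()
        try:
            with zipfile.ZipFile(zip_path) as zf:
                zf.extractall(self.path)
        except Exception:
            # Do not leave a half-extracted directory behind on failure.
            shutil.rmtree(self.path, ignore_errors=True)
            raise

    def close(self):
        # Safe to call more than once.
        if self.path is not None:
            shutil.rmtree(self.path, ignore_errors=True)
            self.path = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
```

Under this assumption, the weights would be loaded and `fit` would be called inside a single `with UnpackedArchive(zip_path) as archive:` block, so any read that happens late still finds its files under `archive.path`.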