I have a general question about how to do warm starting with models using text features, but in the context of TFX. I’ve tried to simplify the real problem/question into something smaller/idealized. If I have:
- A Keras Model
- That uses TFT, including feature(s) that use compute_and_apply_vocabulary to transform text into a one-hot encoded representation
How can I do warm start with TFX, assuming that the vocabulary will slightly change over time? I know the following:
- base_model is an input to the trainer component that you can use to load a previous model to do warm starting
- there exists a warmstart_embedding_matrix, which seems built for this purpose
But what isn’t clear is how to correctly connect the TFT vocabularies from the old and new models with warmstart_embedding_matrix. The examples seem to assume a simplified scenario, and they also seem specific to TextVectorization, which is a bit different from the TFT vocab solutions.
TFT doesn’t have a solution for vocab warmstarting out-of-the-box (only implicitly through operating on a rolling range of data), but Trainer may offer a solution for this. Checking now.
Trainer allows you to provide a warm start model (base_model), from which you could extract the old TFT graph and ostensibly the old vocabularies. It seems like warmstart_embedding_matrix is designed to solve the problem of changing vocabulary, which seems appropriate to address TFT, but it’s just not very clear how to wire it up correctly, especially since some TFT vocabulary uses hash buckets (OOV etc)
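One way to think about the hash-bucket concern is to write down which token each embedding row stands for. The sketch below is only an illustration under stated assumptions: it assumes the TFT vocabulary is stored as a plain text file with one token per line (no stored frequencies), that in-vocabulary ids follow file order, and that OOV ids occupy the range right after the vocabulary. The helper name embedding_row_labels and the [OOV_BUCKET_i] placeholder labels are hypothetical, not part of TFT.

```python
import os
import tempfile


def embedding_row_labels(vocab_path, num_oov_buckets):
    """Return one label per embedding row, in row order.

    Assumes a text vocabulary file with one token per line, where the
    line number is the integer id, followed by `num_oov_buckets` hash
    buckets that take the ids immediately after the vocabulary.
    """
    with open(vocab_path, encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f]
    tokens = [t for t in tokens if t]  # ignore blank lines
    # Hypothetical placeholder labels: a bucket row is shared by every
    # token that hashes into it, so it has no single real token.
    oov_labels = [f"[OOV_BUCKET_{i}]" for i in range(num_oov_buckets)]
    return tokens + oov_labels


# Demo with a throwaway vocabulary file standing in for a TFT output.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "text_vocab")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("the\ncat\nsat\n")
    row_labels = embedding_row_labels(vocab_path, num_oov_buckets=2)

print(row_labels)
```

Building the same list for the old and the new Transform output gives two vocabularies whose positions line up with the rows of the old and new embedding matrices. Reusing the same placeholder labels on both sides would carry the bucket rows over unchanged, which is probably only reasonable if the number of OOV buckets stays the same between runs.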
@rclough_spotify
The tf.keras.utils.warmstart_embedding_matrix function provided by TensorFlow expects an old vocabulary and a new vocabulary. These vocabularies can be in the form of an array/tensor or a text file. The new vocabulary may have different tokens, order, or size compared to the old vocabulary. The order of the tokens is used as the lookup index for the embedding matrix. If you have changing vocabularies during training, you can specify the old and new vocabularies as base_vocabulary and new_vocabulary, respectively.
The base_embeddings parameter represents the currently trained embedding matrix, and new_embeddings_initializer represents the initialization values for the new embeddings corresponding to the new vocabulary. The utility function will return a remapped embedding matrix, which you need to assign to the embedding matrix of your model’s layer, as demonstrated in the guide. Then, you can resume training your model.
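To make the remapping concrete, here is a minimal pure-Python sketch of the behavior described above. It is not the TensorFlow implementation: remap_embeddings and init_row are hypothetical names, and plain lists stand in for the embedding matrix so the example runs without TensorFlow. Rows for tokens present in both vocabularies are copied to their new positions, and rows for new tokens come from an initializer.

```python
import random


def remap_embeddings(base_vocabulary, new_vocabulary, base_embeddings, init_row):
    """Build an embedding matrix ordered like `new_vocabulary`.

    Tokens also present in `base_vocabulary` reuse their trained row;
    tokens that are new get a freshly initialized row from `init_row()`.
    """
    base_index = {token: i for i, token in enumerate(base_vocabulary)}
    remapped = []
    for token in new_vocabulary:
        if token in base_index:
            # Copy the trained row to the token's new position.
            remapped.append(list(base_embeddings[base_index[token]]))
        else:
            # Token was not in the old vocabulary: initialize it.
            remapped.append(init_row())
    return remapped


old_vocab = ["the", "cat", "sat"]
new_vocab = ["cat", "the", "dog"]  # reordered, "sat" dropped, "dog" added
old_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

rng = random.Random(0)
new_embeddings = remap_embeddings(
    old_vocab,
    new_vocab,
    old_embeddings,
    init_row=lambda: [rng.uniform(-0.05, 0.05) for _ in range(2)],
)

print(new_embeddings[0])  # row for "cat", carried over from old index 1
print(new_embeddings[1])  # row for "the", carried over from old index 0
```

In a real model the remapped matrix would then be assigned to the embedding layer's weights before training resumes, as the guide does. Whether tf.keras.utils.warmstart_embedding_matrix is available may depend on your TensorFlow/Keras version, so it is worth checking before relying on it.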