Why does the Universal Sentence Encoder hub model's output vary between calls?

I’m using Google's universal-sentence-encoder model (hosted on Kaggle/TF Hub) to generate sentence embeddings. I notice that if I generate embeddings for an entire dataset in one go, I get slightly different embeddings than if I generate embeddings for that same dataset in chunks. The differences are extremely small, usually only in the least significant digits, but they appear consistently and do not seem to be mere floating-point rounding errors.

As a minimal example:

import numpy as np
import tensorflow_hub as hub

embedding_location = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.load(embedding_location)
# Embed each sentence in a separate call, then stack the results.
np.concatenate([embed(['abc']).numpy(), embed(['def']).numpy(), embed(['ghi']).numpy()])

Generates the following:
[[ 0.01497091, -0.06938398,  0.02648711, ..., -0.0689584 , -0.03245958,  0.03247434],
 [ 0.01686596, -0.05855573,  0.06133697, ..., -0.03324953, -0.01865838,  0.01585944],
 [-0.00416895, -0.01545986,  0.05822454, ..., -0.03863342,  0.08442602, -0.04622617]]

Whereas

# Embed all three sentences in a single call.
embed(['abc','def','ghi']).numpy()

Generates:
[[ 0.01497092, -0.06938398,  0.02648713, ..., -0.0689584 , -0.03245954,  0.03247434],
 [ 0.01686597, -0.05855573,  0.06133697, ..., -0.03324953, -0.01865837,  0.0158594 ],
 [-0.00416889, -0.01545987,  0.05822457, ..., -0.03863347,  0.08442603, -0.04622612]]
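To quantify how far apart the two results are, here is a small sketch (reusing embed from above; the variable names are mine) that compares them element-wise:

one_at_a_time = np.concatenate(
    [embed(['abc']).numpy(), embed(['def']).numpy(), embed(['ghi']).numpy()])
batched = embed(['abc', 'def', 'ghi']).numpy()

# Largest absolute element-wise difference between the two results.
print(np.max(np.abs(one_at_a_time - batched)))
# Close within float32 noise, but not bit-for-bit identical.
print(np.allclose(one_at_a_time, batched, atol=1e-6), np.array_equal(one_at_a_time, batched))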

This appears to be the case even if I explicitly load the model as a non-trainable layer via:

import tensorflow as tf
use_layer = hub.KerasLayer(embedding_location, input_shape=[], dtype=tf.string, trainable=False)
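As a rough sketch (the variable names chunked and batched are my own), running the same chunked-versus-batched comparison through the wrapped layer shows the tiny differences persist:

chunked = np.concatenate(
    [use_layer(tf.constant(['abc'])).numpy(),
     use_layer(tf.constant(['def'])).numpy(),
     use_layer(tf.constant(['ghi'])).numpy()])
batched = use_layer(tf.constant(['abc', 'def', 'ghi'])).numpy()

# Still differs in the last digits, just like the hub.load results above.
print(np.max(np.abs(chunked - batched)))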

Is there an explanation for this behavior? And is there a way to make the two approaches produce identical embeddings?

Thanks!

Thank you for creating the issue. Can you provide details about the environment (CPU/GPU) the model was run on?
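One quick way to gather those details (just a sketch that prints package versions and visible devices) would be:

import platform
import tensorflow as tf
import tensorflow_hub as hub

print("Python:", platform.python_version())
print("TensorFlow:", tf.__version__)
print("tensorflow_hub:", hub.__version__)
print("Visible GPUs:", tf.config.list_physical_devices('GPU'))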