MeanMetricWrapper produces inconsistent results on multiple runs

Rajesh_V · September 29, 2021, 1:57pm

Hi, I have noticed an inconsistency where wrapping the metric and passing pure lambda produce different results some of time. Google Colab
If you run the last cell multiple times, you will see instances where mean_squared_wrapped and mean_squared_error_fn are not equal to each other.
How can we explain this?

Thanks

Bhack · September 30, 2021, 8:18pm

I think you need to fix your colab:

NameError: name 'custom_mean_squared_error' is not defined

Rajesh_V · September 30, 2021, 8:52pm

Thanks for pointing that. I just updated colab with the missing function

def custom_mean_squared_error(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

Bhack · September 30, 2021, 10:15pm

I suppose that you need to use something like:

def mean_squared_error_fn(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

def squared_error_fn(y_true, y_pred):
    return tf.square(y_true - y_pred)
    
mean_squared_wrapped = tf.keras.metrics.MeanMetricWrapper(fn=squared_error_fn, name='mean_squared_wrapped')

Rajesh_V · October 1, 2021, 4:56am

I just tried using squared_error_fn (updated colab as well). It gives the same inconsistent results on the first eval (after compile) sometimes.

Bhack · October 1, 2021, 11:59am

I think that you need to maintain both if you want to compare the wrapped one with with the mean_squared_error_fn.

Bhack · October 1, 2021, 1:12pm

Try to run this:

import numpy as np

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)   

tf.random.set_seed(0)
np.random.seed(0)

def squared_error_fn(y_true, y_pred):
    return tf.square(y_true - y_pred)

def mean_squared_error_fn(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

def squared_error_fn(y_true, y_pred):
    return tf.square(y_true - y_pred)
    
mean_squared_wrapped = tf.keras.metrics.MeanMetricWrapper(fn=squared_error_fn, name='mean_squared_wrapped')

def custom_mean_squared_error(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

def get_compiled_model():
    
    inputs = keras.Input(shape=(784,), name="digits")
    x = layers.Dense(64, activation="relu", name="dense_1")(inputs)
    x = layers.Dense(64, activation="relu", name="dense_2")(x)
    outputs = layers.Dense(10, activation="softmax", name="predictions")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam',
              loss=custom_mean_squared_error,
              metrics=['accuracy', mean_squared_wrapped, mean_squared_error_fn])
    return model

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

x_train = x_train.reshape(60000, 784).astype("float32") / 255
x_test = x_test.reshape(10000, 784).astype("float32") / 255

y_train = y_train.astype("float32")
y_test = y_test.astype("float32")

print("TF version: ",tf.__version__)
compiled_model = get_compiled_model()
one_hot_y_train = tf.one_hot(y_train, depth=10)
print(compiled_model.evaluate(x_train, one_hot_y_train, verbose=2))
print(compiled_model.evaluate(x_train, one_hot_y_train, verbose=2))
print(compiled_model.evaluate(x_train, one_hot_y_train, verbose=2))

Rajesh_V · October 1, 2021, 2:15pm

Thanks for taking a look. So the issue happens only on repeated compiles, not on first compile of model.
So I added a for loop in the new colab so that we can run once and see the mismatch. I am not sure if this is expected.

Pasted the same here

import numpy as np

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)   

tf.random.set_seed(0)
np.random.seed(0)

def squared_error_fn(y_true, y_pred):
    return tf.square(y_true - y_pred)

def mean_squared_error_fn(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))
    
mean_squared_wrapped = tf.keras.metrics.MeanMetricWrapper(fn=squared_error_fn, name='mean_squared_wrapped')

def custom_mean_squared_error(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

def get_compiled_model():
    inputs = keras.Input(shape=(784,), name="digits")
    x = layers.Dense(64, activation="relu", name="dense_1")(inputs)
    x = layers.Dense(64, activation="relu", name="dense_2")(x)
    outputs = layers.Dense(10, activation="softmax", name="predictions")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam',
              loss=custom_mean_squared_error,
              metrics=['accuracy', mean_squared_wrapped, mean_squared_error_fn])
    return model

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

x_train = x_train.reshape(60000, 784).astype("float32") / 255
x_test = x_test.reshape(10000, 784).astype("float32") / 255

y_train = y_train.astype("float32")
y_test = y_test.astype("float32")

print("TF version: ",tf.__version__)

one_hot_y_train = tf.one_hot(y_train, depth=10)

for i in range(3):
    compiled_model = get_compiled_model()
    eval1 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    eval2 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    
    metric_index = 2 # mean_squared_wrapped
    if abs(eval1[metric_index] - eval2[metric_index]) > 1e-5:
        print(f"mismatch found in compile: {i}")
        print("eval1: ", eval1)
        print("eval2: ", eval2)

Bhack · October 1, 2021, 2:54pm

Have you tried to move:
mean_squared_wrapped = tf.keras.metrics.MeanMetricWrapper(fn=squared_error_fn, name='mean_squared_wrapped')

inside get_compiled_model function scope?

Rajesh_V · October 1, 2021, 2:59pm

I just tried that and it fixed the issue. No more mismatch. Thanks.

So what is the conclusion for this issue? that MeanMetricWrapper has side effects?

Bhack · October 1, 2021, 3:05pm

I don’t know if something is cached internally.

/cc @Scott_Zhu What do you think?

Mark_Daoust · October 1, 2021, 4:11pm

that MeanMetricWrapper has side effects?

something is cached internally.

The MeanMetricWrapper does have state (the running mean) is it possible that it’s just not getting reset correctly?

I have bumped into a similar issue with compile editing the metric object, and then multiple compile calls stacking up the modifications. This feels a little similar.

Bhack · October 1, 2021, 4:25pm

I think it works also with:

for i in range(3):
    compiled_model = get_compiled_model()
    eval1 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    eval2 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    compiled_model.reset_metrics()

Rajesh_V · October 1, 2021, 4:39pm

Yes, this works

for i in range(3):
    compiled_model = get_compiled_model()
    eval1 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    eval2 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    compiled_model.reset_metrics()

but this doesn’t

for i in range(3):
    compiled_model = get_compiled_model()
    compiled_model.reset_metrics()
    eval1 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    eval2 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)

Bhack · October 1, 2021, 4:44pm

I suppose that evaluate doesn’t reset at the end of the cycle but just at the beginning:

https://github.com/keras-team/keras/blob/master/keras/engine/training.py#L1531-L1544

Topic		Replies	Views
Keras custom models deteriorates after save and reload General Discussion models , keras , help_request	20	1939	August 19, 2021
Regarding the discrepancy between the loss of test_on_batch () and train_on_batch () after re-compile () General Discussion models , keras , help_request	9	637	September 11, 2021
Unknown metric val_accuracy using Keras Tuner Error General Discussion models , keras , help_request	2	3568	November 23, 2021
Difference between "train_on_batch()" and "test_on_batch()" return values General Discussion api , keras , help_request	12	3571	September 15, 2021
Tensorflow error when working with metrics module General Discussion api , tf-keras	6	1008	April 10, 2023

MeanMetricWrapper produces inconsistent results on multiple runs

Related topics