I believe I’m closer to figuring it out. Here’s a simpler case:

```
import keras
import numpy as np
import tensorflow as tf
def binary_cross_entropy(y_hat, y):
epsilon = 1e-7
y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
bce = y * np.log(y_hat + epsilon)
bce += (1 - y) * np.log(1 - y_hat + epsilon)
return -np.mean(bce)
train_xs = np.ones((1, 1000)) * 10000
train_ys = np.zeros((1, 1))
tf.random.set_seed(2)
keras_model = keras.Sequential([
keras.layers.Dense(1000, activation="linear"),
keras.layers.Dense(1, activation="sigmoid")
])
keras_model.compile(
loss=keras.losses.binary_crossentropy,
metrics=[]
)
y = keras_model.predict(train_xs)
assert y.shape == train_ys.shape
metrics = keras_model.evaluate(train_xs, train_ys, verbose=0, return_dict=True)
print("Outputs:", y)
print(metrics["loss"])
print(keras.losses.binary_crossentropy(train_ys, y))
print(binary_cross_entropy(y, train_ys))
```

So now I’m using non-random, constant inputs, and my target is zero. I also changed the activation in the hidden layer to linear (it also works with ReLU - just not with sigmoid) I’m computing the BCE in three ways:

`model.evaluate()`

`model.predict()`

and Keras’s implementation of BCE
`model.predict()`

and my own implementation of BCE

(2) and (3) match, but (1) gives a very different value. What surprises me is that the loss should depend only on the outputs of the model, but if I change the constant `1000`

at line 12 to 10000 or 100000, the value of (1) continues to increase, even though the output of the model is constant the whole time (equal to `1`

). For example, when the constant is `1000`

:

```
Outputs: [[1.]]
909.5416870117188
tf.Tensor([15.333239], shape=(1,), dtype=float32)
15.33323860168457
```

when it’s 10000:

```
Outputs: [[1.]]
9095.416015625
tf.Tensor([15.333239], shape=(1,), dtype=float32)
15.33323860168457
```

Notice the output of `model.evaluate()`

was scaled by a factor of 10, just like the inputs to the model. This leads me to the following hypothesis. Presumably, inside `model.evaluate()`

, the model’s predictions are computed using a higher precision than what I’m getting by running `model.predict()`

myself. Because the sigmoid exponentially decays towards 1 at infinity, when I scale the inputs by 10, this scales the inputs to the sigmoid by 10 (because I’m using linear or ReLU activations in the hidden layer), which causes the sigmoid’s output to be, basically, 1 - exp(-10t) instead of 1 - exp(-t). Because BCE is logarithmic, this undoes that and we get a loss which is ten times higher. However, when I do this same calculation in (2) or (3), the output of the model is saturated at `1`

and so nothing changes.

This difference might seem trivial, but I think it actually has dramatic consequences. When I train certain networks using `model.fit()`

, which apparently uses the same higher-precision calculations that `model.evaluate()`

is using, I get 90% test accuracy. At the lower precision of the manual `model.predict()`

method (implemented using another library, namely JAX), in the same scenario I get 50% accuracy! I believe this discrepancy is what’s ultimately behind issue #7171 I raised on the JAX issue tracker (sorry, I’m not allowed to post a link).

**What I would like to know is:** how does `model.evaluate()`

compute the model’s predictions and the BCE loss? I tried debugging it, but the source code is completely opaque to me, jumping through layer after layer of wrappers and compilation steps. How can I, for example, compute my model’s predictions in the same way that `model.evaluate()`

does it, and then compute the loss in the same way that `model.evaluate()`

does it?