Hi folks. I have a cryptic bug I could use some help puzzling out, if anyone has time?
I have a custom Metric whose result() returns a dict of tensors.
def result(self):
    metric_result = {}  # dict of tensors to return
    # question_weights is a dict of named tensors, created in __init__
    for weight in self.question_weights.values():
        metric_result[weight.name] = weight
    return metric_result
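For context, question_weights is set up in __init__ roughly like this (simplified sketch; the update_state bookkeeping is omitted and the exact names are illustrative):

    def __init__(self, name, question_index_groups, **kwargs):
        super().__init__(name=name, **kwargs)
        # one running loss variable per question, added via Metric.add_weight
        self.question_weights = {}
        for question_n in range(len(question_index_groups)):
            self.question_weights[question_n] = self.add_weight(
                name=f'questions/question_{question_n}_loss',
                initializer='zeros'
            )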
When (and only when) using distributed training, I get the error:
.../keras/utils/metrics_utils.py", line 177, in merge_fn_wrapper **
return tf.identity(result)
**TypeError: Expected any non-tensor type, but got a tensor instead.**
I’m using custom code, but it:
- Works correctly on one GPU
- Works correctly on multiple GPUs when returning a single tensor (i.e. metric_result[some_key], see the sketch after this list)
It only fails when returning a dict of tensors on multi-GPU.
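To be concrete, a variant like this runs fine under MirroredStrategy (the key is just one of the question names, purely illustrative):

    def result(self):
        metric_result = {}
        for weight in self.question_weights.values():
            metric_result[weight.name] = weight
        # returning any single tensor from the dict works on multi-GPU
        return metric_result['questions/question_0_loss:0']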
A few obvious things I’ve checked:
The Metric is defined within the MirroredStrategy scope context manager:
with context_manager:
    ...
    # be careful to define this within the context_manager, so it is also mirrored if on multi-gpu
    extra_metrics = [
        custom_metrics.LossPerQuestion(
            name='loss_per_question',
            question_index_groups=schema.question_index_groups
        )
    ]
    model.compile(
        loss=loss,
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        metrics=extra_metrics
    )
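For completeness, context_manager comes from the strategy along these lines (simplified; the real device-selection logic is a bit longer):

    gpus = tf.config.list_physical_devices('GPU')
    if len(gpus) > 1:
        strategy = tf.distribute.MirroredStrategy()
    else:
        strategy = tf.distribute.get_strategy()  # default (no-op) strategy on a single device
    context_manager = strategy.scope()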
tf.print shows the dict looking okay by eye:
{'questions/question_0_loss:0': 2.99033308,
'questions/question_1_loss:0': 0.723048568,
'questions/question_2_loss:0': 0.846625209,
...
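Those values come from a debugging print placed just before the return in result():

    tf.print(metric_result)  # temporary, just to inspect the dict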
Any idea what I’m missing here?
Using TF 2.10.0 (latest stable)