When creating large models (a couple of thousand nodes) in graph mode, initializing the metrics can take a very long time. The following toy example takes ~30 seconds on my machine (TF 2.6) just to start training:
import tensorflow as tf
import numpy as np
from tensorflow.python.keras import backend as K

with K.get_session() as sess:
    print("DEF")
    # a deliberately deep model so that graph construction is non-trivial
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(1) for _ in range(500)]
    )
    print("METRICS")
    # 100 separate metric objects
    metrics = [tf.keras.metrics.Accuracy(str(i)) for i in range(100)]
    print("COMPILE")
    model.compile(loss="mse", metrics=metrics, run_eagerly=False)
    x, y = np.zeros((2, 1000), dtype=np.float32)
    print("FIT")
    model.fit(x=x, y=y)
Most of the startup time is spent in this loop initializing the metrics.
In the actual model I am currently investigating, startup takes ~20 minutes, since it is quite a large model with data loading included in the graph and ~400 metrics. The latter comes from having 4 per-class metrics for ~100 classes. This time quadruples when adding another GPU with MirroredStrategy.
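The distributed variant is set up roughly like the following sketch (the real model, the in-graph data loading, and the exact per-class metrics are omitted; Precision/Recall with class_id are just illustrative stand-ins for the 4 metrics I track per class):

import tensorflow as tf

NUM_CLASSES = 100  # ~100 classes in the real model

def per_class_metrics(class_id):
    # Illustrative stand-ins for the 4 metrics tracked per class;
    # one such list per class gives ~400 metric objects in total.
    return [
        tf.keras.metrics.Precision(class_id=class_id, name=f"precision_{class_id}"),
        tf.keras.metrics.Recall(class_id=class_id, name=f"recall_{class_id}"),
        tf.keras.metrics.Precision(class_id=class_id, thresholds=0.7,
                                   name=f"precision_70_{class_id}"),
        tf.keras.metrics.Recall(class_id=class_id, thresholds=0.7,
                                name=f"recall_70_{class_id}"),
    ]

strategy = tf.distribute.MirroredStrategy()  # startup quadruples with a 2nd GPU
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),
    ])
    metrics = [m for i in range(NUM_CLASSES) for m in per_class_metrics(i)]
    model.compile(loss="binary_crossentropy", metrics=metrics)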
What could I do to improve startup time in this case? So far, I've tried:
- Running in eager mode (run_eagerly=True in compile), which works fine on a single GPU, but scaling out is going to be more challenging.
- Creating one metric class for all classes, so that I only need to register 4 metrics. But it doesn't seem to be possible for a metric to return an array of values (see the sketch below).
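For reference, the second attempt looked roughly like this simplified sketch (the name PerClassAccuracy and the segment-sum bookkeeping are only illustrative; the point is that result() would need to return one value per class):

import tensorflow as tf

class PerClassAccuracy(tf.keras.metrics.Metric):
    """Single metric object that tracks accuracy for every class at once."""

    def __init__(self, num_classes, name="per_class_accuracy", **kwargs):
        super().__init__(name=name, **kwargs)
        self.num_classes = num_classes
        self.correct = self.add_weight("correct", shape=(num_classes,),
                                       initializer="zeros")
        self.total = self.add_weight("total", shape=(num_classes,),
                                     initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.reshape(tf.cast(y_true, tf.int32), [-1])
        y_pred = tf.argmax(y_pred, axis=-1, output_type=tf.int32)
        matches = tf.cast(tf.equal(y_true, y_pred), tf.float32)
        # Accumulate per-class counts in two vectors of length num_classes.
        self.correct.assign_add(
            tf.math.unsorted_segment_sum(matches, y_true, self.num_classes))
        self.total.assign_add(
            tf.math.unsorted_segment_sum(tf.ones_like(matches), y_true,
                                         self.num_classes))

    def result(self):
        # This returns a vector of shape (num_classes,), which is exactly
        # what Keras does not seem to accept: it expects a scalar per metric.
        return self.correct / tf.maximum(self.total, 1.0)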