Best way to choose steps_per_execution?

I have a few questions about the steps_per_execution argument in the Keras compile method:

  1. Why should this argument not always be set to a very high number?
  2. What impact does setting steps_per_execution to a high number have on memory, CPU, and device resource utilization?
  3. Are there any concerns about model accuracy when using a very high steps_per_execution, or will models with different steps_per_execution values always converge to the same metrics? (In contrast, very large batch sizes can negatively impact model performance, as discussed in this discussion and paper.)
  4. For distributed strategies such as TPUStrategy, is there any concern about setting a very large steps_per_execution? When do the gradient all-reduces happen across pod devices when using large steps_per_execution values? Does the optimizer.apply_gradients behavior change with large steps_per_execution values?

Hi @river_shah, `steps_per_execution` controls how many batches are run inside each single `tf.function` call (the default is 1). It does not change how many batches are executed per epoch; it only changes how those batches are grouped into compiled calls.

Raising `steps_per_execution` reduces host-side Python overhead, because the host dispatches one compiled call covering many batches instead of one call per batch; this is especially beneficial on TPUs and for small models. The trade-off is granularity: callbacks, the progress bar, and logging only update once per execution, so with a very high value an epoch can appear to stall for long stretches between updates. A practical upper bound is the number of batches per epoch, i.e. `total_samples // batch_size`; beyond that there is nothing left to group, and one execution covers the entire epoch.
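As a sketch, a small helper (the name `choose_steps_per_execution` is my own, not a Keras API) can cap a requested value at the batches-per-epoch upper bound before passing it to `compile`:

```python
# Hypothetical helper: cap steps_per_execution at the number of full
# batches per epoch, the practical upper bound discussed above.
def choose_steps_per_execution(num_samples, batch_size, requested):
    steps_per_epoch = num_samples // batch_size  # full batches per epoch
    return min(requested, steps_per_epoch)

spe = choose_steps_per_execution(num_samples=50_000, batch_size=32,
                                 requested=10_000)
# spe == 1562 here, i.e. one compiled call covers the whole epoch.
# It would then be passed through to compile, e.g.:
# model.compile(optimizer="adam", loss="mse", steps_per_execution=spe)
```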

During training, the input pipeline may buffer the batches needed for one execution, so a larger `steps_per_execution` can increase host memory usage. Device memory is largely unaffected, since the batches are still processed one at a time inside the compiled loop.

Regarding accuracy: `steps_per_execution` does not accumulate gradients. Each batch still gets its own forward pass, backward pass, and optimizer update inside the compiled loop, so the training math is identical for any value and models should converge to the same metrics. This is unlike increasing the batch size, which does change the gradient statistics.
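A pure-Python sketch (no TensorFlow; names are illustrative, not Keras internals) of why grouping batches per execution leaves the update sequence unchanged:

```python
# Sketch: SGD updates applied per batch, with batches grouped into
# "executions" of a given size (analogous to one tf.function call).
def run_training(grads, steps_per_execution):
    w, lr = 0.0, 0.1
    for start in range(0, len(grads), steps_per_execution):
        # One "execution": loop over its grouped batches.
        for g in grads[start:start + steps_per_execution]:
            w -= lr * g  # gradient applied per batch, never accumulated
    return w

batch_grads = [0.5, -0.2, 0.3, 0.1, -0.4, 0.25]
w1 = run_training(batch_grads, steps_per_execution=1)
w4 = run_training(batch_grads, steps_per_execution=4)
assert w1 == w4  # identical weight trajectory regardless of grouping
```

The updates occur in the same order either way, so the final weights match exactly.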

For distribution strategies such as TPUStrategy, the gradient all-reduce across replicas still happens once per batch inside the compiled loop, exactly as with `steps_per_execution=1`, and `optimizer.apply_gradients` is called per batch with the reduced gradients. Its behavior does not change with large `steps_per_execution` values.

Thank You