I have a few questions about the `steps_per_execution` argument in the Keras `compile` method:

- Why should this argument not always be set to a very high number?
- What impact does setting `steps_per_execution` to a high number have on memory, CPU, and device resource utilization?
- Are there any concerns about model accuracy when using a very high `steps_per_execution`, or will models with different `steps_per_execution` values always converge to the same metrics? (In contrast, very large batch sizes can negatively impact model performance, as discussed in this discussion and paper.)
- For distributed strategies such as `TPUStrategy`, is there any concern about setting a very large `steps_per_execution`? When do the gradient all-reduces happen across pod devices when using large `steps_per_execution` values? Does the `optimizer.apply_gradients` behavior change with large `steps_per_execution` values?
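For context, here is a rough pure-Python sketch of my mental model (my own simplification, not the actual Keras internals): I understand `steps_per_execution` to control how many training steps run inside a single host-to-device dispatch, so a larger value means fewer host round trips per epoch. The `run_epoch` function below is hypothetical, written only to illustrate that dispatch pattern.

```python
import math


def run_epoch(total_steps: int, steps_per_execution: int) -> int:
    """Simulate one epoch and return the number of host->device dispatches.

    Each dispatch runs up to ``steps_per_execution`` training steps on the
    device before returning control (and metrics) to the Python host loop.
    """
    dispatches = 0
    steps_done = 0
    while steps_done < total_steps:
        # One dispatch covers a chunk of steps; the host only sees
        # callbacks/metrics again after the whole chunk finishes.
        chunk = min(steps_per_execution, total_steps - steps_done)
        steps_done += chunk
        dispatches += 1
    return dispatches


# For a 1000-step epoch, the number of dispatches is ceil(1000 / s):
assert run_epoch(1000, 1) == 1000   # host overhead on every step
assert run_epoch(1000, 50) == 20    # far fewer round trips
assert run_epoch(1000, 64) == math.ceil(1000 / 64)
```

If this mental model is right, my questions above are essentially about what *else* changes (memory, accuracy, all-reduce timing) when the chunk size grows very large.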