Hi everybody,
I am currently trying to benchmark the inference of models when using ONNX, the TensorFlow C++ API, and Ahead-Of-Time (AOT) compilation.
The benchmark itself uses std::chrono to measure the runtime. To reduce fluctuation I time 500 calls of each network. The inputs are just randomly generated floats, since I'm not interested in the actual predictions.
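For context, the timing loop looks roughly like this (a minimal sketch; `run_inference`, the input size, and the RNG seed are placeholders for the actual ONNX / TensorFlow / AOT call in my code):

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

// Placeholder for a single forward pass; in the real benchmark this calls the
// ONNX Runtime session, the TensorFlow C++ SavedModel, or the AOT-compiled
// function instead of this no-op stub.
void run_inference(const std::vector<float>& input) {
  (void)input;
}

int main() {
  constexpr std::size_t kInputSize = 32;  // e.g. the 32 inputs of the CNN
  constexpr int kNumCalls = 500;          // repeated calls to reduce fluctuation

  // Randomly generated float inputs; the actual predictions don't matter here.
  std::mt19937 rng{42};
  std::uniform_real_distribution<float> dist{0.0f, 1.0f};
  std::vector<float> input(kInputSize);
  for (auto& x : input) x = dist(rng);

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kNumCalls; ++i) {
    run_inference(input);
  }
  const auto end = std::chrono::steady_clock::now();

  const auto total_us =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  std::cout << "mean runtime per call: "
            << static_cast<double>(total_us) / kNumCalls << " us\n";
  return 0;
}
```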
Benchmark for DNN:
The benchmark results for simple feed-forward networks are reasonably comparable; all inference approaches are at least within the same order of magnitude.
Benchmark for CNN:
When running a very simple CNN, the AOT-compiled model takes much longer (roughly a factor of 100).
The CNN is fairly simple: the model takes 32 inputs with 1 channel, the kernel sizes range from 1 to 4, and the follow-up feed-forward network is of moderate size (10 layers and 128 units).
My question is: why does the AOT network perform so badly? Is there a way to prevent this, for example by setting certain XLA flags?
After looking around I found that this is a fairly well-known problem, at least on GPU:
See here.
I have the feeling that AOT tries to be "clever" and maps the convolution kernel in a way that results in a very large number of operations, for example by reserving a buffer for each movement of the kernel window. This would at least explain why I found a very large buffer for the filters in the header of the AOT model (about 10x as large as that of a dense layer).
Thanks a lot for your time.
I appreciate this a lot.
PS: If you are interested, I can also upload the benchmark plots and a picture of the buffers.