I am trying to do autotuning during the graph optimization phase, inspired by "A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers". However, I am having trouble generating a GPU kernel from a single HloInstruction and executing it. Looking at gemm_algorithm_picker.cc, I think it is possible to execute a kernel during the graph optimization phase, but I cannot find a straightforward way to do it.
My questions are:
- Is there a convenient way to generate a GPU kernel from a single HloInstruction?
- Aside from ExecuteKernelOnStream, is there an easier way to run the kernel?
- At what level of abstraction does the stream executor run the kernel?
Thank you!