I needed a faster version of GeLU for my application. The following approximation with p=0.544790 works quite well:
0.5*x*(1.0+x/sqrt(p+x*x))
We can rewrite it in terms of y = 0.5 * x as:
y+y*y/sqrt(0.25*p+y*y)
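which can be implemented efficiently with AVX2: since y*y/sqrt(s) = y*y*rsqrt(s), the square root and the division collapse into a single _mm256_rsqrt_ps (an AVX intrinsic, accurate to roughly 12 bits), optionally refined with one Newton–Raphson step for better accuracy.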
Regards,
GW