Fast GeLU approximation

gwiesenekker · November 16, 2025, 3:10pm

I needed a faster version of GeLU for my application. The following approximation with p=0.544790 works quite well:

0.5*x*(1.0+x/sqrt(p+x*x))

Substituting x = 2*y, i.e. y = 0.5 * x, and pulling the factor 2 out of the square root, we can rewrite it as:

y+y*y/sqrt(0.25*p+y*y)

which can be implemented efficiently with AVX2 using _mm256_rsqrt_ps (optionally refined with one Newton–Raphson step for better accuracy).
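
The y-form could be vectorized roughly as follows. This is only a sketch under stated assumptions, not the poster's actual code: the function name `gelu_fast_avx2`, the array-based interface and the tail handling via a small buffer are all made up for illustration, and the `target("avx2")` attribute assumes GCC or Clang (with other compilers, enable AVX2 through the usual build flags instead).

```c
#include <immintrin.h>
#include <stddef.h>
#include <string.h>

/* Assumption: GCC/Clang, so AVX2 can be enabled per function. */
#if defined(__GNUC__) || defined(__clang__)
#define GELU_TARGET_AVX2 __attribute__((target("avx2")))
#else
#define GELU_TARGET_AVX2
#endif

/* out[i] = y + y*y / sqrt(0.25*p + y*y), with y = 0.5*in[i].
   Hypothetical helper; in and out may point to the same array. */
GELU_TARGET_AVX2
void gelu_fast_avx2(const float *in, float *out, size_t n)
{
    const __m256 half = _mm256_set1_ps(0.5f);
    const __m256 three_halves = _mm256_set1_ps(1.5f);
    const __m256 q = _mm256_set1_ps(0.25f * 0.544790f); /* 0.25*p */

    for (size_t i = 0; i < n; i += 8) {
        size_t m = (n - i < 8) ? (n - i) : 8;
        float buf[8] = {0};
        __m256 x;

        if (m == 8) {
            x = _mm256_loadu_ps(in + i);
        } else {
            /* Partial final chunk: go through a zero-padded buffer. */
            memcpy(buf, in + i, m * sizeof(float));
            x = _mm256_loadu_ps(buf);
        }

        __m256 y = _mm256_mul_ps(half, x);
        __m256 yy = _mm256_mul_ps(y, y);
        __m256 s = _mm256_add_ps(q, yy);   /* strictly positive */
        __m256 r = _mm256_rsqrt_ps(s);     /* approximate 1/sqrt(s) */

        /* One Newton-Raphson step: r = r * (1.5 - 0.5*s*r*r). */
        __m256 t = _mm256_mul_ps(_mm256_mul_ps(half, s),
                                 _mm256_mul_ps(r, r));
        r = _mm256_mul_ps(r, _mm256_sub_ps(three_halves, t));

        __m256 res = _mm256_add_ps(y, _mm256_mul_ps(yy, r));

        if (m == 8) {
            _mm256_storeu_ps(out + i, res);
        } else {
            _mm256_storeu_ps(buf, res);
            memcpy(out + i, buf, m * sizeof(float));
        }
    }
}
```

Dropping the Newton-Raphson lines leaves the raw `_mm256_rsqrt_ps` estimate, which is faster but less accurate. Note that infinite inputs (or inputs large enough for y*y to overflow) produce NaN here, because the reciprocal square root becomes 0 and is multiplied by infinity, so clamp beforehand if such values can occur.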

Regards,
GW

Divya_Sree_Kayyuri · December 9, 2025, 9:30am

Hi @gwiesenekker, you're right. The rewritten form, GeLU(x) ≈ y+y*y/sqrt(0.25*p+y*y) with y = 0.5*x, is well suited to a fast AVX2 implementation using _mm256_rsqrt_ps. Thanks!
