Yes, but a direct `exp` implementation is only like 10-20 FMAs depending on how ...

janwas · on May 16, 2024

With AVX-512 one can have a 128-byte table with one vector of lookups produced each cycle :)

celrod · on May 16, 2024

Yes, I have an AVX-512 double-precision `exp` implementation that does this thanks to `vpermi2pd`. This approach was also recommended by the Intel optimization manual, which is a great resource.
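
For readers who want to see the shape of the table approach, here is a rough scalar sketch in C++. It is not the implementation mentioned above. The name `table_exp`, the 16-entry table of 2^(j/16), and the degree-6 Taylor polynomial are all assumptions chosen for illustration. Sixteen doubles are 128 bytes, so the table fits in two 512-bit registers, and a two-source permute such as `vpermi2pd` can then do the lookup for a whole vector in place of the scalar indexing shown here. Overflow, underflow, and NaN handling are left out.

```cpp
#include <array>
#include <cmath>

// Hypothetical 128-byte table: kExpTable[j] = 2^(j/16) for j = 0..15.
// Built with std::exp2 at startup to keep the sketch short; a real
// implementation would probably hard-code the constants.
static const std::array<double, 16> kExpTable = [] {
    std::array<double, 16> t{};
    for (int j = 0; j < 16; ++j) t[j] = std::exp2(j / 16.0);
    return t;
}();

// Sketch of exp(x) = 2^m * 2^(j/16) * exp(r), with |r| <= ln(2)/32.
// No handling of overflow, underflow, infinities, or NaN.
double table_exp(double x) {
    const double step = std::log(2.0) / 16.0;      // ln(2)/16
    const long k = std::lround(x / step);          // nearest multiple of the step
    const double r = x - static_cast<double>(k) * step;

    // Split k = 16*m + j with j in [0, 15], also for negative k.
    const int j = static_cast<int>(((k % 16) + 16) % 16);
    const int m = static_cast<int>((k - j) / 16);

    // Degree-6 Taylor polynomial for exp(r), evaluated with Horner + FMA.
    double p = 1.0 / 720.0;
    p = std::fma(p, r, 1.0 / 120.0);
    p = std::fma(p, r, 1.0 / 24.0);
    p = std::fma(p, r, 1.0 / 6.0);
    p = std::fma(p, r, 0.5);
    p = std::fma(p, r, 1.0);
    p = std::fma(p, r, 1.0);

    // Table lookup supplies 2^(j/16); ldexp applies the 2^m scale.
    return std::ldexp(kExpTable[j] * p, m);
}
```

The point of the table is that the reduced argument `r` is small enough for a short polynomial, so the lookup trades a few FMAs for one permute.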

I just went with straight math for single-precision, though.
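
As an illustration of what "straight math" can look like in single precision, here is a minimal scalar sketch. It is an assumption about the general technique, not the code referred to above: the name `poly_expf`, the two-constant split of ln(2), and the plain degree-7 Taylor coefficients are illustrative choices (a tuned implementation would more likely use minimax coefficients). It also ignores overflow, underflow, and NaN.

```cpp
#include <cmath>

// Sketch of exp(x) = 2^k * exp(r) with |r| <= ln(2)/2, no table.
// No handling of overflow, underflow, infinities, or NaN.
float poly_expf(float x) {
    const float log2e  = 1.44269504f;      // 1/ln(2)
    const float ln2_hi = 0.693359375f;     // high part of ln(2), exact in float
    const float ln2_lo = -2.12194440e-4f;  // ln(2) - ln2_hi

    // Range reduction: r = x - k*ln(2), done in two FMA steps for accuracy.
    const float k = std::nearbyint(x * log2e);
    float r = std::fma(-k, ln2_hi, x);
    r = std::fma(-k, ln2_lo, r);

    // Degree-7 Taylor polynomial for exp(r), evaluated with Horner + FMA.
    float p = 1.0f / 5040.0f;
    p = std::fma(p, r, 1.0f / 720.0f);
    p = std::fma(p, r, 1.0f / 120.0f);
    p = std::fma(p, r, 1.0f / 24.0f);
    p = std::fma(p, r, 1.0f / 6.0f);
    p = std::fma(p, r, 0.5f);
    p = std::fma(p, r, 1.0f);
    p = std::fma(p, r, 1.0f);

    // Apply the 2^k scale.
    return std::ldexp(p, static_cast<int>(k));
}
```

This version uses nine FMAs in total (two for the reduction, seven for the polynomial), with no memory access beyond the constants.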