Yes, but a direct `exp` implementation is only about 10-20 FMAs, depending on how much accuracy you want. No gathering or permuting will really compete with straight math.
Yes, I have an AVX-512 double-precision exp implementation that does this thanks to `vpermi2pd`.
This approach was also recommended by the Intel optimization manual -- a great resource.
I just went with straight math for single-precision, though.