A LUT could be a significant performance penalty, would it not? Instead of a floa...

danielhanchen · on March 13, 2024

Oh, I was just about to comment as well, but saw this! I think x86 has pshufb for LUTs (I used it ages ago, but have forgotten the details now :(). I think some game (was it Spiderman?) also used loads of lookup tables.

The issue with LUTs is: don't you have to update the LUT itself? You can select which memory address to load from, but maybe the LUT itself has to be differentiable? TBH I'm not an expert on LUTs.

On fixed point: similarly, ye, you have to fix the precision ranges as well, so again I'm unsure how one changes the fixed-point numbers over time. I'll have to read more on fixed point.

Maybe 1.58-bit using (-1, 0, 1), which gets rid of multiplications and leaves just additions, might be more useful, although you'll only get a 2x FLOP boost since you still need fp8 or fp16 addition.
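
To make the (-1, 0, 1) idea concrete, here is a minimal Python sketch of a dot product where every weight is -1, 0, or 1, so each term is an add, a subtract, or a skip. The name ternary_dot is made up for illustration and is not from any library, and plain Python floats stand in for the fp8/fp16 accumulator.

```python
def ternary_dot(weights, activations):
    """Dot product with weights restricted to -1, 0, +1.

    No multiplications are performed: each weight only decides whether
    the activation is added, subtracted, or skipped.
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    acc = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x      # +1 weight: add the activation
        elif w == -1:
            acc -= x      # -1 weight: subtract the activation
        elif w != 0:
            raise ValueError("weights must be -1, 0, or 1")
        # 0 weight: contributes nothing, so it is skipped
    return acc


# Example: 0.5 - 1.5 + 3.0
result = ternary_dot([1, 0, -1, 1], [0.5, 2.0, 1.5, 3.0])
```

The additions are still floating point, which is the part that still needs fp8 or fp16 hardware.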

protomolecule · on March 13, 2024

>I think x86 has like pshufb for LUTs

There is also VPERMI2B [0], which operates on a 128-byte LUT.

[0] https://en.wikichip.org/wiki/x86/avx512_vbmi
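
For anyone who hasn't used these, here is a rough scalar emulation in Python of what the two instructions do per byte, as I understand their semantics. It is a sketch of the lookup behavior only, not the intrinsics themselves, and the function names pshufb_lookup and vpermi2b_lookup are made up for illustration.

```python
def pshufb_lookup(table16, indices):
    """Emulate a 128-bit PSHUFB used as a 16-entry byte LUT.

    For each index byte: if the top bit (0x80) is set the output byte is 0,
    otherwise the low 4 bits select one of the 16 table bytes.
    """
    if len(table16) != 16:
        raise ValueError("PSHUFB table must be exactly 16 bytes")
    return bytes(0 if idx & 0x80 else table16[idx & 0x0F] for idx in indices)


def vpermi2b_lookup(table_lo, table_hi, indices):
    """Emulate 512-bit VPERMI2B used as a 128-entry byte LUT.

    The two 64-byte source registers act as one 128-byte table; the low
    7 bits of each index byte pick the entry and the top bit is ignored.
    """
    if len(table_lo) != 64 or len(table_hi) != 64:
        raise ValueError("each VPERMI2B table half must be exactly 64 bytes")
    table = bytes(table_lo) + bytes(table_hi)
    return bytes(table[idx & 0x7F] for idx in indices)


# 16-entry LUT: number of set bits in each 4-bit value.
NIBBLE_POPCOUNT = bytes([0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4])
small = pshufb_lookup(NIBBLE_POPCOUNT, bytes([0, 3, 15, 0x80]))

# 128-entry LUT built from two 64-byte halves.
big = vpermi2b_lookup(bytes(range(64)), bytes(range(100, 164)),
                      bytes([0, 63, 64, 127]))
```

On real hardware the whole vector of indices is looked up in a single instruction, which is where the speedup comes from.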

danielhanchen · on March 13, 2024

Oh I forgot about that!! But ye LUTs are very interesting and fascinating :) One of the hidden gems of CPU optimizations :)