
At this point, is gguf/llama.cpp a more performant solution for unbatched inference on CUDA devices, or is exllamav2+flashattention still reigning supreme?


The difference is negligible on 2x 4090. There are more important differences, like support for a 4-bit KV cache.
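For context, a quantized KV cache can be enabled in llama.cpp through the llama-cpp-python bindings along these lines. This is a minimal sketch, not a definitive recipe: the type_k/type_v and flash_attn constructor parameters and the GGML_TYPE_Q4_0 constant are assumed from recent versions of those bindings, and the model path is a placeholder.

    import llama_cpp
    from llama_cpp import Llama

    # Load a GGUF model fully offloaded to the GPU.
    # In llama.cpp, a quantized V cache requires flash attention;
    # type_k/type_v pick the cache quantization (here 4-bit).
    llm = Llama(
        model_path="model.Q4_K_M.gguf",          # placeholder path
        n_gpu_layers=-1,                          # offload all layers to CUDA
        n_ctx=8192,
        flash_attn=True,
        type_k=llama_cpp.GGML_TYPE_Q4_0,          # constant name assumed from the bindings
        type_v=llama_cpp.GGML_TYPE_Q4_0,
    )

    out = llm("Explain KV cache quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

The main practical effect is that the 4-bit cache roughly quarters KV memory versus f16, which matters more for long contexts on 24 GB cards than the small raw-speed gap between backends.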



