
At this point, is gguf/llama.cpp a more performant solution for unbatched inference on CUDA devices, or is exllamav2+flashattention still reigning supreme?


The difference is negligible on 2x 4090. There are more important differences, like support for a 4-bit KV cache.
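For context, a quantized KV cache can be enabled in llama.cpp through the llama-cpp-python bindings along these lines. This is a minimal sketch, not a definitive recipe: the type_k/type_v and flash_attn constructor parameters and the GGML_TYPE_Q4_0 constant are assumed from recent versions of those bindings, and the model path is a placeholder.

    import llama_cpp
    from llama_cpp import Llama

    # Load a GGUF model fully offloaded to the GPU.
    # In llama.cpp, a quantized V cache requires flash attention;
    # type_k/type_v pick the cache quantization (here 4-bit).
    llm = Llama(
        model_path="model.Q4_K_M.gguf",          # placeholder path
        n_gpu_layers=-1,                          # offload all layers to CUDA
        n_ctx=8192,
        flash_attn=True,
        type_k=llama_cpp.GGML_TYPE_Q4_0,          # constant name assumed from the bindings
        type_v=llama_cpp.GGML_TYPE_Q4_0,
    )

    out = llm("Explain KV cache quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

The main practical effect is that the 4-bit cache roughly quarters KV memory versus f16, which matters more for long contexts on 24 GB cards than the small raw-speed gap between backends.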



