How fast can one reasonably expect to get inference on a ~70B model?
9 points by yungtriggz on May 24, 2024 | 6 comments
I've been playing around with deploying different large models on various platforms (HF, AWS, etc.) for testing and have been underwhelmed by the inference speeds I've been able to achieve. They're fine (though considerably slower than OpenAI), but nothing like what I'd been led to believe by others who talk about how frighteningly fast their self-hosted models are.

For reference, I get responses in: ~1200ms from gpt-3.5-turbo, ~1600ms from gpt-4o, and ~5000ms from llama-70b-instruct on a dedicated HF endpoint.

I've been using a standard 4x Nvidia A100 setup (320 GB total) for these deployments, so I'm now wondering: am I missing something, or were my expectations just unreasonable? Curious to hear your thoughts, experiences, and tips/tricks, thanks.
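
For context, this is roughly how I'm timing the HF endpoint (a sketch, not my real deployment: the endpoint URL and token are placeholders, and the payload follows the standard TGI-style inputs/parameters format):

    # Minimal latency measurement against a dedicated HF endpoint.
    # URL and token below are placeholders.
    import time
    import requests

    HF_ENDPOINT = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
    HF_TOKEN = "hf_..."  # placeholder

    def timed_generate(prompt: str) -> float:
        """Send one request and return wall-clock seconds for the full response."""
        start = time.perf_counter()
        resp = requests.post(
            HF_ENDPOINT,
            headers={"Authorization": f"Bearer {HF_TOKEN}"},
            json={"inputs": prompt, "parameters": {"max_new_tokens": 128}},
            timeout=60,
        )
        resp.raise_for_status()
        return time.perf_counter() - start

    print(f"latency: {timed_generate('Hello, world') * 1000:.0f} ms")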



You can try Groq API for faster inference. They use custom hardware to speed up the inference. Supported open models can be found here: https://console.groq.com/docs/models (includes llama-70b)
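A rough sketch of what a call looks like through their OpenAI-compatible endpoint; the base URL and model id ("llama3-70b-8192") are from their docs at the time and may have changed, so check the models page above:

    # Groq exposes an OpenAI-compatible API, so the standard openai client works.
    # base_url and model id are assumptions from their docs; key is a placeholder.
    from openai import OpenAI

    client = OpenAI(
        api_key="gsk_...",  # placeholder Groq API key
        base_url="https://api.groq.com/openai/v1",
    )

    response = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)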


thanks, tried this with mixed results. seems like they have caps on speed/rate limits etc. if you haven't spoken to them, so might reach out


We are getting a forward-pass time of ~100ms on Meta's original Llama 2 70B (float16, batch size 8) PyTorch implementation on 8x A100. Those results are very underwhelming in terms of fully utilizing the GPU FLOPs. If we are doing something wrong, let me know.
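
For reference, we measure it roughly like this (a sketch; it assumes `model` and a batch of token tensors are already loaded on the GPUs, and synchronizes so we aren't just timing kernel launches):

    # Time one forward pass, averaged over several iterations after warmup.
    import time
    import torch

    @torch.inference_mode()
    def time_forward(model, tokens, warmup=3, iters=10):
        for _ in range(warmup):
            model(tokens)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(tokens)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters  # seconds per forward pass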

The vLLM implementation is much faster, I think 50ms or better on either 4 or 8 A100s; I forget the exact number.
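
Serving it with vLLM is only a few lines. A sketch of offline batched inference; the model id is just an example and tensor_parallel_size should match however many GPUs you're splitting across:

    # Offline batched inference with vLLM; tensor_parallel_size=4 assumes 4 A100s.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=4)
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Explain continuous batching briefly."], params)
    for out in outputs:
        print(out.outputs[0].text)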


yea reckon vllm is the way to go. cheers


TensorRT-LLM with Triton Inference Server is the fastest in Nvidia land.

https://github.com/triton-inference-server/tensorrtllm_backe...
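
Once the TRT-LLM engine is built and loaded into Triton, querying it is a plain HTTP call. A sketch using the generate endpoint; the model name ("ensemble") and the text_input/text_output field names follow the backend's example configs and may differ in your deployment:

    # Query a TensorRT-LLM model served by Triton via the generate endpoint.
    # Model name and field names are assumptions based on the example ensemble.
    import requests

    resp = requests.post(
        "http://localhost:8000/v2/models/ensemble/generate",
        json={"text_input": "What is speculative decoding?", "max_tokens": 64},
    )
    print(resp.json()["text_output"])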


Dumb q: have you profiled the inference execution? Where are the bottlenecks you're observing?
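
Even a quick pass with torch.profiler usually shows whether you're compute-bound, memory-bound, or stuck on host-side overhead. A sketch, assuming a HF-style `model` and `tokenizer` are already loaded on the GPU:

    # Profile one generation step and print the ops sorted by CUDA time.
    import torch
    from torch.profiler import profile, ProfilerActivity

    inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.inference_mode():
            model.generate(**inputs, max_new_tokens=32)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))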



