Hacker News

You all seem like one of the only companies targeting low-latency inference rather than focusing on throughput (and thus $/inference) - what do you see as your primary market?


Yes, because we're one of the only companies whose hardware can actually support low latency! Everyone else is stuck with traditional designs and they try to make up for their high latency by batching to get higher throughput. But not all applications work with high throughput/high latency ... Low latency unlocks feeding the result of one model into the input of another model. Check out this conversational AI demo on CNN. You can't do that kind of thing unless you have low latency.

https://www.youtube.com/watch?v=pRUddK6sxDg&t=235s
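The batching trade-off described above can be sketched with a toy model (illustrative numbers only, not measurements of any real accelerator): a batched server amortizes its fixed per-step overhead across the batch, so throughput rises with batch size, but every request now rides on a bigger, slower step, so latency rises too.

```python
def step_stats(batch_size, overhead_s=0.020, per_item_s=0.001):
    """Latency and throughput of one fill-then-run inference step.

    overhead_s and per_item_s are made-up illustrative constants:
    a fixed cost per forward pass plus a marginal cost per request.
    """
    step_time = overhead_s + per_item_s * batch_size  # one forward pass
    throughput = batch_size / step_time               # requests per second
    return step_time, throughput

for b in (1, 8, 32):
    latency, tput = step_stats(b)
    print(f"batch={b:2d}  latency={latency * 1000:.0f} ms  throughput={tput:.0f} req/s")
```

With these numbers, batch 32 delivers over 10x the throughput of batch 1 but more than doubles the per-request latency, which is the trade-off that makes chaining models painful on high-latency hardware.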


Might be a bit out of context, but isn't the TPU also optimized for low latency inference? (Judging by reading the original TPU architecture paper here - https://arxiv.org/abs/1704.04760). If so, does Groq actually provide hardware support for LLM inference?


Jonathan Ross, an author on that paper, is Groq's founder and CEO. Groq's LPU is a natural continuation of the breakthrough ideas he had while designing Google's TPU.

Could you clarify your question about hardware support? Currently we build out our hardware to support our cloud offering, and we sell systems to enterprise customers.


Thanks for the quick reply! About hardware support, I was wondering if the LPU has a hardware instruction to compute the attention matrix similar to the MatrixMultiply/Convolve instruction in the TPU ISA. (Maybe a hardware instruction which fuses a softmax on the matmul epilogue?)
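For reference, the hypothetical fused op asked about here would compute scaled dot-product attention: a matmul producing the score matrix, a softmax applied as its epilogue, then a second matmul against V. A plain NumPy sketch of the math (not any hardware ISA):

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q @ k.T / sqrt(d)) @ v.

    A fused instruction would run the matmul and the softmax epilogue
    in one pass over the score matrix; here the stages are explicit.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # matmul stage
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax epilogue
    return weights @ v
```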


We don't have a hardware instruction but we do have some patented technology around using a matrix engine to efficiently calculate other linear algebra operations such as convolution.
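The patented details aren't public, but the general technique of lowering convolution onto a matrix engine is well known (often called im2col): gather the sliding windows of the input into a matrix, then one matmul with the kernel does the whole convolution. A minimal 1-D NumPy sketch:

```python
import numpy as np

def conv1d_via_matmul(x, w):
    """Lower a 1-D 'valid' convolution onto a matrix multiply (im2col):
    stack the sliding windows of x into a matrix, then multiply by w."""
    k = len(w)
    n_out = len(x) - k + 1
    windows = np.stack([x[i:i + k] for i in range(n_out)])  # shape (n_out, k)
    return windows @ w  # one matmul replaces the convolution loop
```

Note this computes cross-correlation (the common deep-learning convention); flip the kernel to match the signal-processing definition.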


Are you considering targeting the consumer market? There are a lot of people throwing $2k-$4k into local setups, and they primarily care about inference.


At the moment we're concentrating on building out our API and serving the enterprise market.



