1. How many GroqCards* are you using to run the demo?
2. Is there a newer version you're using which has more SRAM? The one I see online only has 230 MB, and that seems to be the number that would drive down your cost (by letting you take advantage of batch processing, CMIIW!).
3. Can TTS pipelines be integrated with your stack? If so, we could truly have very low-latency calls!

*Assuming you're using this: https://www.bittware.com/products/groq/
1. I think our GroqChat demo is using 568 GroqChips. I'm not sure of the exact number, but it's about that (there's a rough back-of-envelope below on why it takes that many).
2. We're working on our second-generation chip. I don't know exactly how much SRAM it has, but we don't need to increase the SRAM to get efficient scaling. Our system is deterministic, which means there's no waiting or queuing anywhere, and we can have a very low-latency interconnect between cards.
3. Yeah absolutely, see this video of a live demo on CNN!
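Rough back-of-envelope on points 1 and 2, for anyone curious how the chip count and the 230 MB of SRAM per chip relate: if you assume a ~70B-parameter model with FP16 weights held entirely in on-chip SRAM (both assumptions for illustration only, not a statement of what the demo actually runs), the arithmetic lands in the same ballpark as the ~568 chips mentioned above.

```python
# Back-of-envelope only: every number here is an assumption for illustration,
# not Groq's actual configuration.
SRAM_PER_CHIP_GB = 0.230   # 230 MB of SRAM per GroqChip (per the BittWare card page)
PARAMS = 70e9              # assumed ~70B-parameter model
BYTES_PER_PARAM = 2        # assumed FP16/BF16 weights; INT8 would halve this

weight_gb = PARAMS * BYTES_PER_PARAM / 1e9
chips_for_weights = weight_gb / SRAM_PER_CHIP_GB

print(f"weights ~ {weight_gb:.0f} GB -> ~ {chips_for_weights:.0f} chips just to hold them")
# ~140 GB of weights -> ~609 chips, before counting KV cache or activations,
# which is the same order of magnitude as the ~568 chips mentioned above.
```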
Follow-up (noob) question: Are you using a KV cache? That would significantly increase your memory requirements. Or are you forwarding the whole prompt on each auto-regressive pass?
You're welcome! Yes, we have a KV cache. Being able to implement it efficiently, in terms of both hardware requirements and compute time, is one of the benefits of our deterministic chip architecture (and deterministic system architecture).
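For anyone who wants to see concretely what the KV cache changes: with a cache, each autoregressive step computes Q, K, V only for the newest token and reuses the stored K/V of everything before it, instead of re-forwarding the whole prompt. A minimal single-head sketch in plain NumPy, generic and not Groq-specific (the weights here are just random placeholders):

```python
import numpy as np

# Minimal single-head attention decode loop with a KV cache (toy sketch).
d = 8                                  # head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []              # grows by one entry per generated token

def decode_step(x):                    # x: embedding of the newest token, shape (d,)
    q = x @ Wq
    k_cache.append(x @ Wk)             # only the new token's K/V are computed...
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)              # ...past K/V are reused from the cache
    V = np.stack(v_cache)
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V                    # attention output for this step

for t in range(5):                     # stand-in for an autoregressive loop
    out = decode_step(rng.standard_normal(d))
print(out.shape, "cache length:", len(k_cache))
```

The memory requirement the question mentions is also visible here: the cache grows by one K and one V vector per generated token, per head, per layer.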
I think it's currently batch size 1. Unlike graphics processors, which really need data parallelism to get good throughput, our LPU architecture lets us deliver good throughput even at batch size 1.
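A quick way to see why GPUs lean on batching for decode: each autoregressive step has to stream essentially all of the weights from HBM, so on a memory-bandwidth-bound decoder the tokens/s ceiling grows with batch size. Toy roofline-style arithmetic with assumed numbers (model size, aggregate bandwidth), not a benchmark of any real system:

```python
# Why batch-1 decode is hard on HBM-backed hardware: every decode step reads
# all the weights, so throughput is roughly bandwidth-bound until the batch
# gets large. All numbers are assumptions for illustration.
WEIGHT_GB = 140          # assumed FP16 weights of a ~70B model
HBM_GB_PER_S = 2000      # assumed aggregate HBM bandwidth across the GPUs holding the weights

for batch in (1, 8, 64):
    steps_per_s = HBM_GB_PER_S / WEIGHT_GB     # one full weight read per decode step
    tokens_per_s = batch * steps_per_s         # every sequence in the batch yields a token
    print(f"batch {batch:>2}: ~ {tokens_per_s:,.0f} tokens/s (bandwidth bound)")
# Per-user speed stays roughly the same, but aggregate throughput only grows
# with batch size -- which is why GPUs batch. Keeping weights in SRAM sidesteps
# the repeated HBM read, which is the claim being made for batch-size-1 here.
```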
Yeah. And it's a real shame, because even before LLMs got big I was thinking that a couple of generations down the line, Coral would be great for some home automation/edge AI stuff.
Fortunately, LLMs and the hard work of clever peeps running 'em on commodity hardware are starting to make this possible anyway.
Because Google Home/Assistant just seems to keep getting dumber and dumber...
You can find out about the chip-to-chip interconnect from our paper below, section 2.3. I don't think that's custom.
We achieve low latency basically by being a software-defined architecture. Our functional units operate completely orthogonally to each other, we don't have to batch in order to achieve parallelism, and the system behaviour is completely deterministic, so we can schedule all operations precisely.
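To make "deterministic, so we can schedule all operations precisely" a little more concrete, here's a toy sketch of what compile-time scheduling looks like in general: when every op's latency is known up front, a compiler can assign exact start cycles and execution just replays that schedule, with no runtime queues or arbitration. The functional-unit names and latencies below are made up and are not the real GroqChip pipeline:

```python
# Toy illustration of software-defined, deterministic scheduling. Op latencies
# are known at compile time, so exact start cycles are fixed ahead of time.
OPS = [
    # (name, functional_unit, latency_cycles, depends_on)
    ("load_w",  "mem",   4, []),
    ("load_x",  "mem",   4, []),
    ("matmul",  "mxm",  10, ["load_w", "load_x"]),
    ("act",     "valu",  2, ["matmul"]),
    ("store",   "mem",   4, ["act"]),
]

finish = {}        # op -> cycle it completes
unit_free = {}     # functional unit -> first free cycle
schedule = []

for name, unit, lat, deps in OPS:                      # deps appear earlier in OPS
    ready = max((finish[d] for d in deps), default=0)  # all inputs produced
    start = max(ready, unit_free.get(unit, 0))         # unit is free, no arbitration needed
    finish[name] = start + lat
    unit_free[unit] = start + lat
    schedule.append((start, start + lat, unit, name))

for start, end, unit, name in schedule:
    print(f"cycle {start:>2}-{end:<2} {unit:<5} {name}")
# Because every start cycle is fixed at compile time, end-to-end latency is
# identical on every run -- the property that removes the need for
# batching-driven queueing.
```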