> 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.
I believe that this is doable - my pipeline is generally closer to 400ms without RAG and with Mixtral, with a lot of non-ML hacks to get there. It would also definitely be doable with a joint speech-language model that removes the transcription step.
For these use cases, time to first byte is the most important metric, not total throughput.
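To make the time-to-first-byte point concrete, here's a minimal sketch (not the commenter's actual pipeline) of the kind of non-ML hack that helps: stream the LLM's output and hand the first complete clause to TTS immediately, instead of waiting for the full response. `fake_llm_stream` and `fake_tts` are hypothetical stand-ins for the real model and TTS calls.

```python
import re
import time

def fake_llm_stream():
    # Stand-in for a streaming LLM response (e.g. Mixtral behind an API).
    for tok in "Sure. The meeting is at 3pm, and I sent the invite.".split():
        time.sleep(0.02)          # simulated per-token latency
        yield tok + " "

def fake_tts(text: str) -> None:
    # Stand-in for a streaming TTS call; a real pipeline would emit audio here.
    print(f"[TTS @ {time.perf_counter() - t0:.3f}s] {text.strip()}")

t0 = time.perf_counter()
buffer = ""
for token in fake_llm_stream():
    buffer += token
    # Flush at the first clause boundary so the user hears audio
    # long before the full answer has finished generating.
    if re.search(r"[.!?,]\s$", buffer):
        fake_tts(buffer)
        buffer = ""
if buffer:
    fake_tts(buffer)
```

The point of the sketch is that the first audio byte goes out after a handful of tokens, so perceived latency is governed by time to the first clause, not by total generation time.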
Obviously not OP, but these days LLMs can act as fuzzy functions with reliably structured output, and they're multi-modal.
Think about the implications of that. I bet you can come up with some pretty cool use cases that don't involve you talking to something over chat.
One example:
I think we'll be seeing a lot of "general detectors" soon. Without training or predefined categories, get pinged when (whatever you specify) happens - whether the input is a security camera feed, web search results, event data, etc.
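A hedged sketch of what a "general detector" could look like: no training, no fixed label set, just a user-supplied condition and structured LLM output deciding whether to ping. `call_llm` is a hypothetical stand-in for whatever model or provider you'd actually use (a vision-capable model for camera frames, a text model for search results or event streams).

```python
import json

def call_llm(prompt: str) -> str:
    # Stand-in: a real implementation would call your LLM provider and
    # request JSON output. Canned reply so the sketch runs as-is.
    return '{"matched": true, "reason": "A delivery truck is parked in the driveway."}'

def detect(condition: str, observation: str) -> dict:
    prompt = (
        "You are a detector. Condition to watch for:\n"
        f"{condition}\n\n"
        f"Observation:\n{observation}\n\n"
        'Reply with JSON only: {"matched": bool, "reason": str}'
    )
    return json.loads(call_llm(prompt))

result = detect(
    condition="Ping me when a delivery vehicle shows up",
    observation="Frame caption: white box truck stopped at the end of the driveway",
)
if result["matched"]:
    print("PING:", result["reason"])
```

The condition is plain language, so the same loop works for a doorbell camera, a news feed, or log events without retraining anything.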