Yes, I was planning to do this back then, but other stuff came up.
There are many different ways in which this simple example can be improved:
- better detection of when speech ends (currently a basic adaptive threshold; see the first sketch after this list)
- use a small LLM for a quick generic response while the big LLM computes (second sketch below)
- TTS streaming in chunks or sentences (third sketch below)
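For the first item, here's a minimal sketch of the kind of adaptive threshold I mean (the class, parameter names and values are all made up for illustration): track the noise floor with an exponential moving average on quiet frames, and declare end of speech after enough consecutive frames near the floor.

```python
import numpy as np

class AdaptiveEndpointer:
    """Sketch of an adaptive-threshold end-of-speech detector.
    Feed it fixed-size audio frames; feed() returns True once the
    speaker seems to have stopped."""

    def __init__(self, margin=2.5, alpha=0.05, hang_frames=30):
        self.margin = margin            # speech must exceed the noise floor by this factor
        self.alpha = alpha              # smoothing for the noise-floor estimate
        self.hang_frames = hang_frames  # consecutive quiet frames that end the utterance
        self.noise_floor = None
        self.quiet = 0
        self.speaking = False

    def feed(self, frame: np.ndarray) -> bool:
        rms = float(np.sqrt(np.mean(frame.astype(np.float32) ** 2))) + 1e-9
        if self.noise_floor is None:
            self.noise_floor = rms
        if rms > self.noise_floor * self.margin:
            self.speaking = True
            self.quiet = 0
        else:
            # adapt the floor only on quiet frames so speech doesn't inflate it
            self.noise_floor = (1 - self.alpha) * self.noise_floor + self.alpha * rms
            self.quiet += 1
        return self.speaking and self.quiet >= self.hang_frames
```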
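For the second item, the idea is just to race two models: the small one produces a generic filler ("Hmm, let me think about that...") that covers the big model's latency. A sketch with asyncio, where `small_llm`, `big_llm` and `speak` stand in for whatever models and TTS you actually use:

```python
import asyncio

async def respond(prompt, small_llm, big_llm, speak):
    # start both generations concurrently
    filler_task = asyncio.create_task(small_llm(prompt))  # fast, generic
    answer_task = asyncio.create_task(big_llm(prompt))    # slow, good

    # speak the filler while the big model is still computing
    await speak(await filler_task)
    await speak(await answer_task)
```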
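And for the third item, a sketch of sentence-level chunking: buffer the streamed tokens and hand each sentence to the TTS as soon as it's complete, so speech starts before generation finishes.

```python
import re

# split on ., ! or ? followed by whitespace; good enough for a demo
SENTENCE_END = re.compile(r"([.!?])\s+")

def sentence_chunks(token_stream):
    """Group a stream of text pieces into sentences."""
    buf = ""
    for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[: m.end(1)].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever is left at the end
```

The TTS loop is then just `for sentence in sentence_chunks(llm_stream): tts.speak(sentence)`.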
One of the better OSS versions of such a chatbot is, I think, https://github.com/yacineMTB/talk.
Though many similar projects probably exist by now.
I keep wondering if a small LLM could also help detect when the speaker has finished expressing their thought, not just when they've paused speaking.
Silence-based detection only works when the speaker already knows exactly what they're going to say. A human listener can tell when you're pausing to think but are still in the middle of expressing a thought. A VAD can't: it would interrupt after N seconds of silence, whereas a lightweight LLM could look at the words so far and know to keep waiting despite the silence.
And the inverse: the VAD would wait longer than necessary after a person says e.g. "What do you think?", in case they were still in the middle of talking.
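Roughly what that could look like as a sketch (`small_llm` is a stand-in for whatever fast completion function you'd use): combine the VAD's measured silence with the small model's judgment of whether the transcript reads like a finished thought, which makes the effective timeout asymmetric.

```python
def should_respond(transcript, silence_s, small_llm):
    """Decide whether the bot should take the turn."""
    if silence_s < 0.3:   # speaker barely paused, don't even ask
        return False
    if silence_s > 3.0:   # long silence, respond regardless
        return True
    prompt = (
        "The user said:\n"
        f'"{transcript}"\n'
        "Have they finished their thought, or are they likely to continue? "
        "Answer with exactly one word: FINISHED or CONTINUE."
    )
    return small_llm(prompt).strip().upper().startswith("FINISHED")
```

So "What do you think?" gets a reply after ~0.3 s of silence, while "I think the problem is..." keeps the floor for the full 3 s.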