It uses three signals as input: silence interval, speech confidence, and audio level.
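As a rough illustration of how those three signals might fit together, here is a minimal sketch. The names (`EndpointSignals`, `likely_end_of_phrase`) and the threshold values are hypothetical, not the actual API of this code:

```python
from dataclasses import dataclass


@dataclass
class EndpointSignals:
    """One frame's worth of the three endpointing inputs (hypothetical structure)."""
    silence_ms: float            # non-speech time since the last speech frame
    speech_confidence: float     # 0.0-1.0, from the VAD or another model
    level_above_noise_db: float  # audio level relative to the background noise floor


def likely_end_of_phrase(s: EndpointSignals) -> bool:
    # Toy thresholds: enough non-speech time, low confidence that anyone is
    # still speaking, and the signal has dropped back toward the noise floor.
    return (s.silence_ms >= 600
            and s.speech_confidence <= 0.3
            and s.level_above_noise_db <= 6.0)
```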
Silence isn't literally silence -- or shouldn't be. Here, silence means "non-speech" time as reported by a voice activity detection model. Any VAD library can be plugged into this code; most people use Silero VAD.
Speech confidence can also come from either the VAD or another model (for example, a model providing transcription, or an LLM doing native audio input).
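A single Silero VAD pass can supply both of these signals: the per-chunk speech probability serves as the confidence, and accumulating low-confidence time gives the silence interval. A minimal sketch, assuming the torch.hub distribution of Silero VAD and 16 kHz audio (the `VadSignals` helper is hypothetical):

```python
import torch

# Load Silero VAD via torch.hub (one common way to get it).
model, _ = torch.hub.load('snakers4/silero-vad', 'silero_vad')

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 512  # recent Silero VAD releases expect 512-sample chunks at 16 kHz


class VadSignals:
    """Derive speech confidence and accumulated non-speech time from the VAD."""

    def __init__(self, speech_threshold: float = 0.5):
        self.speech_threshold = speech_threshold
        self.silence_ms = 0.0

    def process_chunk(self, chunk: torch.Tensor) -> tuple[float, float]:
        # Silero returns the probability that this chunk contains speech.
        confidence = model(chunk, SAMPLE_RATE).item()
        if confidence < self.speech_threshold:
            # Count low-confidence chunks as "non-speech" (silence) time.
            self.silence_ms += 1000.0 * CHUNK_SAMPLES / SAMPLE_RATE
        else:
            # Any speech resets the silence interval.
            self.silence_ms = 0.0
        return confidence, self.silence_ms
```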
Audio level should be relative to background noise, as in this code. The VAD model should actually be pretty good at factoring out non-speech background noise, so the utility here is mostly speaker isolation: you want to trigger the end-of-speech decision on the loudest of any simultaneous voices. (There are, of course, specialized models just for speaker isolation. The commercial ones from Krisp are quite good.)
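One simple way to get a level-relative-to-noise signal is to compare each chunk's RMS level against a slowly adapting noise-floor estimate. This is an illustrative sketch of that idea, not the actual implementation in this code; the class name and constants are made up:

```python
import numpy as np


class RelativeLevelMeter:
    """Tracks audio level relative to a slowly adapting noise floor, in dB."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha           # EMA rate for the noise-floor estimate
        self.noise_floor_db = -60.0  # start with a pessimistically quiet floor

    def update(self, samples: np.ndarray) -> float:
        # RMS level of this chunk, in dBFS (samples assumed to be in [-1.0, 1.0]).
        rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2)) + 1e-10
        level_db = 20.0 * np.log10(rms)
        # Track quiet chunks immediately, but let the floor creep upward only
        # slowly, so sustained speech doesn't inflate the noise estimate.
        if level_db < self.noise_floor_db:
            self.noise_floor_db = level_db
        else:
            self.noise_floor_db += self.alpha * (level_db - self.noise_floor_db)
        return level_db - self.noise_floor_db  # dB above background noise
```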
One interesting thing about processing audio for AI phrase endpointing is that you don't actually care about human legibility, so in theory you don't need traditional background noise reduction. In practice, though, current transcription and speech models are trained largely on audio that was recorded for humans to listen to, so there's a lot of overlap!