More

kwindla · 2025-03-07T17:48:16 1741369696

I've talked about this a lot with friends.

Endpoint detection (and phrase endpointing, and end of utterance) are terms from the academic literature about this, and related, problems.

Very few people who are doing "AI Engineering" or even "Machine Learning" today know these terms. In the past, I argued that we should use the existing academic language rather than invent new terms.

But then OpenAI released the Realtime API and called this "turn detection" in their docs. And that was that. It no longer made sense to use any other verbiage.

mncharity · 2025-03-07T23:43:37 1741391017

Re SEO, I note "utterance" only occurs once, in a perhaps-ephemeral "Things to do" description.

To help with "what is?" and SEO, perhaps something like "Turn detection (aka [...], end of utterance)"... ?

lelag · 2025-03-07T18:33:04 1741372384

Thank for the explanation. I guess it makes some sense, considering many people with no nlp background are using those models now…

kwindla · 2025-03-07T05:12:48 1741324368

A couple of interesting updates today:

- 100ms inference using CoreML: https://x.com/maxxrubin_/status/1897864136698347857

- An LSTM model (1/7th the size) trained on a subset of the data: https://github.com/pipecat-ai/smart-turn/issues/1

kwindla · 2025-03-07T04:46:25 1741322785

It takes about 45 minutes to do the current training run on an L4 GPU with these settings:

    # Training parameters
    "learning_rate": 5e-5,
    "num_epochs": 10,
    "train_batch_size": 12,
    "eval_batch_size": 32,
    "warmup_ratio": 0.2,
    "weight_decay": 0.05,

    # Evaluation parameters
    "eval_steps": 50,
    "save_steps": 50,
    "logging_steps": 5,

    # Model architecture parameters
    "num_frozen_layers": 20

I haven't seen a run do all 10 epochs, recently. There's usually an early stop after about 4 epochs.

The current data set size is ~8,000 samples.

kwindla · 2025-03-07T03:25:39 1741317939

Turn detection is deciding when a person has finished talking and expects the other party in a conversation to respond. In this case, the other party in the conversation is an LLM!

remram · 2025-03-07T03:31:17 1741318277

Oh I see. Not like segmenting a conversation where people speak in turn. Thanks.

password4321 · 2025-03-07T21:28:08 1741382888

Speaker diarization is also still a tough problem for free models.

whiddershins · 2025-03-07T17:04:42 1741367082

huh. how is analyzing conversations in the manner you described NOT the way to train such a model?

remram · 2025-03-07T18:13:18 1741371198

Did you reply to the wrong comment? No one is taking about training here.

kwindla · 2025-03-07T01:44:52 1741311892

Can you say more? There's not much open source work in this domain, that I've been able to find.

I'm particularly interested in architecture variations, approaches to the classification head design and loss function, etc.

kwindla · 2025-03-07T01:44:27 1741311867

580M parameters. More info about the model architecture: https://github.com/pipecat-ai/smart-turn?tab=readme-ov-file#...

cyberbiosecure · 2025-03-07T05:15:06 1741324506

580m, awesome, incredible

meltyness · 2025-03-07T15:45:18 1741362318

... but will the model learn when to interrupt you out of frustration with your ongoing statements, and start shouting?

it seems like for the obvious use-cases there might need to be some sort of limit on how much this component knows

kwindla · on Dec 11, 2024

The Multimodal Live API is free while the model/API is in preview. My guess is that they will be pretty aggressive with pricing when it's in GA, given the 1.5 Flash multimodal pricing.

If you're interested in this stuff, here's a full chat app for the new Gemini 2 API's with text, audio, image, camera video and screen video. This shows how to use both the WebSocket API and to route through WebRTC infrastructure.

https://github.com/pipecat-ai/gemini-multimodal-live-demo

dandiep · on Dec 11, 2024

Thanks, this is great!

kwindla · on Nov 22, 2024

We've helped a number of Pipecat users hook into a variety of content moderation systems or use LLMs as judges.

The most common approach is to use a `ParallelPipeline` to evaluate the output of the LLM at the same time as the TTS inference is running, then to cancel the output and call a function if a moderation condition is triggered.

Other people have written custom frame processors to make use of the content moderation scoring in the Google and Azure APIs.

If you're interested in building a Pipecat integration for your employer's tech, happy to support that. Feel free to DM me on Twitter.

mooreds · on Nov 23, 2024

Awesome, thanks! Will pass this along.

kwindla · on Oct 2, 2024

There's a really nice implementation of phrase endpointing here:

  https://github.com/pipecat-ai/pipecat/blob/d378e699d23029e8ca7cea7fb675577becd5ebfb/src/pipecat/vad/vad_analyzer.py

It uses three signals as input: silence interval, speech confidence, and audio level.

Silence isn't literally silence -- or shouldn't be. Any "voice activity detection" library can be plugged into this code. Most people use Silero VAD. Silence is "non-speech" time.

Speech confidence also can come from either the VAD or another model (like a model providing transcription, or an LLM doing native audio input).

Audio level should be relative to background noise, as in this code. The VAD model should actually be pretty good at factoring out non-speech background noise, so the utility here is mostly speaker isolation. You want to trigger on speech end from the loudest of the simultaneous voices. (There are, of course, specialized models just for speaker isolation. The commercial ones from Krisp are quite good.)

One interesting thing about processing audio for AI phrase endpointing is that you don't actually care about human legibility. So you don't need traditional background noise reduction, in theory. Though, in practice, the way current transcription and speech models are trained, there's a lot of overlap with audio that has been recorded for humans to listen to!

com2kid · on Oct 2, 2024

> There's a really nice implementation of phrase endpointing here:

VAD doesn't get you enough accuracy at this level. Confidence is the key bit, how that is done is what makes the experience magic!

kwindla · on Oct 1, 2024

If you're interested in low-latency, multi-modal AI, Tavus is sponsoring a hackathon Oct 19th-20th in SF. (I'm helping to organize it.) There will also be a remote track for people who aren't in SF, so feel free to sign up wherever you are in the world.

https://x.com/kwindla/status/1839767364981920246

kristopolous · on Oct 2, 2024

Hey, I used to work for you a long time ago in a galaxy far away. Nice to hear from you.

kwindla · on Oct 2, 2024

hassaanr · on Oct 1, 2024

Big +1 here! Also shoutout to the Daily team who helped build this!

myprotegeai · on Oct 1, 2024

Can you say more about how developers will use this? Is the api going to be exposed to participants?

hassaanr · on Oct 1, 2024

The API is exposed now, you can signup at tavus.io, and at the hackathon we’ll be giving credits to build!

heroprotagonist · on Oct 1, 2024

Sooo, are you scouting talent and good ideas with this, or is it the kind of hackathon where people give up rights to any IP they produce?

Not to be rude, but these days it's best to ask.

kabirgoel · on Oct 1, 2024

As someone who's attended events run by Daily/Kwindla, I can guarantee that you’ll have fun and leave with your IP rights intact. :) (In fact, I don't even know that they're looking for talent and good ideas... the motivation for organizing these is usually to get people excited about what you're building and create a community you can share things with.)

kwindla · on Oct 1, 2024

What? No. That’s crazy. (I believe you. I’ve just … never heard of giving up IP rights because you participated in a hackathon.)

This is about community and building fun things. I can’t speak for all the sponsors, but what I want is to show people the Open Source tooling we work on at Daily, and see/hear what other people interested in real-time AI are thinking about and working on.

qfavret · on Oct 2, 2024

+1, speaking for the sponsors, exactly what Kwindla said

gavmor · on Oct 2, 2024

> the kind of hackathon where people give up rights to any IP they produce

Wow, I have been attending public hackathons for over a decade, and I have never heard of something like this. That would be an outrage!

heroprotagonist · on Oct 2, 2024

This happens in corporate hackathons. Especially internal ones dreamed up by mid-to-upper management types who wished they worked at a startup.

I had one employer years ago who did a 24 hour thing with a crappy prize. They invited employees to come and do their own idea or join a team, then grind with minimal sleep for a day straight. Starting on a Friday afternoon, of course, so a few hours were on the company dime while everyone else went home early.

If putting in that extra time and effort resulted in anything good, the company might even try to develop it! The employee who came up with it might even get put on that team!

....people actually attended.

gavmor · on Oct 2, 2024

I don't understand why most companies don't just run sensible, reliable, predictable processes like a Design Sprint when they're looking to break out of a local maximum.