Does this have the ability to edit historic words as more info becomes available?
Eg. If I say "I scream", it sounds phonetically identical to "Ice cream".
Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".
Doing this seems necessary to get both low latency and high accuracy, and things like transcription on Android do it: you can see the guesses adjusting as you talk.
In Australian English, calm rhymes with farm and uses a long vowel, while com uses a short vowel and would rhyme with prom. (I know this doesn't help much because some American accents also rhyme prom with farm).
It's not 'calm' that differs, it's 'common'. Calm like palm, in all major accents.
Traditionally, calm and com- have different vowels in English, but most North American accents merge com- into calm. All other major English accents retain the distinction.
If you're American, try saying 'com' while rounding your lips. Or just listen to a recording of 'common' in an online dictionary from Britain or Australia. (Or lot, pot, spot, etc.)
I've been thinking about this for a minute, and I think if an American were to say "why", and take only the most open vowel sound from that word and put it between "k" and "m", you get a pretty decent Australian pronunciation. I am an Australian so I could be entirely wrong about how one pronounces "why".
Fun fact: I could not work out what this was supposed to be, so I used Whisper (indirectly, via the FUTO Voice Input app on my phone) and repeated the sentence into it, and it came out with the 'correct' transcription of "How to recognize speech using common sense." first time.
Of course, this is nothing like what I actually said, so... make your own mind up whether that is actually a correct transcription or not!
This is what your brain does when it processes language.
I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.
A slight segue to this: I was made aware of the idea that the language you think in sets the constraints on how expansively your brain can think and parse information.
Fortunately I think in English, which is an ever-evolving language, expanding as the world does. That is compared to the majority of people where I'm from; English was a second language they had to learn, and the people who taught them weren't well equipped with the resources to do a good job.
This is called linguistic relativity (née the Sapir-Whorf hypothesis), and the strong form you describe has fallen out of favour in modern linguistics.
A surprising number of monolingual people think their own language is the most adaptable and modern language, but this is obviously untrue. All languages evolve to fit the needs of speakers.
Also, the idea that people "think in language X" is heavily disputed. One obvious counterargument is that most people have experienced being unable to put what they are thinking into words -- if you truly did think in the language you speak, how could this situation happen? My personal experience is that I do not hear any language in my head unless I actively try to think in it (at least, not since I was a teenager).
(This is all ignoring the comments about ESL speakers that I struggle to read as anything but racism. As someone who speaks multiple languages, it astounds me how many people seem to think that struggling to express something in your non-native language means that you're struggling to think and are therefore stupid.)
I think it's more like, you have a thought X, that has so many dimensions to it, but the way you serialize it to something that's actually discussable and comparable to other thoughts is language. And sometimes that language naturally loves slicing one part of that thought one way or the other.
(then there's also a feedback loop type of argument, that always happens when discussing any sort of perception-reality distinction, but let's ignore that for now)
At least for me, my brain is so bad and it's hard for me to truly hold a single thought in my head for a long time. Maybe it eventually settles into my subconscious but I don't really have a way to verify that.
My experience is that sometimes, for example when I watch a lecture in a foreign language, there are terms for which I don't know the correct translation, so I cannot think about or mention them in my native language even though I understand what they mean.
I was more focused on the experience of monolinguals (where this kind of explanation is impossible), but yes I also experience this fairly often as someone who speaks more than one language.
> if you truly did think in the language you speak, how could this situation happen?
As far as how it happens to me: either something closer to speech than to raw thought reports back that the data in shared memory is invalid for the selected language, or I find that no text representation exists for what I am trying to say.
The "raw" thoughts work in the currently active language for me, so at least in my case the strong Sapir-Whorf hypothesis isn't even a hypothesis, just a reasonable verbalization that closely matches my own observations.
I don't get why people can't accept it, even in the age of LLMs. It is what it is, and apparently that old guy can just never be correct, even once.
The way a French friend put it to me when they were learning English through immersion...
I dreamt in English last night! Now I know I can speak the language!
Being monolingual, and then trying to pick up another language later in life, a big struggle is not trying to "map" sentences and structure to what one already knows.
It makes me curious about how human subtitlers or even scriptwriters choose to transcribe intentionally ambiguous speech, puns and narratively important mishearings. It's like you need to subtitle what is heard not what is said.
Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?
It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomenon out of their abstractions, with their claims of a beauty like music... hmm!
The quality of subtitles implies that almost no effort is being put into their creation. Watch even a high budget movie/TV show and be aghast at how frequently they diverge.
Hard disagree. When I'm reading a transcript, I want word-for-word what the people said, not a creative edit. I want the speakers' voice, not the transcriptionist's.
And when I'm watching subtitles in my own language (say because I want the volume low so I'm not disturbing others), I hate when the words I see don't match the words I hear. It's the quickest way I can imagine to get sucked out of the content and into awareness of the delivery of the content.
Sometimes they're edited down simply for space, because there wouldn't be time to easily read all the dialog otherwise. And sometimes repetition of words or phrases is removed, because it's clearer, and the emphasis is obvious from watching the moving image. And filler words like "uh" or "um" generally aren't included unless they were in the original script.
Most interestingly, swearing is sometimes toned down, just by skipping it -- removing an f-word in a sentence or similar. Not out of any kind of puritanism, but because swear words genuinely come across as more powerful in print than they do in speech. What sounds right when spoken can sometimes look like too much in print.
Subtitles are an art. Determining when to best time them, how to split up long sentences, how to handle different speakers, how to handle repetition, how to handle limited space. I used to want subtitles that were perfectly faithful to what was spoken. Then I actually got involved in making subtitles at one point, and was very surprised to discover that perfectly faithful subtitles didn't actually do the best job of communicating meaning.
Fictional subtitles aren't court transcripts. They serve the purpose of storytelling, which is the combination of a visible moving image full of emotion and action, and the subtitles. Their interplay is complex.
Hard and vehemently disagree.
Subtitles are not commentary tracks.
The artists are the writers, voice actors, and everyone else involved in creating the original media. Never, ever should a random stranger contaminate it with his/her opinions or points of view.
Subtitles should be perfect transcriptions or the most accurate translations, never reinterpretations.
And official subtitles aren't made by random strangers. They're made by people who do it professionally.
It's not "contamination" or "opinions", like somebody is injecting political views! And certainly not "reinterpretation". Goodness. It's about clarity, that's all.
Also there's no such thing as the "most accurate" translations. Translations themselves are an art, hugely.
That's the thing though, subtitles aren't intended as full transcripts. They are intended to allow a wide variety of people to follow the content.
A lot of people read slower than they would hear speech. So subtitles often need to condense or rephrase speech to keep pace with the video. The goal is usually to convey meaning clearly within the time available on screen. Not to capture every single word.
If they tried to be fully verbatim, you'd either have subtitles disappearing before most viewers could finish reading them or large blocks of text covering the screen.
Subtitlers also have to account for things like overlapping dialogue, filler words, and false starts, which can make exact transcriptions harder to read and more distracting in a visual medium.
I mean, yeah in your own native language I agree it sort of sucks if you can still hear the spoken words as well. But, to be frank, you are also the minority group here as far as subtitle target audiences go.
And to be honest, if they were fully verbatim, I'd wager you quickly would be annoyed as well. Simply because you will notice how much attention they then draw, making you less able to actually view the content.
I regularly enable YouTube subtitles. Almost always, they are a 100% verbatim transcription, excluding errors from auto-transcription. I am not annoyed in the slightest, and in fact I very much prefer that they are verbatim.
If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.
> If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.
And what are deaf people supposed to do in a cinema, or with broadcast TV?
(And I'm ignoring other uses, e.g. learning a foreign language; for that, sometimes you want the exact words, sometimes the gist, but it's highly situational; but even once you've learned the language itself, regional accents even without vocabulary changes can be tough).
> If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.
That's just tone deaf, plain and simple. I was not talking about myself, or just YouTube. You are not everyone else; your use case is not everyone else's use case. It really isn't that difficult.
Aren't same-language subtitles supposed to be perfect literal transcripts, while cross-language subtitling is supposed to be compressed creative interpretations?
I had similar thoughts when reading Huck Finn. It's not just phonetically spelled, it's much different. Almost like Twain came up with a list of words, and then had a bunch of 2nd graders tell him the spelling of words they had seen. I guess at some point, you just get good at bad spelling?
Except it forces me to slow down to "decipher" the text and makes the reading labored. I understand the point, as it is part of the character, but it is easier to understand someone speaking in that vernacular than to read the forced misspellings. I definitely don't want to get to the point of being good at reading it, though. I wonder if this is how second grade teachers feel reading the class's schoolwork?
That's true. I'm sure Twain and Banks were aware of this, though. Apparently they considered the immersion to be worth a little extra work on the part of the reader. Whether the reader agrees is a different story.
I try to limit my use of it to just enough for my accent and way of talking to bleed through. I don't go for full-on phonetics, but I'm often "droppin' my g's and usin' lotsa regional sayin's." It probably helps that the people I text have the same accent I do, though.
> queue
> The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"
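For reference, a minimal sketch of invoking that filter with a larger queue plus a VAD model. The queue and vad_model option names come from the docs quoted above; the model filenames, input/output paths, and the destination/format options are assumptions, so check your FFmpeg build's documentation before relying on them.

    # A sketch, not a drop-in command: queue and vad_model follow the docs quoted
    # above; model filenames, paths, destination and format are assumptions.
    import subprocess

    afilter = (
        "whisper="
        "model=ggml-base.en.bin"              # path to a ggml Whisper model (placeholder)
        ":language=en"
        ":queue=10"                           # bigger queue: better accuracy, more latency
        ":vad_model=ggml-silero-v5.1.2.bin"   # VAD model to pair with a large queue
        ":destination=out.srt"
        ":format=srt"
    )
    subprocess.run(
        ["ffmpeg", "-i", "input.wav", "-af", afilter, "-f", "null", "-"],
        check=True,
    )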
so if "I scream" is in one chunk, and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!
I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.
The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.
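A toy sketch of that "commit on agreement" idea (not any particular service's implementation): keep N running hypotheses and only fix the prefix they all agree on.

    # Toy sketch: given N hypothesis word lists, only the longest common prefix is
    # committed; everything after it can still be revised by later audio.
    def committed_prefix(hypotheses: list[list[str]]) -> list[str]:
        fixed = []
        for words in zip(*hypotheses):             # i-th word of every hypothesis
            if all(w == words[0] for w in words):  # all N decoders agree on this word
                fixed.append(words[0])
            else:
                break
        return fixed

    hyps = [
        ["ice", "cream", "is", "the", "best"],
        ["ice", "cream", "is", "the", "best"],
        ["ice", "scream", "is", "the", "best"],    # one decoder still unsure
    ]
    print(committed_prefix(hyps))  # ['ice'] -- "cream"/"scream" stays editable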
The right way to do this would be to use longer, overlapping chunks.
E.g. do transcription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording).
This would increase processing requirements significantly, though. You could probably get around some of that with clever use of caching, but I don't think any (open) implementation actually does that.
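Roughly like this, as a sketch: transcribe() stands in for a Whisper call on raw 16 kHz samples, and stitching the window's text onto text that has already scrolled out of the window is left out for brevity.

    # Sketch: every 3 s of new audio, re-decode the trailing 15 s, so guesses inside
    # that window can still change as more context arrives ("I scream" -> "ice cream").
    SAMPLE_RATE = 16000
    STEP = 3 * SAMPLE_RATE     # how often we re-run the model
    WINDOW = 15 * SAMPLE_RATE  # how much recent audio each run sees

    def live_transcribe(audio_chunks, transcribe):
        """audio_chunks yields STEP samples at a time; transcribe() is a stand-in
        for a Whisper call taking a list of samples and returning text."""
        buffer = []
        for chunk in audio_chunks:
            buffer.extend(chunk)
            # Each pass replaces the previous guess for the trailing window.
            yield transcribe(buffer[-WINDOW:])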
I need to get around to cleaning it up but you can essentially alter the number of simultaneous overlapping whisper processes, the chunk length, and the chunk overlap fraction. I found that the `tiny.en` model is good enough with multiple simultaneous listeners to be able to have highly accurate live English transcription with 2-3s latency on a mid-range modern consumer CPU.
If real-time transcription is so bad, why force it to be real-time? What happens if you give it a 2-3 second delay? That's pretty standard in live captioning. I get real-time being the ultimate goal, but we're not there yet. So, working within the current limitations, is piss-poor transcription in real time really more desirable/better than better transcription with a 2-3 second delay?
I don't know of an LLM that does context-based rewriting of interpreted text.
That said, I haven't run into the icecream problem with Whisper. Plenty of other systems fail but Whisper just seems to get lucky and guess the right words more than anything else.
The Google Meet/Android speech recognition is cool but terribly slow in my experience. It also has a tendency to over-correct for some reason, probably because of the "best of N" system you mention.
That's because at the end of the day this technology doesn't "think". It simply holds context until the next thing, without regard for the previous information.
I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."
Whisper supports adding a context, and if you're transcribing a phone call, you should probably add "Transcribe this phone call with Gem", in which case it would probably transcribe more correctly.
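With the open-source whisper Python package, that context goes in via initial_prompt; a small sketch (the file name and prompt text here are just illustrative):

    # Sketch with the openai-whisper package; path and prompt are illustrative only.
    import whisper

    model = whisper.load_model("base.en")
    result = model.transcribe(
        "call_recording.wav",                                   # placeholder path
        initial_prompt="Transcript of a phone call with Gem.",  # biases name spelling
    )
    print(result["text"])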
That's at least as good as a human, though. Getting to "better-than-human" in that situation would probably require lots of potentially-invasive integration to allow the software to make correct inferences about who the speakers are in order to spell their names correctly, or manually supplying context as another respondent mentioned.
When she told me her name, I didn't ask her to repeat it, and I got it right through the rest of the call. Whisper didn't, so how is this "at least as good as a human"?
I wouldn't expect any transcriber to know that the correct spelling in your case used a G rather than a J - the J is far more common in my experience. "Jim" would be an aberration that could be improved, but substituting "Jem" for "Gem" without any context to suggest the latter would be just fine IMO.
I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".
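As a toy illustration of that last point (made-up numbers, nothing like Whisper's real decoder): when the acoustics are a tie, the language-model score is what makes the grammatical reading win.

    # Toy rescoring: both readings sound identical, so acoustic scores tie and the
    # language-model term decides. All scores here are invented for illustration.
    hypotheses = {
        "ice cream is the best dessert": {"acoustic": -4.1, "lm": -9.0},
        "I scream is the best dessert":  {"acoustic": -4.1, "lm": -15.5},
    }
    lm_weight = 0.8
    best = max(hypotheses,
               key=lambda h: hypotheses[h]["acoustic"] + lm_weight * hypotheses[h]["lm"])
    print(best)  # -> "ice cream is the best dessert"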