> If you plug your query into https://platform.openai.com/tokenizer, you can see...

wruza · on March 1, 2024

It doesn’t even matter how many tokens there are, because LLMs are completely ignorant of how their input is structured. They don’t see letters or syllables because they have no “eyes”. The closest analogy with a human is that vocal-ish concepts just emerge in their mind without any visual representation. They can only “recall” how many “e”s are there, but cannot look and count.

alickz · on March 1, 2024

>They can only “recall” how many “e”s are there, but cannot look and count.

Like a blind person?

wruza · on March 1, 2024

My initial analogy was already weak, so I guess there's no point in extending it. The key fact here is that tokens are inputs to what is essentially an overgrown matrix multiplication routine. Everything "AI" happens a few levels of abstraction higher, and is semantically disconnected from the "moving parts".
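
To make the "moving parts" concrete, here is a toy sketch of that first step, assuming a made-up embedding table and weight matrix (the IDs, vectors, and the names `EMBEDDINGS`, `WEIGHTS`, `embed`, and `matmul` are all hypothetical, not taken from any real model). The point is only that the arithmetic receives integer IDs and vectors, never letters.

```python
# Hypothetical token IDs mapped to tiny made-up embedding vectors.
# A real model has tens of thousands of IDs and much wider vectors.
EMBEDDINGS = {
    101: [0.2, -0.1, 0.7],
    102: [0.5, 0.3, -0.4],
}

# Hypothetical 3x2 weight matrix standing in for one layer of the model.
WEIGHTS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, -0.5],
]


def embed(token_ids):
    """Look up one vector per token ID; no characters are involved."""
    return [EMBEDDINGS[t] for t in token_ids]


def matmul(rows, weights):
    """Multiply a list of row vectors by a weight matrix."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
        for row in rows
    ]


hidden = matmul(embed([101, 102]), WEIGHTS)
print(hidden)
```

Nothing in `hidden` records how the original text was spelled, which is why counting letters has to be "recalled" rather than read off the input.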

brewtide · on March 1, 2024

Pre-cogs, I knew it.

redox99 · on March 1, 2024

" egregious" (with a leading space) is the single token. Most lower case word tokens start with a space.

ToValueFunfetti · on March 1, 2024

The number of tokens depends on context; if you just entered 'egregious' it will have broken it into three tokens, but with the whole query it's one.

fuzztester · on March 1, 2024

Why three tokens, not one?

317070 · on March 1, 2024

Without the leading space, it is not common enough as a word to have become a token in its own right. As with the vast majority of lowercase words in OpenAI's tokenizer, you need to start " egregious" with a space character to get the single token.
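
A minimal sketch of that effect, assuming a made-up vocabulary and a greedy longest-match split. This is a simplification: OpenAI's tokenizer uses learned BPE merges, and the sub-word pieces below ("eg", "reg", "ious") are invented for illustration rather than copied from the real tokenizer.

```python
# Hypothetical vocabulary: the word exists as a token only with a leading space.
VOCAB = {" egregious", "eg", "reg", "ious", " "}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become single-char tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("egregious"))   # three pieces under this toy vocabulary
print(tokenize(" egregious"))  # one piece, because of the leading space
```

With this toy vocabulary the bare word falls apart into three pieces, while the space-prefixed form matches a single entry, which mirrors the behavior described above.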