Why bother testing though? I was hoping this topic has finally died recently, bu...

tkgally · 2025-10-14T01:44:48 1760406288

One reason for testing this is that it might indicate how accurately models can explain natural language grammar, especially for agglutinative and fusional languages, which form words by stringing morphemes together. When I tested ChatGPT a couple of years ago, it sometimes made mistakes identifying the components of specific Russian and Japanese words. I haven’t run similar tests lately, but it would be nice to know how much language learners can depend on LLM explanations about the word-level grammars of the languages they are studying.

Later: I asked three LLMs to draft such a test. Gemini’s [1] looks like a good start. When I have time, I’ll try to make it harder, double-check the answers myself, and then run it on some older and newer models.

[1] https://g.co/gemini/share/5eefc9aed193

gizmo686 · 2025-10-14T03:15:03 1760411703

What you are testing for is fundamentally different than character level text manipulation.

A major optimization in modern LLMs is tokenization. This optimization is based on the assumption that we do not care about character level details, so we can combine adjacent characters into tokens, then train and run the main AI model on smaller strings built out of a much larger dictionary of tokens. Given this architecture, it is impressive that AIs can perform character level operations at all. They essentially need to reverse engineer the tokenization process.

However, morphemes are semantically meaningful, so a quality tokenizer will tokenize at the morpheme level, instead of the word level. [0]. This is of particuarly obvious importance in Japanese, as the lack of spaces between words means that the naive "tokenize on whitespace" approach is simply not possible.

We can explore the tokenizer of various models here: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...

Looking at the words in your example, we see the tokenization of the Gemma model (closely related to Gemini) is:

  un-belie-vably
  dec-entral-ization
  bio-degradable
  mis-understanding
  anti-dis-establishment-arian-ism
  пере-писы-ваться
  pere-pis-y-vat-'-s-ya
  до-сто-примеча-тельность
  do-stop-rime-chat-el-'-nost-'
  пре-по-дава-тель-ница
  бе-зо-т-вет-ственности
  bezotvetstvennosti
  же-лез-нодоро-жный
  z-hele-zn-odoro-zh-ny-y
  食べ-させ-られた-くな-かった
  tab-es-aser-are-tak-unak-atta)
  図書館
  tos-ho-kan
  情報-技術
  j-ō-h-ō- gij-utsu
  国際-関係
  kok-us-ai- kan-kei
  面白-くな-さ-そうだ

Further, the training data that is likely to be relevent in this type of query probably isolates the individual morphemes while talking about a bunch of words that the use them; so it is a much shorter path for the AI to associate these close but not quite morphene tokens with the actual sequence of tokens that corresponds to what we think of as a morphene.

[0] Morpheme level tokenization is itself a non-trivial problem. However, has been pretty well solved long before the current generation of AI.

orbital-decay · 2025-10-14T05:33:04 1760419984

Tokenizers are typically optimized for efficiency, not morpheme separation. Even in the examples above it's not morphemes - proper morpheme separation would be un-believ-ably and дост-о-при-меч-а-тельн-ость.

Regardless of this, Gemini is still one of the best models when it comes for Slavic word formation and manipulation, it can express novel (non-existent) words pretty well and doesn't seem to be confused by wrong separation. This seems to be the result of extensive multilingual training, because e.g. GPT other than the discontinued 4.5-preview and many Chinese models have issues with basic coherency in languages that heavily rely on word formation, despite using similar tokenizers.

tkgally · 2025-10-14T04:05:50 1760414750

Thanks for the explanation. Very interesting.

I notice that that particular tokenization deviates from the morphemic divisions in several cases, including ‘dec-entral-ization’, ‘食べ-させ-られた-くな-かった’, and ‘面白-くな-さ-そうだ.’ ‘dec’ and ‘entral’ are not morphemes, nor is ‘くな.’

curioussquirrel · 2025-10-14T05:16:15 1760418975

Thanks for the explanation and for the tokenizer playground link!

DonHopkins · 2025-10-14T06:51:03 1760424663

inf-ucking-credible

curioussquirrel · 2025-10-14T05:13:37 1760418817

Why test for something? I find it fascinating if something starts being good at task it is "explicitly not designed for" (which I don't necessarily agree with - it's more of a side effect of their architecture).

I also don't agree that nobody is using this for - there are real life use cases today, such as people trying to find meaning of misspelled words.

On a side note, I remember testing Claude 3.7 with the classic "R's in the word strawberry" question through their chat interface, and given that it's really good at tool calls, it actually created a website to a) count it with JavaScript, b) visualize it on a page. Other models I tested for the blog post were also giving me python code for solving the issue. This is definitely already a thing and it works well for some isolated problems.

viraptor · 2025-10-14T07:21:36 1760426496

> such as people trying to find meaning of misspelled words.

That worked just fine for quite a while. There's apparently enough misspelling in the training data, we don't need precise spelling for it. You can literally write drunken gibberish and it will work.

curioussquirrel · 2025-10-14T16:35:36 1760459736

True. But does that scale to less common words? Or to other languages than English?

viraptor · 2025-10-15T14:41:40 1760539300

> The phrase "Pweiz mo cco ejst w sprdku zmi?" appears to be a distorted or misspelled version of a Polish sentence. The closest meaningful phrase in Polish is "Powiedz mi co jest w środku ziemi?" which translates to "Tell me what is inside the Earth?"

I'm not sure I could figure out the mangled words there.

neerajsi · 2025-10-14T01:30:55 1760405455

https://www.anthropic.com/news/analysis-tool

Seems like they already built this capability.

redox99 · 2025-10-14T01:44:58 1760406298

Character level LLMs are used for detecting insults and toxic chat in video games and the like.

minimaxir · 2025-10-14T01:49:29 1760406569

Can you give an example of a video game explicitly using character-level LLMs? There were prototypes of char-rnns back in the day for chat moderation but it has significant compute overhead.

redox99 · 2025-10-14T11:48:41 1760442521

It's something I heard through the grapevine. But there's only a few big enough competitive games where toxicity is such a big deal, so it's not hard to guess.

Character level helps with players disguising insults.

Compute wise it's basically the same, but multiply token count by 4. Which doesn't really matter for short chat in video games.

viraptor · 2025-10-14T07:41:36 1760427696

Yes, for small messages and relatively small scope dictionary, character level will work. But that's very different from what's tested here.

jazzyjackson · 2025-10-14T02:57:50 1760410670

I figure an LLM would be way better at classifying insults than regexing against a bad word list. Why would character level be desirable?

vanviegen · 2025-10-14T05:02:22 1760418142

I'd imagine for simplicity - just skip the tokenizer and feed bytes.

duskwuff · 2025-10-14T06:47:14 1760424434

Might a character-level LLM be better at recognizing poorly spelled (or deliberately misspelled) profanity?

minimaxir · 2025-10-14T01:45:01 1760406301

I made a response to this counterpoint in a blog post I wrote about a similar question posed to LLMs (how many b's are in blueberry): https://news.ycombinator.com/item?id=44878290

> Yes, asking an LLM how many b’s are in blueberry is an adversarial question in the sense that the questioner is expecting the LLM to fail. But it’s not an unfair question, and it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.

It's a subject that the Hacker News bubble and the real world treat differently.

viraptor · 2025-10-14T12:07:10 1760443630

> it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.

I know enough PhDs with heavy dyslexia that... no, there's no connection here. You can be a PhD level physicist without being able to spell anything.

brookst · 2025-10-14T02:15:42 1760408142

It’s like defending a test showing hammers are terrible at driving screws by saying many people are unclear on how to use tools.

It remains unsurprising that a technology that lumps characters together is not great at processing below its resolution.

Now, if there are use cases other than synthetic tests where this capability is important, maybe there’s something interesting. But just pointing out that one can’t actually climb the trees pictured on the map is not that interesting.

achierius · 2025-10-14T02:41:02 1760409662

And yet... now many of them can do it. I think it's premature to say "this technology is for X" when what it was originally invented for was translation, and every capability it has developed since then has been an immense surprise.

brookst · 2025-10-15T02:38:36 1760495916

No, this is not what happened.

What any reasonable person expects in "count occurrences of [letter] in [word]" is for a meta-language skill to kick in and actually look at the symbols, not the semantic word. It should count the e's in thee and the w's in willow.

LLMs that use multi-symbol tokenization won't ever be able to do this. The information is lost in the conversion to embeddings. It's like giving you a 2x2 GIF and asking you to count the flowers: 2x2 is sufficient to determine dominant colors, but not fine detail.

Instead, LLMs have been trained on the semantic facts that "strawberry has three r's" and other common tests, just like they're trained that the US has 50 states or motorcycles have two wheels. It's a fact stored in intrinsic knowledge, not a reasoning capability over the symbols the user input (which the actual LLM never sees).

It's not a question of intent or adaptation, it's an information theory principle just like the Nyquist frequency.

vanviegen · 2025-10-14T05:09:12 1760418552

> And yet... now many of them can do it.

Presumably because they trained them to death on this useless test that people somehow just wouldn't shut up about.

minimaxir · 2025-10-14T05:26:00 1760419560

Which is why in the linked post, I test models against both the "r's in strawberries" and the "b's in blueberries" to see if that is the case.

tl;dr the first case had near perfect accuracy as expected for the case if the LLMs were indeed trained on it. The second case did not.

IncreasePosts · 2025-10-14T01:28:52 1760405332

Wouldn't a llm that just tokenized by character be good at it?

typpilol · 2025-10-14T06:03:27 1760421807

I asked this in another thread and it would only be better with unlimited compute and memory.

Because without those, then the llm has to encode way more parameters and way smaller context windows.

In a theoretical world, it would be better, but might not be much better.

curioussquirrel · 2025-10-14T05:22:51 1760419371

Yes, but it would hurt its contextual understanding and effectively reduce the context window several times.

viraptor · 2025-10-14T09:50:08 1760435408

Only in the current most popular architectures. Mamba and RWKV style LLMs may suffer a bit but don't get a reduced context in the same sense.

curioussquirrel · 2025-10-14T16:37:13 1760459833

You're right. There was also an experiment in Meta which tokenized bytes directly and it didn't hurt performance much in very small models.

MountDoom · 2025-10-14T02:06:46 1760407606

I remember people making the exact same argument about asking LLMs math questions back when they couldn't figure out the answer to 18 times 7. "They are text token predictors, they don't understand numbers, can we put this nonsense to rest."

The whole point of LLMs is that they do more than we suspected they could. And there is value in making them capable of handling a wider selection of tasks. When an LLM started to count the numbers of "r"s in "strawberry", OpenAI was taking a victory lap.

viraptor · 2025-10-14T07:27:15 1760426835

They're better at maths now, but you still shouldn't ask them maths questions. Same as spelling - whether they improve or not doesn't matter if you want a specific, precise answer - it's the wrong tool and the better it does, the bigger the trap of it failing unexpectedly.

vanviegen · 2025-10-14T05:05:00 1760418300

> When an LLM started to count the numbers of "r"s in "strawberry", OpenAI was taking a victory lap.

Were they? Or did they feel icky about spending way to much post-training time on such a specific and uninteresting skill?

ACCount37 · 2025-10-14T12:25:04 1760444704

It's not as specific of a skill as you would think. Being both aware of tokenizer limitations and capable of working around them is occasionally useful for real tasks.

_flux · 2025-10-14T12:35:19 1760445319

What tasks would those be, that wouldn't be better served by using e.g. a Python script as a tool, possibly just as component of the complete solution?

ACCount37 · 2025-10-14T13:08:27 1760447307

Off the top of my head: the user wants LLM to help him solve a word puzzle. Think something a bit like Wordle, but less represented in its dataset.

For that, the LLM needs to be able to compare words character by character reliably. And to do that, it needs at least one of: be able to fully resolve the tokens to characters internally within one pass, know to emit the candidate words in a "1 character = 1 token" fashion and then compare that, or know that it should defer to tool calls and do that.

An LLM trained for better tokenization-awareness would be able to do that. The one that wasn't could fall into weird non-humanlike failures.

_flux · 2025-10-14T13:24:39 1760448279

Surely there are algorithms to more effectively solve Wordles, and many other word puzzles, than LLMs? LLMs could stil be in the loop for generating words: LLM proposes words, deterministic algorithm tell the score according to the rules of the puzzle, or even augment the list by searching adjacent word space; then at some point LLM submits the guess.

Given wordle words are real words, I think this kind of loop could fare pretty well.

ACCount37 · 2025-10-14T14:57:32 1760453852

Your mistake is thinking that the user wants an algorithm that solves Wordles efficiently. Or that making and invoking a tool is always a more efficient solution.

As opposed to: the user is a 9 year old girl, and she has this puzzle in a smartphone game, and she can't figure out the answer, and the mom is busy, so she asks the AI, because the AI is never busy.

Now, for a single vaguely Wordle-like puzzle, how many tokens would it take to write and invoke a solver, and how many to just solve it - working around the tokenizer if necessary?

If you had a batch of 9000 puzzle questions, I can easily believe that writing and running a purpose specific solver would be more compute efficient. But if we're dealing with 1 puzzle question, and we're already invoking an LLM to interpret the natural language instructions for it? Nah.

_flux · 2025-10-15T07:10:23 1760512223

> Your mistake is thinking that the user wants an algorithm that solves Wordles efficiently. Or that making and invoking a tool is always a more efficient solution.

Weird how you tell that user is not worried about solving the problem efficiently so we might just as well use LLM directly for it, and go to saying how creating a tool might not be efficient either..

And as we know, LLMs are now very good at character-level problems, but are relatively good at making programs; in particular ones for problems we already know of. LLMs might be able to solve Wordles today with straight-up guessing by just adding spaces between the letters and using their very wide vocabulary, but can LLMs solve e.g. word search puzzles at all?

As you say, if there are 9000 puzzle questions, then a solver is a natural choice to due compute efficiency. But it will also answer the question, and do it without errors (here I'm overstating LLM's abilities a bit though; this would certainly not hold true to novel problems). No "Oh what sharp eyes you have! I'll address the error immdiately!" responses from the solver are to be expected, and actually unsolvable puzzles will be identified, not "lied" about. So why not use the solver even for a single instance of the problem?

I think the (training) effort would be much better on teaching LLMs when they should use an algorithm and when they should just use the model. Many use cases are much less complicated and even more easily solved algorithmically than word puzzle solvers as well; they might be e.g. sorting lists by a certain criteria (the list may be augmented by LLM-created additional data first), and for this task as well I'd rather use a deterministic algorithm than one driven by neural networks and randomness.

E.g. Gemini, Mistral and ChatGPT can do this already in some cases: if I ask them to "Calculate sum of primes between 0 and one million.", it looks like all of them created a piece of code to calculate it. Which is exactly what they should do. (The result was correct.)

ACCount37 · 2025-10-15T14:01:45 1760536905

What LLMs are "good at" is kind of up to us. No fundamental reason why they can't be trained for better character manipulation capabilities, among many other things.

There are always tasks that are best solved through direct character manipulation - as there are tasks that are best solved with Python code, constraint solvers or web search. So add one more teachable skill to the pile.

Helps that we're getting better at teaching LLMs skills.