Joel Anderson’s Post

Chief Data Science Officer @ Dig Insights

Many people are talking about how LLMs can't count the number of r's in "strawberry." Most people say it's because LLMs aren't good at math, but that's not the issue. To understand the issue, you need to understand how LLMs work and how their internals interact with the task at hand, which comes down to:

1. Tokenization (how LLMs interpret words)
2. Training data

-- Tokenization --

LLMs break words down into tokens. Tokens are small words or parts of larger words, but importantly, they're usually not broken up into individual letters. Why does this matter? It means the model sees a word like "strawberry" as a single unit rather than as the sequence of letters that make up the word, the way we see it. This is efficient for processing (and efficient for training the model). See the screenshot below using the GPT-4 tokenizer, and the code sketch after it.

Interestingly, "strawberry" is a single token when there is a space before it (most of the time), and it is broken into multiple sub-tokens when it either starts a sentence or follows a quotation mark. This tokenization quirk likely adds slightly to the confusion and may be part of why "strawberry" gets picked on. "Blueberry," on the other hand, is broken into "blue" and "berry" tokens in both cases.

-- Training data --

The model still knows "strawberry" is a compound word made up of "straw" and "berry," because that fact appears in its training data many times (one of those things that just comes up in conversation). But there are far too many combinations of words and letters for every letter-counting question to appear in the training data often enough for LLMs to learn it. Note that LLMs still count the letters in many words accurately; that's because they do have some examples in their training data, and they're pretty good at pattern matching to generalize to other words.

[Screenshot: GPT-4 tokenizer output for "strawberry" and "blueberry"]
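A minimal sketch of the tokenization point above, assuming the tiktoken library is installed (pip install tiktoken). It encodes "strawberry" and "blueberry" with and without a leading space and prints the token pieces; the exact splits depend on the encoding version, so treat the output as illustrative rather than definitive.

```python
import tiktoken

# GPT-4 uses the cl100k_base encoding
enc = tiktoken.encoding_for_model("gpt-4")

for text in [" strawberry", "strawberry", " blueberry", "blueberry"]:
    token_ids = enc.encode(text)
    # decode_single_token_bytes shows the raw text each token covers
    pieces = [
        enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
        for t in token_ids
    ]
    print(f"{text!r:15} -> {len(token_ids)} token(s): {pieces}")
```

With a leading space, "strawberry" typically comes back as one token, while the space-free form splits into sub-tokens; the model never sees the individual letters either way, which is the crux of the counting failure.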

Thanks for this, Joel Anderson. It needed to be addressed. Beyond the mechanics of tokenizers and training data, the key point here is that LLMs tend to double down on an incorrect answer once they start down the wrong path, and they can be very assertive while doing so. The strawberry example may seem trivial, but it highlights a deeper issue regarding trust in their outputs. We should also explore the guardrails, research frameworks, and technical stacks needed to use LLMs effectively while mitigating their quirks and shortcomings. This is a fascinating discussion; thank you for keeping it alive!
