AiN # 22: What are the tokens in LLMs or Foundation Models?
Copyright © LDI and Dr. Chan Naseeb


Welcome to the Augmented Intelligence Newsletter (AiN) by C. Naseeb.

AiN Issue # 22

Hey, in this issue, I explain the key concepts around tokens and their limitations in LLMs.

In recent issues, I wrote about LLMs for time series, the RAG pattern, my reflections and highlights for 2023, and the five pillars of Trustworthy AI: Transparency, Explainability, Fairness, Robustness, and Privacy.


What are tokens?

Tokens can be thought of as pieces of words. They are not cut precisely where words start or end: a token can include trailing spaces and even sub-words. Here are some useful rules of thumb for understanding tokens in terms of length:

  • One token ~= 4 chars in English
  • One token ~= ¾ words
  • 100 tokens ~= 75 words

Or

  • 1-2 sentences ~= 30 tokens
  • 1 paragraph ~= 100 tokens
  • 1,500 words ~= 2048 tokens

There are no hard rules here; these are only guiding principles. Different models also have different token limits.
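To make the rules of thumb above concrete, here is a quick heuristic estimator. This is only a sketch of the approximation, not a real tokenizer; exact counts require the model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the rules of thumb above:
    ~4 characters per token and ~3/4 of a word per token."""
    by_chars = len(text) / 4
    by_words = len(text.split()) * 4 / 3
    # Average the two heuristics; neither is exact on its own.
    return round((by_chars + by_words) / 2)

print(estimate_tokens("Tokens can be thought of as pieces of words."))
```

For ordinary English prose this lands in the same ballpark as the figures above (e.g., roughly 2,000 tokens for 1,500 words), but it is an approximation only.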


Why are they important?

How words are divided into tokens is also language-dependent. For example, 'Come stai' ('How are you' in Italian) takes three tokens for nine characters. A higher token-to-character ratio can make it more costly to use the API for languages other than English.


To study tokenization further, you can use the interactive Tokenizer tool, which estimates the number of tokens and shows how text is broken into them. The exact tokenization process varies between models: newer models like GPT-3.5 and GPT-4 use a different tokenizer than earlier models and may produce different tokens for the same input text. If you'd like to tokenize text programmatically, use tiktoken, a fast BPE tokenizer built for OpenAI models. Other models have their own API endpoints or interfaces; examine their documentation to find out what tokenization details they offer.


Token Limits

Depending on which model you are using, requests can use up to 128,000 tokens shared between the prompt and the completion. Some models, such as GPT-4 Turbo, have separate limits for input and output tokens. Tokens are consumed by the instructions and examples you provide as input, as well as by the output the LLM generates for you.
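Because both prompt and completion draw from the same budget, it can help to check that a prompt leaves room for the output before sending it. A minimal sketch, assuming a hypothetical 4,096-token context window and the rough 4-characters-per-token heuristic (your model's real limit and tokenizer will differ):

```python
MAX_CONTEXT_TOKENS = 4096   # hypothetical model limit; check your model's docs
RESERVED_FOR_OUTPUT = 512   # budget kept free for the completion

def fits_in_context(prompt: str) -> bool:
    """Crude pre-flight check using the ~4 chars/token heuristic.
    A real check would count tokens with the model's tokenizer."""
    estimated = len(prompt) // 4
    return estimated + RESERVED_FOR_OUTPUT <= MAX_CONTEXT_TOKENS

print(fits_in_context("Summarize the following article: ..."))  # True
```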


There are often creative ways to work within the limit, e.g., condensing your prompt or breaking the text into smaller pieces, which we will cover in detail in the next issue.
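As a small preview of the "breaking text into smaller pieces" idea, here is a whitespace-splitting sketch built on the rough 4-characters-per-token heuristic. A real implementation would count tokens with the model's actual tokenizer rather than characters:

```python
def chunk_text(text: str, max_tokens: int = 1000) -> list[str]:
    """Split text into chunks that stay under a rough token budget,
    breaking only on whitespace so words are never cut in half."""
    max_chars = max_tokens * 4  # ~4 chars per token heuristic
    chunks, current, length = [], [], 0
    for word in text.split():
        if length + len(word) + 1 > max_chars and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + 1  # +1 for the joining space
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be processed separately (e.g., summarized) and the results combined.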


Exploring tokens

The API treats words according to their context in the corpus data. Models take the prompt, convert the input into a list of tokens, process those tokens, and convert the predicted tokens back into the words we see in the response.


What appear to us as two identical words may be encoded as different tokens depending on how they are positioned within the text.


In the next article, we'll discuss several techniques to handle the issue of token limitation.

I regularly write and talk about business, technology, digital transformation, and emerging trends.

  1. LinkedIn
  2. Medium
  3. YouTube

Subscribe to this newsletter or click 'Follow' to read my future articles. You'll be able to read the previous issues here. Also, let me know in the comments if you want me to write about a specific topic that interests you.

Enjoy the newsletter! Please help us improve it by sharing it with your network.

Have a nice day! See you soon. - Chan
