AiN # 22: What are the tokens in LLMs or Foundation Models?
Copyright © LDI and Dr. Chan Naseeb


Welcome to the Augmented Intelligence Newsletter (AiN) by C. Naseeb.

AiN Issue # 22

Hey, in this issue, I explain the key concepts around tokens and their limitations in LLMs.

In recent issues, I wrote about LLMs for time series, the RAG pattern, my reflections and highlights for 2023, and the five pillars of Trustworthy AI: Transparency, Explainability, Fairness, Robustness, and Privacy.


What are tokens?

Tokens can be thought of as pieces of words. They are not cut precisely where words start or end: a token can include trailing spaces and even sub-words. Here are some useful rules of thumb for understanding tokens in terms of length:

  • One token ~= 4 chars in English
  • One token ~= ¾ words
  • 100 tokens ~= 75 words

Or

  • 1-2 sentences ~= 30 tokens
  • 1 paragraph ~= 100 tokens
  • 1,500 words ~= 2048 tokens

There are no hard rules here; these are only guiding principles. Different models also have different token limits.
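To make the rules of thumb above concrete, here is a quick heuristic estimator. This is only a sketch of the approximation, not a real tokenizer; exact counts require the model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the rules of thumb above:
    ~4 characters per token and ~3/4 of a word per token."""
    by_chars = len(text) / 4
    by_words = len(text.split()) * 4 / 3
    # Average the two heuristics; neither is exact on its own.
    return round((by_chars + by_words) / 2)

print(estimate_tokens("Tokens can be thought of as pieces of words."))
```

For ordinary English prose this lands in the same ballpark as the figures above (e.g., roughly 2,000 tokens for 1,500 words), but it is an approximation only.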


Why are they important?

How words are divided into tokens is also language-dependent. For example, 'Come stai' ('How are you' in Italian) takes three tokens for nine characters. A higher token-to-character ratio can make it more costly to use the API for languages other than English.


To study tokenization further, you can use the interactive Tokenizer tool, which estimates the number of tokens and shows how text is broken into them. The exact tokenization process varies between models: newer models like GPT-3.5 and GPT-4 use a different tokenizer than earlier models and may produce different tokens for the same input text. If you'd like to tokenize text programmatically, use tiktoken, a fast BPE tokenizer built for OpenAI models. Other models have their own API endpoints or interfaces; examine their documentation to find out what tokenization details they offer.


Token Limits

Depending on which model you are using, requests can use up to 128,000 tokens shared between the prompt and the completion. Some models, such as GPT-4 Turbo, have separate limits for input and output tokens. Tokens are consumed by the instructions and examples you provide as input, as well as by the output the LLM generates for you.
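Because both prompt and completion draw from the same budget, it can help to check that a prompt leaves room for the output before sending it. A minimal sketch, assuming a hypothetical 4,096-token context window and the rough 4-characters-per-token heuristic (your model's real limit and tokenizer will differ):

```python
MAX_CONTEXT_TOKENS = 4096   # hypothetical model limit; check your model's docs
RESERVED_FOR_OUTPUT = 512   # budget kept free for the completion

def fits_in_context(prompt: str) -> bool:
    """Crude pre-flight check using the ~4 chars/token heuristic.
    A real check would count tokens with the model's tokenizer."""
    estimated = len(prompt) // 4
    return estimated + RESERVED_FOR_OUTPUT <= MAX_CONTEXT_TOKENS

print(fits_in_context("Summarize the following article: ..."))  # True
```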


There are often creative ways to work within the limit, e.g., condensing your prompt or breaking the text into smaller pieces, which we will cover in detail in the next issue.
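As a small preview of the "breaking text into smaller pieces" idea, here is a whitespace-splitting sketch built on the rough 4-characters-per-token heuristic. A real implementation would count tokens with the model's actual tokenizer rather than characters:

```python
def chunk_text(text: str, max_tokens: int = 1000) -> list[str]:
    """Split text into chunks that stay under a rough token budget,
    breaking only on whitespace so words are never cut in half."""
    max_chars = max_tokens * 4  # ~4 chars per token heuristic
    chunks, current, length = [], [], 0
    for word in text.split():
        if length + len(word) + 1 > max_chars and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + 1  # +1 for the joining space
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be processed separately (e.g., summarized) and the results combined.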


Exploring tokens

The API treats words according to their context in the corpus data. Models take the prompt, convert the input into a list of tokens, process those tokens, and convert the predicted tokens back into the words we see in the response.


What appear to us as two identical words may be encoded as different tokens depending on how they are positioned within the text.


In the next article, we'll discuss several techniques to handle the issue of token limitation.

I regularly write and talk about business, technology, digital transformation, and emerging trends.

  1. LinkedIn
  2. Medium
  3. YouTube

Subscribe to this newsletter or click 'Follow' to read my future articles. You'll be able to read the previous issues here. Also, let me know in the comments if you want me to write about a specific topic that interests you.

Enjoy the newsletter! Please help us improve it by sharing it with your network.

Have a nice day! See you soon. - Chan
